EticaAI / HXL-Data-Science-file-formats

Common file formats used for Data Science and language localization exported from (and to) HXL (The Humanitarian Exchange Language)
https://hdp.etica.ai/
The Unlicense

`urnresolver`: Uniform Resource Names - URN Resolver #13

Open fititnt opened 3 years ago

fititnt commented 3 years ago


[Screenshot from 2021-03-05 09-36-02]


"A Uniform Resource Name (URN) is a Uniform Resource Identifier (URI) that uses the urn scheme. URNs are globally unique persistent identifiers assigned within defined namespaces so they will be available for a long period of time, even after the resource which they identify ceases to exist or becomes unavailable.[1] URNs cannot be used to directly locate an item and need not be resolvable, as they are simply templates that another parser may use to find an item." -- Wikipedia

To reference datasets (temporary internal name: hdataset) from different groups (temporary internal name: hsilo), it makes sense to have some way to standardize naming. URNs, even if complicated to implement in practice, could at least serve as a hint that keeps humans from using whatever creative idea they have at the moment. (This matters even more if we implement localized translations as part of the [meta issue] hxlm #11, with exact equivalence between translations.)

fititnt commented 3 years ago

Because of this topic, we will need to create some sort of local vault for permanent storage.

[Screenshot from 2021-03-05 15-58-25]

One idea for the urn:data: namespace: while more complex namespaces may do whatever they want (including using full Unicode), we could have some base functionality so that a query like urnresolver urn:data:un:locode, if the files already exist on the local computer, returns the exact URI "$HOME/.config/hxlm/urn/data/un/locode/locode.csv" instead of returning an error and suggesting the documentation at https://unece.org/trade/uncefact/unlocode.

In fact, instead of returning an error (unless the user forces one), I think urnresolver could return ANOTHER URN (like urn:data-i:un:locode), and that URN would resolve to information such as https://unece.org/trade/uncefact/unlocode. This could help with direct usage from the command line.

Note: I know that urn:data:un:locode could "ideally" be something like urn:data:un:unece:locode, but "UN/LOCODE" is so well known that it could be worth allowing some kinds of aliases.
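
A minimal Python sketch of that lookup, under stated assumptions: the directory layout is inferred from the single example path above, the function names and the `DOCS` table are hypothetical, and the fallback returns the documentation URL directly instead of going through a second informational URN.

```python
from pathlib import Path
from typing import Optional

# Assumed default vault location, taken from the example path in this thread.
DEFAULT_BASE = Path.home() / ".config" / "hxlm" / "urn"

# Hypothetical documentation fallbacks; only this one entry comes from the thread.
DOCS = {"urn:data:un:locode": "https://unece.org/trade/uncefact/unlocode"}


def local_candidate(urn: str, base: Path = DEFAULT_BASE) -> Path:
    """Map urn:data:un:locode to <base>/data/un/locode/locode.csv."""
    scheme, nid, *parts = urn.split(":")
    if scheme.lower() != "urn" or not parts:
        raise ValueError(f"not a usable URN: {urn!r}")
    return base.joinpath(nid, *parts, parts[-1] + ".csv")


def resolve(urn: str, base: Path = DEFAULT_BASE, strict: bool = False) -> Optional[str]:
    """Return the local file if present, else the documentation URL (or None)."""
    path = local_candidate(urn, base)
    if path.is_file():
        return str(path)
    if strict:
        # The caller asked for an error instead of a hint.
        raise FileNotFoundError(path)
    return DOCS.get(urn)
```

With `strict=False` a missing file yields a documentation hint rather than a failure, which is the behavior argued for above.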

fititnt commented 3 years ago

This is the current result. The baseline URN processing strategy (likely the "organization" inside an already namespaced country/territory) could accept either a single identifier or, since I'm not sure most people in the middle of an emergency would agree on one, a domain name itself.

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ ./tests/test_core_urn.py
DataUrnHtype(value='urn:data--i:un:locode') {'nid': 'data', 'nid_attr': 'i', 'bpgp': 'un', 'bpln': 'locode', 'nss': 'un:locode'}
DataUrnHtype(value='URN:DATA--I:UN:LOCODE') {'nid': 'data', 'nid_attr': 'i', 'bpgp': 'URN', 'bpln': 'DATA--I', 'nss': 'URN:DATA--I:UN:LOCODE'}
DataUrnHtype(value='urn:data:un:locode') {'nid': 'data', 'nid_attr': 'd', 'bpgp': 'un', 'bpln': 'locode', 'nss': 'un:locode'}
DataUrnHtype(value='urn:data--i:xz:hxlcplp:fod:bool') {'nid': 'data', 'nid_attr': 'i', 'bpgp': 'xz', 'bpln': 'hxlcplp', 'nss': 'xz:hxlcplp:fod:bool'}
DataUrnHtype(value='urn:data:br:__saude.gov.br__:covid-19-vacinacao') {'nid': 'data', 'nid_attr': 'd', 'bpgp': 'br', 'bpln': 'saude.gov.br', 'bpln_isdn': True, 'nss': 'br:__saude.gov.br__:covid-19-vacinacao'}
DataUrnHtype(value='urn:data--i:cn:__中国.icom.museum__:test') {'nid': 'data', 'nid_attr': 'i', 'bpgp': 'cn', 'bpln': '中国.icom.museum', 'bpln_isdn': True, 'nss': 'cn:__中国.icom.museum__:test'}
DataUrnHtype(value='urn:data--i:ru:__россия.иком.museum__:test') {'nid': 'data', 'nid_attr': 'i', 'bpgp': 'ru', 'bpln': 'россия.иком.museum', 'bpln_isdn': True, 'nss': 'ru:__россия.иком.museum__:test'}
DataUrnHtype(value='urn:data--i:eg:__مصر.icom.museum__:test') {'nid': 'data', 'nid_attr': 'i', 'bpgp': 'eg', 'bpln': 'مصر.icom.museum', 'bpln_isdn': True, 'nss': 'eg:__مصر.icom.museum__:test'}
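
For readers without the repository at hand, here is a rough Python sketch that reproduces the fields printed above. It is not the `DataUrnHtype` implementation: the function name is hypothetical, and the meaning of `bpgp` and `bpln` is inferred only from the output. One deliberate difference: the sketch treats the `urn` scheme and the NID case-insensitively (as RFC 8141 specifies), so uppercase input is not parsed the way the second output line shows.

```python
def parse_data_urn(value: str) -> dict:
    """Split a urn:data[--attr]:... string into the fields shown in the test output."""
    # Raises ValueError when there are fewer than three colon-separated parts.
    scheme, nid_raw, nss = value.split(":", 2)
    if scheme.lower() != "urn":
        raise ValueError(f"not a URN: {value!r}")

    # "data--i" -> nid "data", attribute "i"; plain "data" defaults to "d".
    nid, _, attr = nid_raw.lower().partition("--")

    parts = nss.split(":")
    bpln = parts[1] if len(parts) > 1 else ""
    result = {"nid": nid, "nid_attr": attr or "d", "bpgp": parts[0], "bpln": bpln}

    # A second part wrapped in double underscores is treated as a domain name.
    if len(bpln) > 4 and bpln.startswith("__") and bpln.endswith("__"):
        result["bpln"] = bpln[2:-2]
        result["bpln_isdn"] = True

    result["nss"] = nss
    return result
```

Calling `parse_data_urn('urn:data:br:__saude.gov.br__:covid-19-vacinacao')` yields the same dictionary as the corresponding line of the test run.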

The idea is for urnresolver to work from the most common URNs whenever an already prepared dataset exists on a path available on the local filesystem. Even if implementers do not create something very specific for the country or the organization, the default strategy would at least give people working with datasets some place to put the files.

If the default is good enough, then even though documentation could always require humans to translate manually, the default resolver could at least make documentation directly usable!

fititnt commented 3 years ago

The current version of HXL-Data-Science-file-formats is v0.7.3.

[Screenshot from 2021-03-08 10-12-13]

I think the tools for URN resolving deserve a separate group from the HXL2 topic. In fact, URN resolving would often be applied to content that is not HXLated yet, or would deal with issues that also relate to HXL, such as very sensitive content (for example, how to name URLs that may be protected just by randomness, not by access control?).

fititnt commented 3 years ago

We already have a proof of concept of the command line interface for the HXLm internals related to URN parsing, so we can at least mark this topic as having a proof of concept.

The version will be updated to v0.7.4 soon, so in the worst-case scenario (for example, we at EticaAI/HXL-CPLP don't go forward with this, but people want to reuse the work several years later) it is possible to look around this date. In any case, everything is dedicated to the public domain.

As for the current "official" release, only v0.7.0 was published. The problem is that there are so many new features since v0.7.0 that I would not even know where to start documenting without breaking things into completely separate projects. At the moment it is not 100% clear what is worth focusing on and evolving beyond early drafts (even if they are usable).

Relation with the [meta issue] hxlm #11

While most features of the HXLm (temporary name) library help use HXL as a common format to convert from/to common data science formats, urnresolver, if it has some minimal functionality, could still be used without keeping up with the latest version of this full repository. Also, if this project is not broken into smaller parts by EticaAI or the HXL-CPLP initiative and it starts to be used in production, it may make sense to just extract the smaller parts, even if that means renaming classes to avoid confusion.

But if it is actually used, I think it is worth paying more attention to how URN index files are documented (the file formats, the way to share them in private, etc.) than to the code itself that converts them to usable URLs.

But I still think it is very, very important that any implementation, at a bare minimum, allows converting a URN string to a URL or local path that could be used by HXL command line tools (or, for solid files, such as encrypted or compressed ones, by any other tool on the machine).

Note: one workaround to convert URNs to URLs would be to prefix the URNs with http:// and override the /etc/hosts file to point to some server that de facto redirects to the real URL. Also, while urnresolver is designed to be usable by a human on a local computer, it can actually be used to convert URNs for others inside an already secured intranet. Maybe, with the libhxl-python CLI tools, it could even become a private provider of data (either by retrieving remote resources or from local files on disk).

fititnt commented 3 years ago

Dynamic Delegation Discovery System (DDDS) & Name Authority Pointer (NAPTR)

Context

We're building urnresolver on static files, but what I imagined some time ago could be implemented with DNS turns out not only to be nothing new, but to have been planned more than 20 years ago. The Dynamic Delegation Discovery System (DDDS) seems very focused on regexes, so as long as URNs are somewhat standardized, the approach could be reused.
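
To make the regex focus concrete, here is a small Python sketch that applies a NAPTR-style substitution field (delimiter, pattern, delimiter, replacement, delimiter, flags) to a URN. It is only an illustration: the function name and the `example.org` rule are hypothetical, escaped delimiters are not handled, and Python's `re` syntax stands in for the POSIX extended regular expressions that DDDS specifies.

```python
import re
from typing import Optional


def apply_naptr_regexp(field: str, urn: str) -> Optional[str]:
    """Apply a '!pattern!replacement!flags' rule; return None if it does not match."""
    if not field:
        raise ValueError("empty substitution field")
    delim = field[0]  # the first character defines the delimiter
    pieces = field[1:].split(delim)
    if len(pieces) != 3:
        raise ValueError(f"expected pattern, replacement and flags in {field!r}")
    pattern, replacement, flags = pieces

    re_flags = re.IGNORECASE if "i" in flags else 0
    if re.search(pattern, urn, re_flags) is None:
        return None
    # Backreferences such as \1 in the replacement refer to groups in the pattern.
    return re.sub(pattern, replacement, urn, count=1, flags=re_flags)


# Hypothetical rule: rewrite any urn:data:un:<name> to a CSV on example.org.
RULE = r"!^urn:data:un:(.+)$!https://example.org/un/\1.csv!i"
```

Because the rule is just a string, the same table of rules could in principle live in a static file today and in DNS records later.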

And about private/restricted access

Maybe I missed it, but the experimental RFCs related to the Dynamic Delegation Discovery System (DDDS) did not mention any way to authenticate the user (or at least to avoid leaking information).

Air gapped

Maybe what we could do is have a public endpoint and also allow private resolvers. But instead of overcomplicating things, we just optimize the whole thing to work air-gapped (meaning a local cache, handling data copied and pasted from the external world into the internal network, etc.) and allow even a local resolver to work from static files alone.

Both public and private

The intermediate case (a resolver that works locally and then falls back to a public resolver) may be tricky in the context of privacy.

For example, if we at Etica.AI actually release a public endpoint, the way the internet works means that when a user makes a request, it may travel across the entire internet and hit both CloudFlare and GitHub. Even if we don't keep this request, and even though all the information is actually public, this could leak the fact that someone requested something.

If we eventually release a public endpoint, it could make sense to look for sites equivalent to GitHub in other regions of the world.
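
One way to keep that leak under the user's control is to make the public fallback strictly opt-in. The Python sketch below assumes a hypothetical `resolve_chain` helper and represents each resolver as a plain function from a URN to a location or `None`; none of these names come from the repository.

```python
from typing import Callable, Iterable, Optional

# A resolver takes a URN and returns a path/URL, or None when it has no answer.
Resolver = Callable[[str], Optional[str]]


def resolve_chain(
    urn: str,
    local: Iterable[Resolver],
    remote: Iterable[Resolver] = (),
    allow_network: bool = False,
) -> Optional[str]:
    """Try local (static file) resolvers first; contact remote ones only on opt-in."""
    for resolver in local:
        hit = resolver(urn)
        if hit is not None:
            return hit
    if not allow_network:
        # Air-gapped default: a miss stays a miss and nothing leaves the machine.
        return None
    for resolver in remote:
        hit = resolver(urn)
        if hit is not None:
            return hit
    return None
```

With the default `allow_network=False`, the remote resolvers are never called, so a local miss cannot reveal to anyone that the URN was requested.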


Edit: Gitee (a popular alternative in zh-CN and zh-TW) has this article explaining that it is possible to host Pages: https://gitee.com/help/articles/4136. It even explains how to copy/migrate content from GitHub. This URL also seems to explain what could violate the rules and get a site banned (not sure whether on Gitee or another service): https://gitee.com/o2team/Taro/pages. It seems very similar to hosting anywhere else, including the point about respecting national laws and regulations.

fititnt commented 3 years ago

Humm... it's something.

[Screenshot from 2021-04-23 02-42-02]

OK, this is not that useful yet since we don't actually save a cache on disk. But since there are at least 3 ways to process URNs:

I think that anything that does not fit well in standalone shell or Python scripts should go inside the hdp-toolchain instead of growing into 1000-line scripts.

fititnt commented 3 years ago

I believe we should add a few more parameters to the urn.yml files. All the other options are new (most of them are not yet implemented as parameters of the current urnresolver CLI tool to expose these features, but this could be done soon).

One change is that source is now fontem.

The urn.yml format

New format

Example 1

# Trivia:
#  - "fontem"
#    - https://en.wiktionary.org/wiki/fons#Latin
#  - "auxilium"
#    - https://en.wiktionary.org/wiki/auxilium#Latin
#  - "dēscrīptiōnem"
#    - https://en.wiktionary.org/wiki/descriptio#Latin
#  - "explānandum"
#    - https://en.wiktionary.org/wiki/explano#Latin

- urn: "urn:data:xz:hxl:standard:core:hashtag"
  descriptionem:
    eng-Latn: HXL/CSV version of the HXL Standard core hashtags.
  auxilium:
    - https://data.humdata.org/dataset/hxl-core-schemas
  fontem:
    - ontologia/codicem/hxl/standard/core/hashtag.hxl.csv
    - https://proxy.hxlstandard.org/data.csv?dest=data_edit&strip-headers=on&url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI%2Fpub%3Fgid%3D319251406%26single%3Dtrue%26output%3Dcsv
    - https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/pub?gid=319251406&single=true&output=csv

Example 2

- urn: "urn:data:xz:eticaai:ontologia:codicem:anatomiam:terminologia-anatomica"
  descriptionem:
    eng-Latn: >
      Table with code references for body parts, in special
      Terminologia Anatomica (TA). Can be used with other ontologies and
      to transform for a few natural languages descriptions.
  explanandum:
    # Good references:
    - +v_fipat_ta2
    - +v_fipat_ta98_id
    - +v_fipat_ta98_latin
    # Generic references:
    - +v_wikidata
    - +v_fi_yso
    - +v_fr_universalis
    - +v_it_bncf
    - +v_jp_ndl
    - +v_uberon
    - +v_uk_britannica
    - +v_us_jstor
    - +v_us_mag
    - +v_us_mesh
    - +v_us_umls_cui
  auxilium:
    - https://github.com/HXL-CPLP/forum/issues/44
    - https://www4.unifr.ch/ifaa/Public/EntryPage/TA98%20Tree/HelpPage/TA98%20Latin%20Page%20Help.pdf
  exemplum:
    # Since terminologia-anatomica.hxl.csv is 1.8 MB, we only deploy a sample
    - ontologia/codicem/anatomiam/terminologia-anatomica-EXEMPLUM.hxl.csv
  fontem:
    # run ontologia/codicem/anatomiam/make.sh to get terminologia-anatomica.hxl.csv
    # or let the urnresolver download from live URNs
    - ontologia/codicem/anatomiam/terminologia-anatomica.hxl.csv
    - https://proxy.hxlstandard.org/data/b02a5f/download/HXL_CPLP-FOD_medicinae-legalis_humana-corpus.csv
    - https://docs.google.com/spreadsheets/d/10axnLpDNtAc8Bh921dz5XPXCwo0FUXRcKS6-ermiu5w/edit#gid=1622293684

Old format


# URNResolver v1.2.1
# hdp-toolchain v0.8.7.2

# @see https://data.humdata.org/dataset/hxl-core-schemas
- urn: "urn:data:xz:hxl:standard:core:hashtag"
  source:
    - ontologia/codicem/hxl/standard/core/hashtag.hxl.csv
    - https://proxy.hxlstandard.org/data.csv?dest=data_edit&strip-headers=on&url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI%2Fpub%3Fgid%3D319251406%26single%3Dtrue%26output%3Dcsv
    - https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/pub?gid=319251406&single=true&output=csv
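
As a hedged illustration of how a consumer could read both formats, the Python sketch below accepts either `fontem` or the older `source` key and returns the first usable entry. The entries are written as plain dictionaries to avoid a YAML dependency; the function names and the "first usable entry wins" rule are assumptions, not the documented behavior of urnresolver.

```python
from pathlib import Path
from typing import Optional

# The same entry in the old and the new format (values copied from the examples).
OLD_ENTRY = {
    "urn": "urn:data:xz:hxl:standard:core:hashtag",
    "source": [
        "ontologia/codicem/hxl/standard/core/hashtag.hxl.csv",
        "https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/pub?gid=319251406&single=true&output=csv",
    ],
}
NEW_ENTRY = {"urn": OLD_ENTRY["urn"], "fontem": list(OLD_ENTRY["source"])}


def sources_of(entry: dict) -> list:
    """Prefer the new 'fontem' key, fall back to the old 'source' key."""
    return list(entry.get("fontem") or entry.get("source") or [])


def pick_source(entry: dict, base: Path, allow_network: bool = True) -> Optional[str]:
    """Return the first listed source that is usable right now, in file order."""
    for candidate in sources_of(entry):
        if candidate.startswith(("http://", "https://")):
            if allow_network:
                return candidate
            continue
        path = base / candidate  # relative paths are resolved against `base`
        if path.is_file():
            return str(path)
    return None
```

Listing the local path first, as both examples do, means a prepared file on disk takes priority over the remote copies.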

fititnt commented 3 years ago

Added reverse search, like urnresolver -?? +v_iso15924 or urnresolver -?? country+code+v_iso2.

I believe we will need to build some table hinting that some codes, like country+code+v_iso2, could also mean other variants of ISO 3166-1. This could help automate searching for what something means.

# fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ urnresolver -?? +v_iso15924
urn:data:xz:eticaai:ontologia:codicem:linguam
# fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ urnresolver -?? country+code+v_iso2
urn:data:xz:eticaai:ontologia:codicem:locum
# fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ urnresolver --urn-explanandum-list
urn:data:xz:eticaai:ontologia:codicem:anatomiam:terminologia-anatomica  +v_fipat_ta2
urn:data:xz:eticaai:ontologia:codicem:anatomiam:terminologia-anatomica  +v_fipat_ta98_id
urn:data:xz:eticaai:ontologia:codicem:anatomiam:terminologia-anatomica  +v_fipat_ta98_latin
urn:data:xz:eticaai:ontologia:codicem:anatomiam:terminologia-anatomica  +v_wikidata
urn:data:xz:eticaai:ontologia:codicem:anatomiam:terminologia-anatomica  +v_fi_yso
urn:data:xz:eticaai:ontologia:codicem:anatomiam:terminologia-anatomica  +v_fr_universalis
urn:data:xz:eticaai:ontologia:codicem:anatomiam:terminologia-anatomica  +v_it_bncf
urn:data:xz:eticaai:ontologia:codicem:anatomiam:terminologia-anatomica  +v_jp_ndl
urn:data:xz:eticaai:ontologia:codicem:anatomiam:terminologia-anatomica  +v_uberon
urn:data:xz:eticaai:ontologia:codicem:anatomiam:terminologia-anatomica  +v_uk_britannica
urn:data:xz:eticaai:ontologia:codicem:anatomiam:terminologia-anatomica  +v_us_jstor
urn:data:xz:eticaai:ontologia:codicem:anatomiam:terminologia-anatomica  +v_us_mag
urn:data:xz:eticaai:ontologia:codicem:anatomiam:terminologia-anatomica  +v_us_mesh
urn:data:xz:eticaai:ontologia:codicem:anatomiam:terminologia-anatomica  +v_us_umls_cui
urn:data:xz:eticaai:ontologia:codicem:sexum:binarium    +v_iso5218
urn:data:xz:eticaai:ontologia:codicem:sexum:binarium    +v_iso5218_extended
urn:data:xz:eticaai:ontologia:codicem:sexum:binarium    +v_fipat_ta98_latin
urn:data:xz:eticaai:ontologia:codicem:sexum:hl7 +v_iso5218
urn:data:xz:eticaai:ontologia:codicem:sexum:hl7 +v_iso5218_extended
urn:data:xz:eticaai:ontologia:codicem:sexum:hl7 +v_us_cdc_sex
urn:data:xz:eticaai:ontologia:codicem:sexum:hl7 +v_un_icao_sex
urn:data:xz:eticaai:ontologia:codicem:sexum:hl7 +v_us_NAACCR
urn:data:xz:eticaai:ontologia:codicem:sexum:hl7 +v_us_census_sex
urn:data:xz:eticaai:ontologia:codicem:sexum:non-binarium    +lat_codices_anonyma
urn:data:xz:eticaai:ontologia:codicem:sexum:non-binarium    +v_iso5218_extended
urn:data:xz:eticaai:ontologia:codicem:linguam   +v_iso15924
urn:data:xz:eticaai:ontologia:codicem:locum country+code+v_iso2
urn:data:xz:eticaai:ontologia:codicem:locum country+code+v_iso3
urn:data:xz:eticaai:ontologia:codicem:locum +v_hrinfo_country
urn:data:xz:eticaai:ontologia:codicem:locum +v_reliefweb
urn:data:xz:eticaai:ontologia:codicem:locum country+code+v_reliefweb
# fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ urnresolver -? urn:data:xz:hxl:standard:core:attribute
[
    {
        "urn": "urn:data:xz:hxl:standard:core:attribute",
        "descriptionem": {
            "eng-Latn": "HXL/CSV version of the HXL Standard core attributes."
        },
        "auxilium": [
            "https://data.humdata.org/dataset/hxl-core-schemas"
        ],
        "fontem": [
            "ontologia/codicem/hxl/standard/core/hashtag.hxl.csv",
            "https://proxy.hxlstandard.org/data.csv?dest=data_view&url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI%2Fpub%3Fgid%3D1810309357%26single%3Dtrue%26output%3Dcsv&strip-headers=on",
            "https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/pub?gid=1810309357&single=true&output=csv"
        ],
        "urnref": "urnresolver-default.urn.yml"
    }
]
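
The reverse search in the session above can be sketched in Python as a lookup over the explanandum lists. The index below is a hand-copied subset of the `--urn-explanandum-list` output; the `RELATED` hint table and the function name are hypothetical and only illustrate the "other variants of ISO 3166-1" idea.

```python
from typing import Dict, List, Optional

PREFIX = "urn:data:xz:eticaai:ontologia:codicem:"

# Subset of the explanandum listing printed by the real tool.
EXPLANANDUM: Dict[str, List[str]] = {
    PREFIX + "linguam": ["+v_iso15924"],
    PREFIX + "locum": ["country+code+v_iso2", "country+code+v_iso3", "+v_reliefweb"],
    PREFIX + "sexum:binarium": ["+v_iso5218", "+v_iso5218_extended"],
}

# Hypothetical hint table: terms that may stand for another variant of a standard.
RELATED: Dict[str, List[str]] = {"country+code+v_iso3": ["country+code+v_iso2"]}


def reverse_search(
    term: str,
    index: Dict[str, List[str]] = EXPLANANDUM,
    related: Optional[Dict[str, List[str]]] = None,
) -> List[str]:
    """Return every URN whose explanandum lists `term` (or one of its related terms)."""
    wanted = {term, *(related or {}).get(term, [])}
    return [urn for urn, terms in index.items() if wanted & set(terms)]
```

Passing `related=RELATED` widens the match to the hinted variants, while the default keeps the exact-match behavior.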

sabas commented 3 years ago

👋 Thank you!