geolexica / geolexica-server

Generalized backend for Geolexica sites

Implement RDF profile for ISO terms #2

Open ronaldtse opened 4 years ago

ronaldtse commented 4 years ago

This is to implement proper RDF (TTL, JSON-LD) for ISO terms in Geolexica.

ronaldtse commented 4 years ago

Ping @lanemk . Thanks!

lanemk commented 4 years ago

Ron, Here are the latest files. Let me know any issues you have.

rdf-profile.zip

lanemk commented 4 years ago

Ron, I labeled the 46.jsonld file incorrectly. It should be 47.jsonld. Maybe you filled in that gap already. I'm attaching 47.jsonld. 47.zip

ronaldtse commented 4 years ago

Thank you @lanemk ! I'm working on them now...

lanemk commented 4 years ago

@ronaldtse I don't know if you can use these. Maybe. geolexica-terms.zip

ronaldtse commented 4 years ago

Thanks @lanemk , I can certainly use them, but will need to generalize this list into a template because the term index has to be dynamically generated...

lanemk commented 4 years ago

@ronaldtse Here are updated files. I fixed things, hopefully. Simplified rdf-profile.ttl and made 47.ttl a little more generic. rdf-profile-20191110.zip

lanemk commented 4 years ago

@ronaldtse Here are updated files. geolexica-skos-20191127.zip

lanemk commented 4 years ago

@ronaldtse Here are updated files. The geolexica-pages template has more fields, maybe too many. Welcoming questions here. geolexica-20191128.zip

lanemk commented 4 years ago

@ronaldtse I mistakenly sent you an empty geolexica-pages file. Here's the data-filled version. geolexica-pages.zip

ronaldtse commented 4 years ago

@lanemk I am a bit confused:

Now we have language-codes.ttl and country_codes_ap.ttl. Do we need to serve those files? Or can we point elsewhere for those codes? The language codes are ISO 639-2 codes, and country codes are ISO 3166 codes.

In language-codes.ttl, it goes:

:EST-Estonian
  rdf:type geolexica:Language_Code ;
  skos:altLabel "eesti, eesti keel"@en ;
  skos:notation "EST" ;
  skos:prefLabel "Estonian"@en ;
...

The altLabel here is clearly in Estonian (et), not en. And there are also two labels here in Estonian -- don't we need to separate them?

In RDF, do we use title case like prefLabel or underscores like language_code?

What are things like Japanese_Node for?

:Japanese_Node
  rdf:type :Language_Node ;
  dcterms:modified "2017-10-19"^^xsd:anyURI ;
  dcterms:title "ISO/TC211関連JIS用語集" ;
  schema:mainEntityOfPage "https://www.geolexica.org/registers/#language-jpn"^^xsd:anyURI ;
  rdfs:label "Japanese Language Node" ;
.

This for example is the "Japanese term registry".
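On the Estonian altLabel question above, here is a minimal sketch of how a comma-joined label string could be split into one `skos:altLabel` per label, each carrying the label's own language tag. The function name `split_alt_labels` is hypothetical, and the sketch assumes labels contain no commas or double quotes of their own.

```python
from __future__ import annotations


def split_alt_labels(raw: str, lang: str) -> list[str]:
    """Return one Turtle predicate-object line per comma-separated label."""
    # Assumption: a comma always separates two labels and never occurs inside one.
    labels = [part.strip() for part in raw.split(",")]
    return [f'skos:altLabel "{label}"@{lang} ;' for label in labels if label]


# "et" is the ISO 639-1 code for Estonian.
for line in split_alt_labels("eesti, eesti keel", "et"):
    print(line)
```

The same split would apply to any other altLabel that packs several names into one literal.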

ronaldtse commented 4 years ago

Also, linked-data-api was just something I saw another service use, it is of no consequence.

ronaldtse commented 4 years ago

@lanemk a few more questions:

  1. We used to use rdf:type of skos:Concept but now schema:ItemPage. What is the difference?

  2. geolexica-ap:authoritative_source geolexica-ap:ISO_19132_2007 is used to indicate the source. However, we clearly cannot manually enumerate all the sources. Do we need a separate TTL file that lists out all the sources? Or can these sources remain as strings for now?

  3. What do these do? dcterms:identifier "geolexica-ap:empty_field" geolexica-ap:conceptURI geolexica-ap:Concept_URI ;

  4. geolexica-ap:date_accepted "2019-11-28"^^xsd:date ;
    geolexica-ap:date_amended "2019-11-28"^^xsd:date ;

Why not:

  dcterms:dateAccepted "{{ concept.date_accepted | date: "%F" }}" ;
  dcterms:modified "{{ concept.date_amended | date: "%F" }}" ;

?

5.

  geolexica-ap:example_n geolexica-ap:empty_field ;
  geolexica-ap:note_n geolexica-ap:empty_field ;

There could be many examples and notes. Can we have:

  geolexica-ap:example "Example 1. Car" ;
  geolexica-ap:example "Example 2. Truck" ;
  geolexica-ap:note "A vehicle can travel on ground." ;

6.

  geolexica-ap:review_date "2019-11-28"^^xsd:date ;
  geolexica-ap:review_decision geolexica-ap:accepted ;
  geolexica-ap:review_decision_date "2019-11-28"^^xsd:date ;
  geolexica-ap:review_decision_notes geolexica-ap:empty_field ;
  geolexica-ap:review_indicator geolexica-ap:empty_field ;
  geolexica-ap:review_type geolexica-ap:supersession ;

There could be multiple "reviews" leading to multiple "review notes". How do we handle them?

  7. What is this? geolexica-ap:term_synonym geolexica-ap:synonym ;

  8. Why do we put the identifier as empty?

  dcterms:identifier "geolexica-ap:empty_field" ;

  9. Do we really need these?
  geolexica-ap:country_code <http://www.fao.org/countryprofiles/geoinfo/geopolitical/resource/geopolitical.owl#Sweden> ;
  geolexica-ap:country_code <http://www.fao.org/countryprofiles/geoinfo/geopolitical/resource/geopolitical.owl#United_Kingdom_of_Great_Britain_and_Northern_Ireland__the> ;
  geolexica-ap:language_code <https://www.geolexica.org/api/language-codes#SPA-Spanish_Castilian> ;
  geolexica-ap:language_code <https://www.geolexica.org/api/language-codes#SWE-Swedish> ;
  geolexica-ap:language_node geolexica-ap:Arabic_Node ;
  geolexica-ap:language_node geolexica-ap:Bahasa_Node ;

Or do we only need them if the term contains those languages (and that the countries utilize the term)?

lanemk commented 4 years ago

@ronaldtse I have responses for you, in bold text.

Now we have language-codes.ttl and country_codes_ap.ttl. Do we need to serve those files? Or can we point elsewhere for those codes? The language codes are ISO 639-2 codes, and country codes are ISO 3166 codes. ... The altLabel here is clearly in Estonian (et), not en. And there are also two labels here in Estonian -- don't we need to separate them? In RDF, do we use title case like prefLabel or underscores like language_code? What are things like Japanese_Node for? ...

Alas, I struggled with the translations, codes, and the registry. I tried to line everything up, so codes could be used across the Concept pages and the Registers. I now believe this is a muddied effort, and I will work to simplify matters. I believe I’m confusing Country code (in the data definition) with Language code (which should only be a reference, for instance, to language-tag skos:prefLabel, skos:definition, or perhaps Term_Abbreviation).

I tried to find standards-based, api-accessible country codes: https://datahub.io/core/country-codes

The same with language codes: https://datahub.io/core/language-codes

This data can be accessed through an API, or CSV files can be downloaded and transformed into RDF (TTL, JSON-LD) for use in Geolexica. Do you have a preference for how to proceed?

lanemk a few more questions:

We used to use rdf:type of skos:Concept but now schema:ItemPage. What is the difference?

“schema:ItemPage” is a predefined class (a new class in geolexica-ap), used to represent the Concept template page itself. The ItemPage is a compendium of all relevant info about one concept (term), a container of sorts. ItemPage is a “concept”, but not really for the current SKOS framework. This leads to this…. All glossary terms (concepts) are now serialized as “skos:Concept(s)” in the accompanying geolexica-terms.ttl file. Hopefully these can be referenced in the API to serve “rdfs:label” or “skos:prefLabel” values on a given template page.

geolexica-ap:authoritative_source geolexica-ap:ISO_19132_2007 is used to indicate the source. However, we clearly cannot manually enumerate all the sources. Do we need a separate TTL file that lists out all the sources? Or can these sources remain as strings for now?

Authoritative source as a string is fine (as in the data definition anyway). To address the number of sources, I was trying to preserve that URL, or any URL, for accessing the standards documentation. A separate RDF file of these sources may be appropriate. Just import the geolexica-ap to align with the ontology. I can work on that.

What do these do? dcterms:identifier "geolexica-ap:empty_field" geolexica-ap:conceptURI geolexica-ap:Concept_URI ;

dcterms:identifier "geolexica-ap:empty_field" is a predefined property and dummy value for geolexica-ap property “termID”. I’m merely reusing “dcterms:identifier” from the Dublin Core namespace. While I favor reusing standard metadata, the local “geolexica-ap:termID” can be used just as well. To note, “dcterms:identifier” is also used for “termID” in geolexica-terms.ttl. An owl:sameAs property can be applied to termID/identifier to match them in RDF space.

geolexica-ap:conceptURI is the “property” seeking values from the class geolexica-ap:Concept_URI. Concept_URI is intended to link to the Concept page by URL. This is essentially self-referential within the template, and is inconsequential. It can be removed.

geolexica-ap:date_accepted "2019-11-28"^^xsd:date ; geolexica-ap:date_amended "2019-11-28"^^xsd:date ;

Why not: dcterms:dateAccepted "{{ concept.date_accepted | date: "%F" }}" ; dcterms:modified "{{ concept.date_amended | date: "%F" }}" ;

I think your code is fine. I entered the actual dates as dummy values, placeholders.
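For reference, the Liquid filter `date: "%F"` renders a date as `YYYY-MM-DD`. A minimal sketch of the equivalent typed literal follows. Appending `^^xsd:date` is an assumption carried over from the placeholder values above, since the suggested template lines leave the literal untyped. The helper name `xsd_date_literal` is hypothetical.

```python
from datetime import date


def xsd_date_literal(value: date) -> str:
    """Format a date as a Turtle literal typed as xsd:date."""
    # isoformat() yields YYYY-MM-DD, the same shape as strftime's %F.
    return f'"{value.isoformat()}"^^xsd:date'


accepted = date(2019, 11, 28)
print(f"dcterms:dateAccepted {xsd_date_literal(accepted)} ;")
```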

geolexica-ap:example_n geolexica-ap:empty_field ; geolexica-ap:note_n geolexica-ap:empty_field ;

There could be many examples and notes. Can we have: geolexica-ap:example "Example 1. Car" ; geolexica-ap:example "Example 2. Truck" ; geolexica-ap:note "A vehicle can travel on ground." ;

I can rename these properties as “example” and “note”. There can be as many as desired, much like skos:definition(s), without language tags. A language tag (i.e., geolexica-ap:note "A vehicle can travel on ground."@en) can be applied to examples and notes as needed.

“geolexica-ap:empty_field” is just a placeholder (a dummy value).
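The repeatable `example` and `note` properties described above could be generated along these lines. This is only a sketch: the helper names are hypothetical, the property names follow the proposal in this thread, and only single-line strings are handled.

```python
from __future__ import annotations


def turtle_string(text: str) -> str:
    """Quote a single-line string as a Turtle literal, escaping \\ and "."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def repeated_property(predicate: str, values: list[str], lang: str | None = None) -> list[str]:
    """Emit one predicate-object line per value, with an optional language tag."""
    suffix = f"@{lang}" if lang else ""
    return [f"{predicate} {turtle_string(v)}{suffix} ;" for v in values]


lines = repeated_property("geolexica-ap:example", ["Example 1. Car", "Example 2. Truck"])
lines += repeated_property("geolexica-ap:note", ["A vehicle can travel on ground."], lang="en")
print("\n".join(lines))
```

An empty list produces no lines at all, so no `empty_field` placeholder is needed when a concept has no examples or notes.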

geolexica-ap:review_date "2019-11-28"^^xsd:date ; geolexica-ap:review_decision geolexica-ap:accepted ; geolexica-ap:review_decision_date "2019-11-28"^^xsd:date ; geolexica-ap:review_decision_notes geolexica-ap:empty_field ; geolexica-ap:review_indicator geolexica-ap:empty_field ; geolexica-ap:review_type geolexica-ap:supersession ;

There could be multiple "reviews" leading to multiple "review notes". How do we handle them?

I could create “review notes” enumerations within its respective class (i.e., geolexica-ap:review_decision_notes). Or it might be specified as a repeatable string value.

What is this? geolexica-ap:term_synonym geolexica-ap:synonym ;

Term Synonyms are specified in the data definition and geolexica-ap:term_synonym is the property to deliver values of this class. Once in a while they are showing up as strings in the dataset.

Why do we put the identifier as empty? dcterms:identifier "geolexica-ap:empty_field" ;

"geolexica-ap:empty_field" is a placeholder, used when I had no values to draw from. “geolexica-ap:termID” may win out over “dcterms:identifier”. They are both unique identifiers. “termID” matches the local ontology.

Do we really need these? geolexica-ap:country_code http://www.fao.org/countryprofiles/geoinfo/geopolitical/resource/geopolitical.owl#Sweden ; geolexica-ap:language_node geolexica-ap:Bahasa_Node ;

Or do we only need them if the term contains those languages (and that the countries utilize the term)?

I need to simplify the language and country nodes and codes. I’m taking a chainsaw to a much more delicate problem. Let me think about it and present you with a solution.

lanemk commented 4 years ago

@ronaldtse Hi Ron, Here's my progress.

I did a redux on language and country codes. I included only those present in the data definition. This can be extended, if needed.

In the language code file, you will find reference to skos data, e.g.,

:CHI-Chinese
  rdf:type geolexica:LanguageCode ;
  skos:altLabel "中文 (Zhōngwén)"@zh ;
  skos:altLabel "汉语, 漢語"@zh ;
  skos:notation "CHI" ;
  skos:prefLabel "Chinese"@en ;

So, when this is referenced in the geolexica-pages file: e.g., "geolexica-ap:language_code https://www.geolexica.org/api/language-codes#CHI-Chinese ;", the heart of the matter (properties/values) is in "language-codes", whether it's TTL or JSON-LD. I also included a direct string label in the geolexica-pages file, e.g., "geolexica-ap:langCode "CHI"@en ;", so they are there if you find them convenient.

You will find the geolexica-pages file stripped of many dummy values now in favor of placeholder data. I tried to align datatypes where dynamic code, presumably, can fill in values appropriately. Let me know how this is working out. I hope it meets your needs.

I'm curious how you're accessing all "skos:definition" and "skos:prefLabel" values in different languages for the geolexica-pages file. Do these values need to be in an RDF dataset? I guess I'm a little confused about your method. Any insight is welcome.

The geolexica-terms file now includes a view of all glossary terms as skos Concept(s), and geolexica-ap:GlossaryTerm(s). This includes terms' "skos:prefLabel" in English, and "geolexica-ap:termID", and the superfluous "dcterms:identifier", which you can ignore.

The geolexica-ap file is now expanded to match (or, map) fields from the data definition to RDF/SKOS classes/concepts. You will find classes with or without instance data, depending on enumerations, or any other datatyped value. It will depend on the class, and related properties that reference values.

I regret I have a mix of camel case and "_" separation for naming my properties and classes. I wanted to make them readable, but I also want to be consistent. I suppose it's a matter of preference. I can rename everything to camel case if you'd like. But they should work just fine, there is no strict rule in RDF/SKOS. Just match "resources" as named to dynamic values.

I think that is most everything. I look forward to your feedback.

geolexica-20191205.zip

lanemk commented 4 years ago

@ronaldtse Ron, here are concept terms in English. Is this a dataset more like you're looking for? I can do this for the remaining languages.

concepts_english.zip

lanemk commented 4 years ago

@ronaldtse Hi Ron, I hope this finds you well. I've been wrestling the yaml format into rdf. My apologies for taking so long. It was a real grudge match at times. But in any case, I picked concept #10 to work out a page template. The template now seems to have space for all attribute values in all languages. The "10-original-term.ttl" file ought to give you an idea how it looked right after yaml conversion. I expanded this to the full "10-template" file, in Turtle and JSON-LD. The application profile is now updated as well, "geolexica-rdf". Thoughts? Thanks! geolexica-rdf.zip

lanemk commented 4 years ago

@ronaldtse Ron, I offer some updates...mostly cleaning up and simplifying the RDF profile, and the concept page template. Let me know if you have questions.

  1. The file geolexica-rdf.ttl is now geolexica-ap.ttl. Apologies for jumping around, but that file name will stick. I also settled on the default namespace: https://www.geolexica.org/api/rdf. For concept pages, the default namespace is https://www.geolexica.org/concepts/term, and it imports the rdf file.
  2. The file 10-template.ttl is a somewhat grandiose language template. I'm trying to account for every possible value on attributes. On the simpler side, try the trimmed down version: concept-template.ttl.
  3. All files are in .ttl and jsonld.

~cheers geolexica-ap-and-template.zip

ronaldtse commented 4 years ago

Thanks @lanemk ! Sorry for the less than rapid responses, but I will try to get these implemented before the new year :wink:

The way I’m doing it isn’t quite working since it’s more of a hack, but should be able to transition to a proper approach using a Ruby library as a Jekyll plugin.

lanemk commented 4 years ago

Thanks @ronaldtse, absolutely no worries. I only hope I can give you something you can work with. I've been studying up on Ruby in the meantime. So, I'll be curious on the approach, and maybe I can chip in, who knows. ~cheers!

lanemk commented 4 years ago

@ronaldtse Hi Ron...I hope all's well. Not sure if you're familiar with Ruby-RDF. These are, in their words, "Public domain libraries for RDF & SPARQL in the Ruby programming language." In other words, readers & writers of many types. I think you may find them useful. -- cheers! https://github.com/ruby-rdf https://ruby-rdf.github.io/

ronaldtse commented 4 years ago

Thanks @lanemk ! Ping @skalee do you have time to handle this?

skalee commented 4 years ago

OK @ronaldtse.

skalee commented 4 years ago

@ronaldtse I'm trying to understand what your request is about, but without much success so far.

ronaldtse commented 4 years ago

@skalee can you Skype me?

ronaldtse commented 4 years ago

@skalee is going to take care of this. Any updates so far?

lanemk commented 4 years ago

@ronaldtse @skalee No updates. Let me know if there are questions about the SKOS/RDF.

skalee commented 4 years ago

@lanemk I got two questions.

  1. I think I have spotted an inconsistency in term URIs. The 10-template.ttl has:

<https://www.geolexica.org/concepts/10/#> rdf:type skos:Concept ;

whereas concept-template.ttl has:

<https://www.geolexica.org/concepts/term#10> a skos:Concept ;

Note the …/concepts/10/# vs …/concepts/term#10 difference. Please clarify which one is correct.

  2. Our YAML files have three letter language codes whereas Turtle files have two letter codes (e.g. eng vs en). I suppose this is correct, right?

skalee commented 4 years ago

@lanemk @ronaldtse Got another couple of questions:

  1. Values of many triples, e.g. <https://www.geolexica.org/concepts/term#10> geolexica:authoritative_source " "., are currently taken from the English data, that is, they are not localized. Is that correct, given that the YAML file defines this information separately for each language (though typically duplicated), as in the following example?

---
term: admitted term
termid: 10
eng:
  id: 10
  term: admitted term
  authoritative_source:
    ref: ISO 1087-1:2000
    clause: 3.4.16, modified — the Note 1 to entry has been added.
    link: https://www.iso.org/standard/20057.html
ara:
  id: 10
  term: مصطلح معترف به
  authoritative_source:
    ref: ISO 1087-1:2000
    clause: 3.4.16, modified — the Note 1 to entry has been added.
    link: https://www.iso.org/standard/20057.html

  2. Currently if some property is undefined in YAML file, it is skipped in RDF as well. For example skos:altLabel, which is seldom defined. Is it correct?

ronaldtse commented 4 years ago

@lanemk could you help answer questions 1 and 2?

  1. Please treat the YAML data as correct as it is. Some of the data is not translated properly but that's not our problem and we don't have to fix them.

  2. Yes, correct. If they are not defined, don't include in the RDF (as well as JSON).

Thanks!
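The rule confirmed above (properties that are undefined in the YAML are left out of the RDF) can be sketched as follows. The function `turtle_for_concept` and the flat predicate-to-value dict are illustrative assumptions, not the actual generator, and language tags are omitted for brevity.

```python
from __future__ import annotations


def turtle_for_concept(uri: str, properties: dict) -> str:
    """Serialize a concept, skipping any property whose value is missing."""
    present = []
    for predicate, value in properties.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            if v is None or v == "":
                continue  # undefined in the YAML: emit no triple at all
            escaped = str(v).replace("\\", "\\\\").replace('"', '\\"')
            present.append(f'  {predicate} "{escaped}"')
    body = " ;\n".join(["  a skos:Concept"] + present)
    return f"<{uri}>\n{body} .\n"


print(turtle_for_concept(
    "https://www.geolexica.org/concepts/10",
    {"skos:prefLabel": "admitted term", "skos:altLabel": None},
))
```

Because missing values are skipped before serialization, the output never needs a dummy such as `geolexica-ap:empty_field`.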

skalee commented 4 years ago

@lanemk @ronaldtse ping, clarification needed regarding above questions.

ronaldtse commented 4 years ago

Ping @lanemk , thanks!

ronaldtse commented 4 years ago

@skalee regarding 2-char vs 3-char language codes (2.), the answer is this:

https://listserv.loc.gov/cgi-bin/wa?A2=ind1407&L=BIBFRAME&P=853

In BCP 47 (aka RFC 5646) it is not a question of “preference” in terms of which language code to use. It has been a long-standing policy in the previous RFC’s concerning language coding (RFC4646 and its predecessors) that if using that specification you use the 2-character language code from ISO 639-1 if there is one as the primary subtag. If there is not you use the 3-character one from ISO 639-2. The ISO 639-1 list has essentially been frozen for many years and includes far fewer language codes (because obviously there are fewer combinations when using 2 characters) than ISO 639-2 (or 3).

Rebecca (former chair of the ISO 639 Joint Advisory Committee)

Since RDF refers to BCP47 in its "language tag", we should use a 2-char ISO 639-1 code if it exists, otherwise a 3-char ISO 639-2 code.
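A minimal sketch of that rule, converting the three-letter codes used in the YAML files into RDF language tags. The mapping table is a small illustrative subset (an assumption about which languages matter here), and the function name is hypothetical.

```python
# Partial ISO 639-2 -> ISO 639-1 table; a real one would cover every language in use.
ISO_639_2_TO_1 = {
    "eng": "en",
    "ara": "ar",
    "jpn": "ja",
    "spa": "es",
    "swe": "sv",
    "est": "et",
    "chi": "zh",  # ISO 639-2/B code for Chinese
    "zho": "zh",  # ISO 639-2/T code for Chinese
}


def bcp47_primary_subtag(code3: str) -> str:
    """Prefer the 2-letter ISO 639-1 code; fall back to the 3-letter code."""
    code3 = code3.lower()
    return ISO_639_2_TO_1.get(code3, code3)


print(bcp47_primary_subtag("eng"))
print(bcp47_primary_subtag("fil"))  # Filipino has no ISO 639-1 code
```

Note that the fallback also fires for any language simply missing from this partial table, so a production table would need to be complete.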

skalee commented 4 years ago

@ronaldtse, I still need clarification in one aspect. Which one of these URIs is correct:

The 10-template.ttl has:

<https://www.geolexica.org/concepts/10/#> rdf:type skos:Concept ;

whereas concept-template.ttl has:

<https://www.geolexica.org/concepts/term#10> rdf:type skos:Concept ;

ronaldtse commented 4 years ago

@lanemk any recommendation regarding the question above?

If we can choose any, let's go with the first one without the #, i.e.

<https://www.geolexica.org/concepts/10> rdf:type skos:Concept ;
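
If the form without `#` is adopted (it is only a tentative choice above), both template variants could be normalized to it. The sketch below assumes term IDs are numeric. The function name `canonical_concept_uri` is hypothetical.

```python
import re

BASE = "https://www.geolexica.org/concepts/"

# Accepts .../concepts/term#10, .../concepts/10/# and .../concepts/10
_PATTERN = re.compile(re.escape(BASE) + r"(?:term#(\d+)|(\d+)/?#?)")


def canonical_concept_uri(uri: str) -> str:
    """Rewrite either template variant to BASE + term ID, with no fragment."""
    match = _PATTERN.fullmatch(uri)
    if match is None:
        raise ValueError(f"not a recognized concept URI: {uri}")
    term_id = match.group(1) or match.group(2)
    return f"{BASE}{term_id}"


print(canonical_concept_uri("https://www.geolexica.org/concepts/10/#"))
print(canonical_concept_uri("https://www.geolexica.org/concepts/term#10"))
```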

lanemk commented 2 years ago

@ronaldtse Here's a draft of the Geolexica ontology (one ontology, 3 file formats) which addresses most if not all issues to model the concepts. I will have some clarifying questions later, but I invite you to take a look.

Next I'll work out a draft of an individual concept (i.e., </concepts/2>) in SKOS-RDF, JSON-LD, and an abbreviated full data set to see what they should look like with the above ontology. From there, SPARQL queries ought to retrieve the data.

I look forward to your questions/comments.

geolexica-ontology-DRAFT-20221901.zip

lanemk commented 2 years ago

@ronaldtse Here's a draft of an individual concept (</concepts/6/>) in RDF/XML, JSON-LD, and TTL, and an abbreviated full data set with only 5 concepts for now. I also fine-tuned the ontology and it is attached as well, to support the other files. Next, I'll cook up some SPARQL queries to slice and dice the data set. Please bear in mind I focused solely on the MLGT Glossary, since it is large and varied. I'm reasonably confident it solves all the problems. I'll know better after the SPARQL phase. --Mike

geolexica-ontology-DRAFT-20222001.zip mlgt-concepts-DRAFT-20222001.zip