ESIPFed / sweet

Official repository for Semantic Web for Earth and Environmental Terminology (SWEET) Ontologies

initial mapping of SWEET realms to GCMD science keywords #225

Closed by brandonnodnarb 1 year ago

brandonnodnarb commented 3 years ago

Partially addresses #159. Per the discussion at the most recent semtech meeting, massive PRs of this nature should be broken up by group (realm, material, etc.). As such, I've created a draft mapping of GCMD science keywords to SWEET realms (all realm* files).

There are 212 candidates, all of which are skos:exactMatch. As the ttl file is sparse, please also see the results spreadsheet.

I used the following method: create a key-value pair for each SWEET IRI and its rdfs:label via SPARQL; create a key-value pair for each GCMD IRI and its skos:prefLabel via SPARQL. Then:

for each (k1, v1) pair in SWEET
  for each (k2, v2) pair in GCMD
    if sequenceMatch(v1, v2) >= 0.90:
      record k1, v1, match%, k2, v2, GCMD_def
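
The loop above can be sketched in Python roughly as follows. This is a minimal sketch, not the code that was actually run: it assumes "sequenceMatch" means something like the standard library's `difflib.SequenceMatcher` ratio, and the function name `candidate_matches`, the example.org IRIs, and the case-folding of labels are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def candidate_matches(sweet_labels, gcmd_labels, gcmd_defs=None, threshold=0.90):
    """Yield candidate mappings between two {IRI: label} dictionaries.

    Each result is (sweet_iri, sweet_label, score, gcmd_iri, gcmd_label, gcmd_def).
    Labels are lower-cased before comparison (an assumption made here so that
    upper-case keyword labels can match lower-case ones).
    """
    gcmd_defs = gcmd_defs or {}
    for k1, v1 in sweet_labels.items():
        for k2, v2 in gcmd_labels.items():
            # ratio() is 2*M/T: M matched characters, T total characters in both strings
            score = SequenceMatcher(None, v1.lower(), v2.lower()).ratio()
            if score >= threshold:
                yield k1, v1, round(score, 3), k2, v2, gcmd_defs.get(k2, "")


if __name__ == "__main__":
    # Hypothetical IRIs, for illustration only.
    sweet = {"http://example.org/sweet/Adsorption": "adsorption"}
    gcmd = {"http://example.org/gcmd/1": "ABSORPTION"}
    for row in candidate_matches(sweet, gcmd):
        print(row)
```

With this measure, "adsorption" and "absorption" share 9 of 10 characters in order and score exactly 0.9, which is why a pair like that passes the threshold and has to be removed by hand.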

I did it this way out of sheer curiosity, and the results were not as horrid as I would have thought. I manually went through and removed the obvious false positives due to syntax: adsorption != absorption, for example, which still meets the 90% similarity threshold. I also removed false positives due to the GCMD definition. A simple example is SWEET:rift valley, which has (IIRC) three matches to GCMD. Two of them are fine and you'll find them in the mapping file, but the third has a definition that includes text about mid-ocean ridge. On the face of it, that's fine, but as the SWEET term is situated in realmLandTectonic, the candidate mapping was removed.

There are a host of other issues which can be addressed in due course, assuming at least some of you agree the results aren't garbage. :)

I tagged a bunch of you in hopes at least 3 would be able to have a look. It opens fine for me in Protege. I did not test TopBraid.

graybeal commented 3 years ago

I reviewed the whole spreadsheet quickly; it was interesting on several fronts. Sorry that I'm not providing this as changes to code; I don't have time for that at the moment.

brandonnodnarb commented 3 years ago

Thanks @graybeal

> Multiple identical GCMD names with different definitions. Mostly this was OK, but one SHORELINES definition was not, referring to the "lines on a map". That is a bad GCMD definition in this context (a map is not a part of the land realm) and it should be removed by them. One SEA ICE definition was likewise just inappropriate.

Good catch. I will remove the relationship in the code and line out those rows in the spreadsheet.

> The Lead definitions also were very different, and neither of them was a very good definition. So I don't know whether you should match none, one, the other, or both; maybe not both because inferencing would create some disconnect (a 'concentration' is the same as an 'element'?!). On the other hand, we've long said SWEET isn't about precision…

Hmm. They don't seem wrong; they just aren't... precise. However, we could certainly change the link to skos:closeMatch or even skos:related.

> I think the definitions would be another good source for our definition pile (technical term), if we keep going down that path.

Yeah, I forgot to mention that in my original comment. I wasn't sure where we landed on that one.

tbs1979 commented 3 years ago

@brandonnodnarb are these the mappings for GCMD that you need some feedback on?

brandonnodnarb commented 3 years ago

Apologies for missing your comment, @tbs1979. Yes, this be they.

brandonnodnarb commented 2 years ago

This is now stale as GCMD has had several version releases since this was done. I will investigate GCMD's concept and URI deprecation policy. If they do not cull URIs then these should still be valid, although potentially not comprehensive.

Either way, this should be re-run, probably in a more sustainably reproducible way, e.g. by adding the workflow/code to the sweet-tools repo.

brandonnodnarb commented 1 year ago

Based on a response to my question on this issue, it is my understanding that GCMD removes concepts, including their URIs, when they are deprecated. The HTTP status code for each GCMD URI in this file is still 200. As such, I believe these mappings are still valid.
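
A status check of this kind could be scripted along the following lines. This is a rough sketch using only the Python standard library, not the check that was actually performed: the names `http_status` and `find_stale_uris` are hypothetical, and it assumes the server answers HEAD requests the same way it answers GET. Note that `urlopen` follows redirects, so a 200 here means the final target responded with 200.

```python
import urllib.error
import urllib.request


def http_status(uri, timeout=10.0):
    """Return the HTTP status code for a HEAD request, or None if unreachable."""
    req = urllib.request.Request(uri, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is still informative
        return err.code
    except OSError:
        # URLError, timeouts, and connection failures all derive from OSError
        return None


def find_stale_uris(uris, fetch_status=http_status):
    """Return {uri: status} for every URI whose status is not 200."""
    stale = {}
    for uri in uris:
        status = fetch_status(uri)
        if status != 200:
            stale[uri] = status
    return stale
```

The `fetch_status` parameter is there so the check can be exercised without network access, for example by passing a dictionary's `get` method in place of the real fetcher. An empty result from `find_stale_uris` would mean every mapped URI still resolves.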

I will add it to the agenda of the next STC meeting to discuss before I merge this PR.

brandonnodnarb commented 1 year ago

Per the discussion at the previous STC meeting, this is good to go.

Need to add an issue to the sweet-tools repo to automate checking GCMD URIs.