EBISPOT / ols4

Version 4 of the EMBL-EBI Ontology Lookup Service (OLS)
http://www.ebi.ac.uk/ols4/
Apache License 2.0
39 stars 18 forks source link

CURIE literals #167

Closed jamesamcl closed 1 year ago

jamesamcl commented 1 year ago

This kind of thing is less than ideal:

Screenshot 2023-01-12 at 02 21 08

Unfortunately that's how they are represented in the ontology. The OLS3 API (and our reimplementation of it) already deals with this by loading the obo mapping file from prefixes to database URLs. In OLS3 this happens as part of the dataload; in the OLS4 reimplementation the API server does it at query time.

We should either map them to URLs in the OLS4 API, or move the mapping to the dataload level, adding the URLs using an annotator in owl2json. I don't have any strong opinions about this, other than that they should be links in the OLS4 UI.

matentzn commented 1 year ago

We have had a huge amount of work over the last year about this, libraries being build and more..

My initial thought is this: we basically do not know what the prefix refers to. Ever. ICD can refer to ICD10 or ICD9. There is a lot of conflicts (@cthoyt can sing us a song). So you will have to choose: make a best guess for what these curies mean and use the canonical redirects (through a bioregistry context for example) - or ignore them (leave as is). I think both are valid. Protege is doing a mix: curate a handful of high confidence prefixes and render those as links (using a bioregistry context and "curies" tool). One thing that should disappear from any modern software tool is any attempt at all to go between CURIE and IRI syntax with string hacking. Cc @cmungall.

cthoyt commented 1 year ago

actually, I am more optimistic now than ever before. I already put in most of the hard work to go over and over the OBO ontologies and identify the strange prefixes and get all of the synonyms curated in the Bioregistry. There are only a few abiguities like ICD referring to ICD10 or ICD9, which could potentially just be stored in ontology-specific configuration in the OLS.

I would LOVE to see some Bioregistry integration in OLS for operationalizing these CURIEs - I think this is totally doable. Would be happy to help however I can.

jamesamcl commented 1 year ago

@matentzn OK - I see your point but OLS3 already does some of these mappings. Unless OLS4 has 100% API compatibility with OLS3 there will be no OLS4. So they have to be implemented as a starting point.

@cthoyt That is encouraging. I could make OLS load this file pretty easily?

matentzn commented 1 year ago

Dont stress now about it, but if you have a minute at some point, can you say what "OLS3 already does some of these mappings" means? You mean using this db-xref.yml thing from GO? Or are you still talking about the String hacking?

jamesamcl commented 1 year ago

Yes I mean the db-xref.yml thing from GO. Is that not what you mean by string hacking?

matentzn commented 1 year ago

No as long as you use dbxref.yaml its not the worst thing in the world, but the better way is to create an OLS context on bioregistry (here is the OBO one: https://bioregistry.io/context/obo) and roll with that :) bioregistry even indexes the GO prefix yaml file: https://bioregistry.io/metaregistry/go.

String hacking is any place in the code that looks like an IRI is converted to a CURIE with things like "replace("_",":")", where an actual URI is used (if URL.startswith(http://obo.org): curie = URL.replace(http://obo.org, "") type stuff)..

Doing this right should not at all affect the API structurally!

cthoyt commented 1 year ago

@cthoyt That is encouraging. I could make OLS load this file pretty easily?

@udp this isn't the right file - the metaregistry.json only contains information about external registries. There are several more appropriate existing views over the Bioregistry that represent the same content (in different formats and to different degrees of detail):

  1. https://raw.githubusercontent.com/biopragmatics/bioregistry/main/exports/registry/registry.yml has a complete view over all prefixes and all metadata associated with them.
  2. https://raw.githubusercontent.com/biopragmatics/bioregistry/main/exports/registry/registry.json same file as above but in JSON
  3. https://github.com/biopragmatics/bioregistry/blob/main/exports/contexts/bioregistry.context.jsonld a simple JSON-LD context file, does not contain synonyms
  4. https://github.com/biopragmatics/bioregistry/blob/main/exports/contexts/bioregistry.epm.json focuses just on prefixes and expansions, includes synonyms

If you want me to implement an exporter that has an alternate format for loading in OLS, just tell me. I could potentially create something that looks like the GO db-xref.yaml file, too.

jamesamcl commented 1 year ago

@cthoyt thanks for the clarification! Option 2 looks perfect. Happy to write the code on our end to load it as OLS4 is still in very active development. I'll update here when done.

jamesamcl commented 1 year ago

@cthoyt is the uri_format in the JSON always intended to be resolvable URLs? Our current use case for this in OLS is making CURIEs into clickable links. It looks like uri_format does make URLs but wanted to check as the naming of the field doesn't indicate this (should it be url_format?)

cthoyt commented 1 year ago

@udp typically the uri_format is a resolvable URL, though this isn't a strict requirement. These aren't resolvable for two main reasons:

  1. Ontologies require URIs/IRIs, so sometimes a scheme is created that doesn't point to anything
  2. Services move/stop getting maintained/go down

Note that all have one $1 that is a placeholder for the local unique identifier from a CURIE.

jamesamcl commented 1 year ago

Thanks,

  1. This is really unfortunate because it means I can't in good conscience turn it into a link in OLS. It would be amazing if there was a way to differentiate between URIs/IRIs and URLs so I know what is safe to turn into a link.

  2. Don't care at all as is firmly not our problem :D

jamesamcl commented 1 year ago

Very exciting results so far though! Compare this:

Screenshot 2023-01-14 at 04 05 19

to the screenshot at the beginning of the issue.

jamesamcl commented 1 year ago

After implementing your comments from the PR @cthoyt :

Screenshot 2023-01-16 at 01 42 09
matentzn commented 1 year ago

This is pretty awesome. Please make sure that OBO ids redirect to OBO purls! :)

cthoyt commented 1 year ago

@udp what do the info boxes show? it would be nice to give feedback on some of these CURIEs, for example, if it has a valid prefix but an invalid identifier that doesn't match the regular expression. I think this is already demonstrated in this example - these ICD10EXP terms don't really make sense with their punctuation at the end

matentzn commented 1 year ago

I think the current behaviour is good, but a small question mark next to the CURIE without a URL would be good that opens a popup that says: "We could not determine a suitable linkout for this CURIE. If you know the meaning of the CURIE, report it at https://github.com/biopragmatics/bioregistry/issues/new".

I don't think unknown prefixes should be hidden, they should be shown and exposed to increase the incentive to reconcile.

This one is from MONDO, and it is because it is imported from Orphanet who use the a special language to encode ICD code ranges. The correct response here would be to tell the Mondo team to remove these cross references from the released ontology - not sure how can streamline that process (as the user does not know if it is a non-standard prefix or a unregistered standard prefix). What do you think?

In this case:

cthoyt commented 1 year ago

agree on all points with nico, but a few comments:

  1. ICD9 and ICD9CM are considered as different prefixes in Bioregistry ATM. There's a long discussion about ICD at https://github.com/biopragmatics/bioregistry/issues/251
  2. Orphanet is indeed available in Bioregistry - we've fixed the way that it's shown on OLS4 in my PR
  3. The ICD10EXP prefix appears to be valid ICD10 if you get rid of the punctuation at the end. I this case, we should provide feedback that this content is incorrect

All that being said, the bioregistry can be used to derive custom registries - Nico and I will support you to encode the appropriate rules for custom prefixes, URI formets, etc. that OLS needs. We did this already for the OBO foundry in https://bioregistry.io/context/obo.

matentzn commented 1 year ago

Great about 1/2, agreed with current line of thinking. I assume then they should be resolvable in the next updates?

The ICD10EXP prefix appears to be valid ICD10 if you get rid of the punctuation at the end. I this case, we should provide feedback that this content is incorrect

No, not quite. The punctuation means something that changes the meaning of the prefix... It should better not be released into the wild..

cthoyt commented 1 year ago

Great about 1/2, agreed with current line of thinking. I assume then they should be resolvable in the next updates?

The ICD10EXP prefix appears to be valid ICD10 if you get rid of the punctuation at the end. I this case, we should provide feedback that this content is incorrect

No, not quite. The punctuation means something that changes the meaning of the prefix... It should better not be released into the wild..

🙃

matentzn commented 1 year ago

Opened a new issue: https://github.com/monarch-initiative/mondo/issues/5791

jamesamcl commented 1 year ago

@cthoyt the info boxes show annotations on the annotation (reification), e.g. the source of the mapping.

Some great thoughts here - thanks for all of the interest! I like the idea of having a link to the issue tracker where mappings are missing.

jamesamcl commented 1 year ago

Fixes for this are lined up but tracking using issue #222