biopragmatics / bioregistry

📮 An integrative registry of biological databases, ontologies, and nomenclatures.
https://bioregistry.io
MIT License

`curie_from_iri()` fails when angle brackets present in URI #479

Closed joeflack4 closed 2 years ago

joeflack4 commented 2 years ago

Overview

I know that we are moving to OAK, but I've been having a lot of trouble getting my OAK use cases to work so far, and @matentzn recommended that I try this bioregistry function for now.

Example

>>> curie_from_iri(
...    'https://icd.who.int/browse10/2019/en#/R63.8',
...    prefix_map={'ICD10WHO': 'https://icd.who.int/browse10/2019/en#/'},
... )
'ICD10WHO:R63.8'

>>> curie_from_iri(
...    '<https://icd.who.int/browse10/2019/en#/R63.8>',
...    prefix_map={'ICD10WHO': 'https://icd.who.int/browse10/2019/en#/'},
... )
None

Possible solutions

Perhaps the eventual solution is to refactor bioregistry to use oaklib for this.

cthoyt commented 2 years ago

First, it's interesting to note you are hacking in an ICD10WHO prefix since that's not in the Bioregistry. There's an ongoing discussion about ICD prefixes in https://github.com/biopragmatics/bioregistry/issues/251 and a list of existing ICD prefixes at https://bioregistry.io/collection/0000004. I'm happy to accept suggestions to add additional prefixes if there's a compelling case why they are different from existing prefixes (though nobody has yet written their thoughts in a cohesive, actionable way).

>>> curie_from_iri(
...    'https://icd.who.int/browse10/2019/en#/R63.8',
...    prefix_map={'ICD10WHO': 'https://icd.who.int/browse10/2019/en#/'},
... )
'ICD10WHO:R63.8'

>>> curie_from_iri(
...    '<https://icd.who.int/browse10/2019/en#/R63.8>',
...    prefix_map={'ICD10WHO': 'https://icd.who.int/browse10/2019/en#/'},
... )
None

Valid IRIs don't have chevrons (`<` and `>`) around them. Perhaps these are artifacts from reading an RDF document directly? You can simply strip them during pre-processing, e.g. with `s.lstrip("<").rstrip(">")`, to recover a valid IRI. I think it's reasonable for `bioregistry.curie_from_iri()` to continue accepting only valid IRIs, so we're not going to do anything to address this within the Bioregistry package.
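
A minimal sketch of that pre-processing step follows. The helper names `strip_chevrons` and `compress_sketch` are hypothetical and not part of the Bioregistry API; `compress_sketch` is only a simplified stand-in for `curie_from_iri()` so the example runs without the package installed, and it makes no claim about how the Bioregistry does the matching internally.

```python
from typing import Mapping, Optional


def strip_chevrons(iri: str) -> str:
    """Remove one pair of surrounding angle brackets, if both are present."""
    iri = iri.strip()
    if iri.startswith("<") and iri.endswith(">"):
        return iri[1:-1]
    return iri


def compress_sketch(iri: str, prefix_map: Mapping[str, str]) -> Optional[str]:
    """Hypothetical stand-in for curie_from_iri(): longest URI prefix wins."""
    for prefix, uri_prefix in sorted(prefix_map.items(), key=lambda kv: -len(kv[1])):
        if iri.startswith(uri_prefix):
            return f"{prefix}:{iri[len(uri_prefix):]}"
    return None


prefix_map = {"ICD10WHO": "https://icd.who.int/browse10/2019/en#/"}
raw = "<https://icd.who.int/browse10/2019/en#/R63.8>"

# Without cleaning, nothing matches; after cleaning, the IRI compresses.
print(compress_sketch(raw, prefix_map))
print(compress_sketch(strip_chevrons(raw), prefix_map))
```

With the real package, the cleaned string would be passed to `curie_from_iri()` with the same `prefix_map` argument shown in the example above.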

> Perhaps the eventual solution is to refactor bioregistry to use oaklib for this.

The Bioregistry is a general tool, and oaklib is an ontology-specific tool (with many OBO-specific and even project-specific assumptions), so this doesn't make sense. That being said, there are a lot of tools built into the Bioregistry to support ontology/OBO-specific use cases.

Further, the Bioregistry doesn't have any major dependencies for its normal functionality, and it is advantageous to keep it that way so it can be better integrated in other projects.

https://github.com/biopragmatics/bioregistry/blob/94bb381ad342ac1e5151ca98432f502398ab2cf1/setup.cfg#L46-L53

joeflack4 commented 2 years ago

> First, it's interesting to note you are hacking in an ICD10WHO prefix since that's not in the Bioregistry.

Nico recommended that I use bioregistry as more of a library in the short term, like OAK. I understand that this isn't the primary use case.

Regarding the ICD10WHO prefix itself, it tends to be the same as "ICD10". However, for Mondo work we decided to add the WHO part for disambiguation, because sometimes "ICD10" meant ICD10CM and other times it meant the WHO variation (and perhaps other variations). This IMO is a mistake on WHO's end.

> ongoing discussion about ICD prefixes in https://github.com/biopragmatics/bioregistry/issues/251 ... I'm happy to accept suggestions to add additional prefixes

I looked at the issue, and I remember seeing it before. Actually, it looks like the ICD10WHO prefix name is thoroughly discussed in there. As far as the URI goes, Mondo has moved from https://icd.who.int/browse10/2010/en#/ to https://icd.who.int/browse10/2019/en#/ (updated year). @matentzn, unfortunately this does not seem very stable, since the year in the URL appears somewhat arbitrary. We don't maintain ICD10WHO of course, so I'm not sure there is anything better we can do than periodically switch our prefix URI to the latest browser version.
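
One way to cope with the year changing in the URI prefix would be to keep every known variant and map them all to the same CURIE prefix. A minimal sketch under that assumption follows; the names `ICD10WHO_URI_PREFIXES` and `compress_any_year` are hypothetical, and only the 2010 and 2019 URLs come from this thread.

```python
from typing import Optional

# Both browser-year variants mentioned in this thread, mapped to one CURIE prefix.
ICD10WHO_URI_PREFIXES = [
    "https://icd.who.int/browse10/2019/en#/",
    "https://icd.who.int/browse10/2010/en#/",
]


def compress_any_year(iri: str, prefix: str = "ICD10WHO") -> Optional[str]:
    """Return a CURIE if the IRI starts with any known URI prefix, else None."""
    for uri_prefix in ICD10WHO_URI_PREFIXES:
        if iri.startswith(uri_prefix):
            return f"{prefix}:{iri[len(uri_prefix):]}"
    return None


# IRIs minted against either browser year compress to the same CURIE.
print(compress_any_year("https://icd.who.int/browse10/2010/en#/R63.8"))
print(compress_any_year("https://icd.who.int/browse10/2019/en#/R63.8"))
```

This still requires adding a new entry whenever a newer browser year appears, so it reduces the breakage rather than removing the instability.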

> Valid IRIs don't have chevrons <> around them. Perhaps these are artifacts from directly reading an RDF document? You can simply strip your string... ...That being said, there are a lot of tools built in to the Bioregistry to support ontology/OBO-specific use cases.

They are, and I eventually did. I was advised to use this as a library; I understand that isn't the primary intended use case. But it looks like you are saying (RE: ontology/OBO) that ontology engineering library functions are an intended use case. If so, then working around these chevrons should be supported.

> Bioregistry doesn't have any major dependencies

I'm surprised. But I hear you there. If there's no major gain from including additional dependencies, we might as well leave them out.