INCATools / ontology-access-kit

Ontology Access Kit: A python library and command line application for working with ontologies
https://incatools.github.io/ontology-access-kit/
Apache License 2.0

adapter.entity_metadata_map("HP:0000001") with obo adapter causes `DuplicateURIPrefixes` error #702

Closed matentzn closed 5 months ago

matentzn commented 8 months ago

Version: oaklib 0.5.25

Replicates with both pronto and simpleobo adapters

Minimal test

from oaklib import get_adapter

example = """
format-version: 1.2
data-version: hp/releases/2024-02-25
ontology: hp.obo

[Term]
id: HP:0000001
name: All
"""

file_path = "example.obo"

# Open the file in write mode ('w'). This will create the file if it does not exist
# or overwrite it if it does.
with open(file_path, 'w') as file:
    # Write the string to the file
    file.write(example)

adapter = get_adapter("simpleobo:example.obo")
m = adapter.entity_metadata_map("HP:0000001")

Error: DuplicateURIPrefixes

```
DuplicateURIPrefixes                      Traceback (most recent call last)
Cell In[16], line 22
     19     file.write(example)
     21 adapter = get_adapter("simpleobo:example.obo")
---> 22 m = adapter.entity_metadata_map("HP:0000001")

File ~/.pyenv/versions/3.11.7/envs/babelon/lib/python3.11/site-packages/oaklib/implementations/simpleobo/simple_obo_implementation.py:620, in SimpleOboImplementation.entity_metadata_map(self, curie)
    618     m[DEPRECATED_PREDICATE].append(True)
    619     m[HAS_OBSOLESCENCE_REASON].append(TERMS_MERGED)
--> 620 self.add_missing_property_values(curie, m)
    621 return dict(m)

File ~/.pyenv/versions/3.11.7/envs/babelon/lib/python3.11/site-packages/oaklib/interfaces/basic_ontology_interface.py:1460, in BasicOntologyInterface.add_missing_property_values(self, curie, metadata_map)
   1458 if PREFIX_PREDICATE not in metadata_map:
   1459     metadata_map[PREFIX_PREDICATE] = [prefix]
-> 1460 uri = self.curie_to_uri(curie, False)
   1461 if uri:
   1462     if URL_PREDICATE not in metadata_map:

File ~/.pyenv/versions/3.11.7/envs/babelon/lib/python3.11/site-packages/oaklib/interfaces/basic_ontology_interface.py:240, in BasicOntologyInterface.curie_to_uri(self, curie, strict)
    238     raise ValueError(f"Invalid CURIE: {curie}")
    239 return None
--> 240 rv = self.converter.expand(curie)
    241 if rv is None and strict:
...
http://www.geneontology.org/formats/oboInOwl#:
    prefix='oio' uri_prefix='http://www.geneontology.org/formats/oboInOwl#' prefix_synonyms=[] uri_prefix_synonyms=[] pattern=None
    prefix='oboInOwl' uri_prefix='http://www.geneontology.org/formats/oboInOwl#' prefix_synonyms=[] uri_prefix_synonyms=[] pattern=None
```

gouttegd commented 8 months ago

You might need to share more details about your environment, because I cannot replicate here.

Tried in a clean virtualenv with the latest oaklib 0.5.25, it works just fine.

Also tried with a clean virtualenv setup with babelon 0.2.4 (in case the problem came from a Babelon-specific dependency), same: no errors at all.

Full list of packages with their version:

```
Package                    Version
-------------------------- ---------------
airium                     0.2.6
annotated-types            0.6.0
antlr4-python3-runtime     4.9.3
anyio                      4.3.0
appdirs                    1.4.4
arrow                      1.3.0
attrs                      23.2.0
Babel                      2.14.0
babelon                    0.2.4
bcp47                      0.0.4
beautifulsoup4             4.12.3
cattrs                     23.2.3
certifi                    2024.2.2
CFGraph                    0.2.1
chardet                    5.2.0
charset-normalizer         3.3.2
class_resolver             0.4.3
click                      8.1.7
click-default-group        1.2.4
colorama                   0.4.6
curies                     0.7.7
Deprecated                 1.2.14
deprecation                2.1.0
distro                     1.9.0
EditorConfig               0.12.4
et-xmlfile                 1.1.0
eutils                     0.6.0
fastobo                    0.12.3
fqdn                       1.5.1
funowl                     0.2.3
ghp-import                 2.1.0
graphviz                   0.20.1
h11                        0.14.0
hbreader                   0.9.1
httpcore                   1.0.4
httpx                      0.27.0
idna                       3.6
ijson                      3.2.3
importlib-metadata         7.0.1
importlib_resources        6.1.2
iniconfig                  2.0.0
isodate                    0.6.1
isoduration                20.11.0
Jinja2                     3.1.3
jsbeautifier               1.15.1
json-flattener             0.1.9
jsonasobj                  1.3.1
jsonasobj2                 1.0.4
jsonpatch                  1.33
jsonpath-ng                1.6.1
jsonpointer                2.4
jsonschema                 4.21.1
jsonschema-specifications  2023.12.1
kgcl-rdflib                0.5.0
kgcl_schema                0.6.4
lark                       1.1.9
linkml                     1.7.5
linkml-dataops             0.1.0
linkml-renderer            0.3.0
linkml-runtime             1.7.2
llm                        0.13.1
lxml                       5.1.0
Markdown                   3.5.2
MarkupSafe                 2.1.5
mergedeep                  1.3.4
mkdocs                     1.5.3
mkdocs-material            9.5.11
mkdocs-material-extensions 1.3.1
mkdocs-mermaid2-plugin     0.6.0
more-click                 0.1.2
ndex2                      3.8.0
networkx                   3.2.1
numpy                      1.26.4
oaklib                     0.5.25
ols-client                 0.1.4
ontoportal-client          0.0.4
openai                     1.12.0
openpyxl                   3.1.2
packaging                  23.2
paginate                   0.5.6
pandas                     2.2.1
pansql                     0.0.1
parse                      1.20.1
pathspec                   0.12.1
pip                        23.3.2
platformdirs               4.2.0
pluggy                     1.4.0
ply                        3.11
prefixcommons              0.1.12
prefixmaps                 0.2.2
pronto                     2.5.6
pydantic                   2.6.3
pydantic_core              2.16.3
Pygments                   2.17.2
PyJSG                      0.11.10
pymdown-extensions         10.7
pyparsing                  3.1.1
PyShEx                     0.8.1
PyShExC                    0.9.1
pysolr                     3.9.0
pystow                     0.5.3
pytest                     8.0.2
pytest-logging             2015.11.4
python-dateutil            2.8.2
python-dotenv              1.0.1
python-ulid                2.2.0
PyTrie                     0.4.0
pytz                       2024.1
PyYAML                     6.0.1
pyyaml_env_tag             0.1
ratelimit                  2.2.1
rdflib                     7.0.0
rdflib-jsonld              0.6.1
rdflib-shim                1.0.3
referencing                0.33.0
regex                      2023.12.25
requests                   2.31.0
requests-cache             1.2.0
requests-toolbelt          1.0.0
rfc3339-validator          0.1.4
rfc3987                    1.3.8
rpds-py                    0.18.0
ruamel.yaml                0.18.6
ruamel.yaml.clib           0.2.8
scipy                      1.12.0
semsimian                  0.2.12
semsql                     0.3.3
setuptools                 69.0.3
ShExJSG                    0.8.2
six                        1.16.0
sniffio                    1.3.1
sortedcontainers           2.4.0
soupsieve                  2.5
sparqlslurper              0.5.1
SPARQLWrapper              2.0.0
SQLAlchemy                 2.0.27
SQLAlchemy-Utils           0.38.3
sqlite-fts4                1.0.3
sqlite-migrate             0.1b0
sqlite-utils               3.36
sssom                      0.4.4
sssom-schema               0.15.0
tabulate                   0.9.0
tqdm                       4.66.2
types-python-dateutil      2.8.19.20240106
typing_extensions          4.10.0
tzdata                     2024.1
uri-template               1.3.0
url-normalize              1.4.3
urllib3                    2.2.1
validators                 0.22.0
watchdog                   4.0.0
webcolors                  1.13
wheel                      0.42.0
wrapt                      1.16.0
xmltodict                  0.13.0
zipp                       3.17.0
```

matentzn commented 8 months ago

The joy of pip install -U. :/ Thanks for making me think in this direction (other dependencies). It was, indeed, an older 0.6.X curies version that caused the issue. Sorry about the noise.

matentzn commented 8 months ago

Reopening as it was indeed an issue. This does not work:

from oaklib import get_adapter

example = """
format-version: 1.2
data-version: hp/releases/2024-02-25
default-namespace: human_phenotype
idspace: dc http://purl.org/dc/elements/1.1/ 
idspace: oboInOwl http://www.geneontology.org/formats/oboInOwl# 
idspace: owl http://www.w3.org/2002/07/owl# 
idspace: rdf http://www.w3.org/1999/02/22-rdf-syntax-ns# 
idspace: rdfs http://www.w3.org/2000/01/rdf-schema# 
idspace: terms http://purl.org/dc/terms/ 
idspace: xml http://www.w3.org/XML/1998/namespace 
idspace: xsd http://www.w3.org/2001/XMLSchema# 
ontology: hp.obo

[Term]
id: HP:0000001
name: All
"""

file_path = "example.obo"

# Open the file in write mode ('w'). This will create the file if it does not exist
# or overwrite it if it does.
with open(file_path, 'w') as file:
    # Write the string to the file
    file.write(example)

adapter = get_adapter("pronto:example.obo")
m = adapter.entity_metadata_map("HP:0000001")
print(m)

If you remove

idspace: oboInOwl http://www.geneontology.org/formats/oboInOwl# 

it works. This suggests that we need to somehow handle this for the day when @balhoff's PR is merged.

gouttegd commented 8 months ago

As far as I understand, the problem is as follows:

1) The BasicOntologyInterface’s prefix_map() default implementation creates a default prefix map made of the “OBO context”. Presumably the OBO context map contains an entry oio -> http://www.geneontology.org/formats/oboInOwl#.

2) The ProntoImplementation’s __post_init__() method adds to that default prefix map the prefixes declared in the OBO file’s idspace tags:

for prefix, expansion in ontology.metadata.idspaces.items():
    self.prefix_map()[prefix] = expansion[0]

(The SimpleOboImplementation does the same thing.)

3) Now the prefix map contains both oio -> http://www.geneontology.org/formats/oboInOwl# (from the OBO context) and oboInOwl -> http://www.geneontology.org/formats/oboInOwl# (from the ontology’s own map).

4) The curies converter does not like that at all and errors out.
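
The four steps above can be sketched without oaklib or curies. This is only an illustration: the contents of the default map are an assumption (the real OBO context is much larger), and `find_duplicate_uri_prefixes` is a hypothetical helper, not part of either library.

```python
from collections import defaultdict

OIO = "http://www.geneontology.org/formats/oboInOwl#"


def find_duplicate_uri_prefixes(prefix_map):
    """Return {uri_prefix: [prefix names]} for URI prefixes bound to several names."""
    names_by_uri = defaultdict(list)
    for name, uri_prefix in prefix_map.items():
        names_by_uri[uri_prefix].append(name)
    return {uri: names for uri, names in names_by_uri.items() if len(names) > 1}


# Step 1: stand-in for the default "OBO context" map (contents assumed).
prefix_map = {"oio": OIO, "HP": "http://purl.obolibrary.org/obo/HP_"}

# Step 2: naive merge of the idspace declarations found in the OBO file.
idspaces = {"oboInOwl": OIO}
for prefix, expansion in idspaces.items():
    prefix_map[prefix] = expansion

# Step 3: two prefix names now point to the same URI prefix.
duplicates = find_duplicate_uri_prefixes(prefix_map)
print(duplicates)
```

A strict converter built from `prefix_map` at this point is what rejects the map in step 4.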

I am not sure I understand why having two prefix names pointing to the same URI prefix must be an error. The other way round (the same prefix name pointing to two different URI prefixes) would obviously be wrong, but that cannot happen here, since an existing prefix name in the OBO context is automatically replaced by the declared one. I do not see why it should be wrong in this direction.

Anyway, if we do consider it wrong to have two prefix names pointing to the same URI prefix, both the Pronto and the SimpleOBO implementations must be amended, because the two-line loop shown above is too naive: instead of simply adding the content of the idspace declarations to the existing prefix map, it must first check whether the prefix map already contains another prefix name pointing to the same URI prefix, and remove it.
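
A minimal sketch of that amendment, assuming the prefix map is a plain `dict` of prefix name to URI prefix. `merge_idspaces` is a hypothetical name, and the real fix in oaklib may look different.

```python
OIO = "http://www.geneontology.org/formats/oboInOwl#"


def merge_idspaces(prefix_map, idspaces):
    """Add idspace declarations, dropping older names bound to the same URI prefix."""
    for prefix, uri_prefix in idspaces.items():
        # Collect existing names that already point to this URI prefix.
        stale = [
            name
            for name, uri in prefix_map.items()
            if uri == uri_prefix and name != prefix
        ]
        for name in stale:
            del prefix_map[name]
        # The name declared in the ontology file takes over.
        prefix_map[prefix] = uri_prefix
    return prefix_map


merged = merge_idspaces(
    {"oio": OIO, "HP": "http://purl.obolibrary.org/obo/HP_"},
    {"oboInOwl": OIO},
)
print(merged)
```

One consequence of this choice is that `oio` disappears from the map, so CURIEs written with the `oio:` prefix could no longer be expanded for that ontology.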

gouttegd commented 8 months ago

By the way, anyone could run into this problem at any time, independently of @balhoff’s PR. His PR merely makes it more likely to come across OBO files containing idspace tags, but anyone can already craft OBO files with such tags if they want.

hrshdhgd commented 8 months ago

Solution to this: In basic_ontology_interface.py, this line needs to be

self._converter = curies.Converter.from_prefix_map(self.prefix_map(), strict=False)

This asks the curies package to be less strict and allow duplicate prefixes. As you can see it's an easy fix.

The questions are:

cc: @cmungall

matentzn commented 8 months ago

Another possible fix would be to fix

https://github.com/INCATools/ontology-access-kit/blob/15bf85cefc2fe8541b38aabfcf7c65eb46bc1231/src/oaklib/interfaces/basic_ontology_interface.py#L58

That is, the way the prefix map is constructed. If it were built the way we do it in sssom-py (with ChainMap), it would allow the creation of a prefix map with precedence rules that would result in a consistent final product. I assume that having conflicting prefix maps (multiple prefixes for the same URI) could be confusing in day-to-day business.
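
A rough sketch of the precedence idea, using only the standard library. How sssom-py actually does this is not shown in this thread, so everything below is an assumption: the map contents are invented (the `http://example.org/hp#` expansion is purely illustrative) and `collapse` is a hypothetical helper.

```python
from collections import ChainMap

OIO = "http://www.geneontology.org/formats/oboInOwl#"

# Lower-precedence defaults (stand-in for the OBO context, contents assumed).
obo_context = {"oio": OIO, "HP": "http://purl.obolibrary.org/obo/HP_"}
# Higher-precedence declarations taken from the ontology file (illustrative).
declared = {"oboInOwl": OIO, "HP": "http://example.org/hp#"}

# Lookups consult `declared` first, then fall back to `obo_context`.
layered = ChainMap(declared, obo_context)


def collapse(layers):
    """Flatten a ChainMap, keeping one prefix name per URI prefix.

    Layers are visited from highest to lowest precedence, so the first
    name seen for a given URI prefix wins.
    """
    result = {}
    seen_uris = set()
    for layer in layers.maps:
        for name, uri in layer.items():
            if name in result or uri in seen_uris:
                continue
            result[name] = uri
            seen_uris.add(uri)
    return result


collapsed = collapse(layered)
print(dict(layered))
print(collapsed)
```

In this sketch `ChainMap` on its own only settles clashes on prefix names (`HP` here): both `oio` and `oboInOwl` are still present in `layered`. The extra pass in `collapse` is what leaves a single name per URI prefix.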