althonos / pronto

A Python frontend to (Open Biomedical) Ontologies.
https://pronto.readthedocs.io
MIT License

invalid datatype: 'http://www.w3.org/2000/01/rdf-schema#Literal' when parsing foodon #187

Closed cmungall closed 2 years ago

cmungall commented 2 years ago

Found while debugging mondo parsing with @hrshdhgd.

I have an OBO file that imports a foodon extract. By convention, ontologies always declare their imports against the .owl file, even in the .obo version, so without a mechanism like catalog-v001.xml to redirect imports, pronto always follows the import and tries to parse the RDF/XML. This results in an error.

This can be reproduced with a smaller example:

<?xml version="1.0"?>
<rdf:RDF xmlns="http://purl.obolibrary.org/obo/mondo/imports/foodon_import.owl#"
     xml:base="http://purl.obolibrary.org/obo/mondo/imports/foodon_import.owl"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:po="http://purl.oboInOwllibrary.org/oboInOwl/po#"
     xmlns:obo="http://purl.obolibrary.org/obo/"
     xmlns:owl="http://www.w3.org/2002/07/owl#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:xml="http://www.w3.org/XML/1998/namespace"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
     xmlns:oboInOwl="http://www.geneontology.org/formats/oboInOwl#">
    <owl:Ontology rdf:about="http://purl.obolibrary.org/obo/mondo/imports/foodon_import.owl">
        <owl:versionIRI rdf:resource="http://purl.obolibrary.org/obo/mondo/releases/2021-08-03/imports/foodon_import.owl"/>
    </owl:Ontology>

    <!-- http://purl.obolibrary.org/obo/FOODON_03309823 -->

    <owl:Class rdf:about="http://purl.obolibrary.org/obo/FOODON_03309823">
        <obo:IAO_0000119 rdf:datatype="http://www.w3.org/2000/01/rdf-schema#Literal">wikipedia:Shrimp_pastse</obo:IAO_0000119>
    </owl:Class>

</rdf:RDF>

The issue is the rdf:datatype declaration on the annotation value. I believe this is valid OWL (https://www.w3.org/TR/owl2-syntax/), although the declaration is conventionally dropped because it is implicit.
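To locate declarations like this in a larger import file, a minimal sketch using only the standard library (not pronto) could list every element that carries an rdf:datatype attribute. The `typed_literals` helper and the trimmed `SNIPPET` are illustrative names, not part of pronto:

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed copy of the reproduction above: only the namespaces the class needs.
SNIPPET = """<rdf:RDF xmlns:obo="http://purl.obolibrary.org/obo/"
     xmlns:owl="http://www.w3.org/2002/07/owl#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <owl:Class rdf:about="http://purl.obolibrary.org/obo/FOODON_03309823">
        <obo:IAO_0000119 rdf:datatype="http://www.w3.org/2000/01/rdf-schema#Literal">wikipedia:Shrimp_pastse</obo:IAO_0000119>
    </owl:Class>
</rdf:RDF>
"""


def typed_literals(xml_text):
    """Return (tag, datatype IRI, text) for each element with an rdf:datatype attribute."""
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():
        # ElementTree stores namespaced attributes in "{namespace}name" form.
        datatype = elem.get(f"{{{RDF_NS}}}datatype")
        if datatype is not None:
            found.append((elem.tag, datatype, elem.text))
    return found


literals = typed_literals(SNIPPET)
for tag, datatype, text in literals:
    print(tag, datatype, text)
```

Run against the full import file instead of `SNIPPET`, this should show which annotations declare rdfs:Literal explicitly.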

Stack trace:

  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pronto/ontology.py", line 283, in __init__
    cls(self).parse_from(_handle)  # type: ignore
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pronto/parsers/rdfxml.py", line 117, in parse_from
    self._extract_term(class_, curies)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pronto/parsers/rdfxml.py", line 446, in _extract_term
    termdata.annotations.add(self._extract_literal_pv(child))
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pronto/parsers/rdfxml.py", line 236, in _extract_literal_pv
    property, typing.cast(str, elem.text), self._compact_datatype(datatype)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pronto/parsers/rdfxml.py", line 182, in _compact_datatype
    raise ValueError(f"invalid datatype: {iri!r}")
ValueError: invalid datatype: 'http://www.w3.org/2000/01/rdf-schema#Literal'

althonos commented 2 years ago

Yes, this should be accepted as a builtin datatype; I'll make a patch.
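As a rough illustration of what "accepting it as a builtin datatype" could mean, here is a hypothetical sketch of a datatype-compaction step that treats rdfs:Literal as known. It is not pronto's actual `_compact_datatype` and not the actual patch: the `BUILTIN_PREFIXES` and `BUILTIN_DATATYPES` tables and the `compact_datatype` function are assumptions made for the example; only the error message mirrors the stack trace above.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical prefix table (assumption, not pronto's real one).
BUILTIN_PREFIXES = {XSD: "xsd", RDFS: "rdfs"}

# Hypothetical allow-list; the point is that "rdfs:Literal" is included.
BUILTIN_DATATYPES = {
    "xsd:string",
    "xsd:boolean",
    "xsd:integer",
    "xsd:anyURI",
    "rdfs:Literal",
}


def compact_datatype(iri):
    """Compact a full datatype IRI to a prefixed name, or raise ValueError."""
    for base, prefix in BUILTIN_PREFIXES.items():
        if iri.startswith(base):
            compact = f"{prefix}:{iri[len(base):]}"
            if compact in BUILTIN_DATATYPES:
                return compact
    # Same message format as in the stack trace above.
    raise ValueError(f"invalid datatype: {iri!r}")


print(compact_datatype("http://www.w3.org/2000/01/rdf-schema#Literal"))
```

With rdfs:Literal in the allow-list, the IRI from the reproduction compacts instead of raising; unknown IRIs still fail with the original message.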

althonos commented 2 years ago

Fixed in v2.5.1 hopefully.