fastobo.id.is_valid returns False for URLs containing non-ASCII characters

jggatter commented 5 months ago

Hello,

The newest release of hancestro.owl adds an entry with resources that are problematic for parsing:

    <!-- http://purl.obolibrary.org/obo/AfPO_0000285 -->

    <owl:Class rdf:about="http://purl.obolibrary.org/obo/AfPO_0000285">
        <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/AfPO_0000281"/>
        <rdfs:subClassOf>
            <owl:Restriction>
                <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/HANCESTRO_0308"/>
                <owl:someValuesFrom rdf:resource="http://dbpedia.org/resource/Republic_of_the_Congo"/>
            </owl:Restriction>
        </rdfs:subClassOf>
        <obo:AfPO_0000089>Efe
1°53&apos;N, 29°18&apos;E
1.88, 29.29</obo:AfPO_0000089>
        <obo:AfPO_0000089>Lese
1°18&apos;N, 29°18&apos;E
1.30, 29.30</obo:AfPO_0000089>
        <obo:AfPO_0000223 rdf:resource="http://glottolog.org/resource/languoid/id/efee1239.bigmap.html"/>
        <obo:AfPO_0000223 rdf:resource="http://glottolog.org/resource/languoid/id/lese1243.bigmap.html"/>
        <obo:AfPO_0000230 rdf:resource="http://glottolog.org/resource/languoid/id/efee1239"/>
        <obo:AfPO_0000230 rdf:resource="http://glottolog.org/resource/languoid/id/lese1243"/>
        <obo:AfPO_0000233 rdf:datatype="http://www.w3.org/2001/XMLSchema#anyURI">https://www.google.com/search?tbm=bks&amp;hl=fr&amp;q=%28+&apos;Efé+people&apos;+OR+&apos;Efé+tribe&apos;+%29+AND+&apos;Africa&apos;</obo:AfPO_0000233>
        <obo:AfPO_0000234 rdf:resource="https://en.wikipedia.org/wiki/Lese_language"/>
        <obo:AfPO_0000235 rdf:resource="https://en.wikipedia.org/wiki/Efé_people"/>
        <obo:AfPO_0000452>efee1239</obo:AfPO_0000452>
        <obo:AfPO_0000452>lese1243</obo:AfPO_0000452>
        <obo:AfPO_0000458>Lese</obo:AfPO_0000458>
        <obo:AfPO_0000459>0.07 million</obo:AfPO_0000459>
        <obo:AfPO_0000565>Nilo-Saharan</obo:AfPO_0000565>
        <obo:IAO_0000115>A Pygmy Central population with a population size of 0.07 million</obo:IAO_0000115>
        <rdfs:label>Efé</rdfs:label>
    </owl:Class>
    <owl:Axiom>
        <owl:annotatedSource rdf:resource="http://purl.obolibrary.org/obo/AfPO_0000285"/>
        <owl:annotatedProperty rdf:resource="http://purl.obolibrary.org/obo/IAO_0000115"/>
        <owl:annotatedTarget>A Pygmy Central population with a population size of 0.07 million</owl:annotatedTarget>
        <oboInOwl:hasDbXref>https://en.wikipedia.org/wiki/Efé_people</oboInOwl:hasDbXref>
    </owl:Axiom>

Pronto is unable to parse the above resource, <obo:AfPO_0000235 rdf:resource="https://en.wikipedia.org/wiki/Efé_people"/>:

  File "/home/drnk/miniconda3/envs/cynapse-env/lib/python3.9/site-packages/cyvocab/ontologies/_ontologies.py", line 61, in ontology
    ontology = pronto.ontology.Ontology(handle)
  File "/home/drnk/miniconda3/envs/cynapse-env/lib/python3.9/site-packages/pronto/ontology.py", line 283, in __init__
    cls(self).parse_from(_handle)  # type: ignore
  File "/home/drnk/miniconda3/envs/cynapse-env/lib/python3.9/site-packages/pronto/parsers/rdfxml.py", line 117, in parse_from
    self._extract_term(class_, curies)
  File "/home/drnk/miniconda3/envs/cynapse-env/lib/python3.9/site-packages/pronto/parsers/rdfxml.py", line 462, in _extract_term
    termdata.annotations.add(self._extract_resource_pv(child))
  File "/home/drnk/miniconda3/envs/cynapse-env/lib/python3.9/site-packages/pronto/parsers/rdfxml.py", line 229, in _extract_resource_pv
    return ResourcePropertyValue(property, resource)
  File "/home/drnk/miniconda3/envs/cynapse-env/lib/python3.9/site-packages/pronto/utils/meta.py", line 96, in newfunc
    return func(*args, **kwargs)
  File "/home/drnk/miniconda3/envs/cynapse-env/lib/python3.9/site-packages/pronto/pv.py", line 104, in __init__
    raise ValueError("invalid identifier: {}".format(resource))
ValueError: invalid identifier: https://en.wikipedia.org/wiki/Efé_people

See https://github.com/althonos/pronto/blob/master/pronto/pv.py#L104

The testing below suggests that the é character is to blame:

Python 3.9.19 (main, Apr  6 2024, 17:57:55) 
[GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fastobo
>>> fastobo.id.is_valid('https://en.wikipedia.org/wiki/Efé_people')
False
>>> fastobo.id.is_valid('https://en.wikipedia.org/wiki/Efe_people')
True
>>> fastobo.id.is_valid('https://en.wikipedia.org/wiki/Ef%C3%A9_people')
True

I don't know much about the OBO standards, so perhaps this is the intended behavior in fastobo. In any case, I felt it was worth asking about here! I'll report this to HANCESTRO as well to see if they can use the percent-encoded version shown in the last example above.
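For reference, the percent-encoded form can be produced with Python's standard library. This is only a minimal sketch of the workaround, not something pronto or fastobo does for you; the helper name `percent_encode_url` is made up for illustration.

```python
from urllib.parse import quote

# Hypothetical helper for illustration; not part of pronto or fastobo.
def percent_encode_url(url: str) -> str:
    # Keep URL delimiters and existing "%XX" escapes untouched. quote()
    # percent-encodes every other character outside its always-safe set
    # (ASCII letters, digits, "_.-~"), using UTF-8 bytes for non-ASCII.
    return quote(url, safe=":/?#[]@!$&'()*+,;=%")

encoded = percent_encode_url("https://en.wikipedia.org/wiki/Efé_people")
print(encoded)  # https://en.wikipedia.org/wiki/Ef%C3%A9_people
```

Because `%` is listed as safe, a URL that is already encoded passes through unchanged.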

Thanks, James

jggatter commented 5 months ago

It seems like a good guideline for resource URLs to contain only ASCII characters, even if web browsers can handle non-ASCII characters like the é in the example above. I could see other ontologies not enforcing this, though.

matentzn commented 5 months ago

This issue is actually two issues I believe:

1. While the spec is not all too clear about this, the value of xrefs should be interpreted as IDs, which means they should be interpreted as IRIs in any of the RDF serialisations, which means they have full UTF-8 support. However, in practice, we interpret the values of xrefs as strings, and our prevalent serialisers do so too.
2. The OBO format explicitly says that it supports UTF-8, so pronto should, according to that, not fail here.

I am a bit torn. We usually recommend encoding non-ASCII characters to ensure interoperability across systems, but technically, they should be allowed. I would probably recommend to:

1. Allow them in pronto/fastobo as the spec permits them
2. Change them in the ontology anyways for greater consistency and interoperability
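As a rough illustration of the second recommendation, the sketch below percent-encodes the values of `rdf:resource` and `rdf:about` attributes in a chunk of RDF/XML text. It is an assumption-laden sketch rather than an official tool: the function name `encode_rdf_resources` is hypothetical, the regular expression only handles double-quoted attributes written with the `rdf:` prefix, and element text (such as the `oboInOwl:hasDbXref` value on the axiom) is left untouched.

```python
import re
from urllib.parse import quote

# Assumption: attributes are double-quoted and use the literal "rdf:" prefix.
_RDF_ATTR = re.compile(r'(rdf:(?:resource|about)=")([^"]*)(")')

# Hypothetical helper for illustration; not part of pronto, fastobo, or ROBOT.
def encode_rdf_resources(rdfxml: str) -> str:
    def _fix(match: re.Match) -> str:
        prefix, iri, suffix = match.groups()
        # Keep URL delimiters, XML entities ("&amp;") and existing escapes.
        return prefix + quote(iri, safe=":/?#[]@!$&'()*+,;=%") + suffix
    return _RDF_ATTR.sub(_fix, rdfxml)

sample = (
    '<obo:AfPO_0000235 rdf:resource="https://en.wikipedia.org/wiki/Efé_people"/>\n'
    '<rdfs:label>Efé</rdfs:label>'
)
print(encode_rdf_resources(sample))
```

Only the attribute value changes; the `rdfs:label` keeps its é, since labels are plain literals rather than identifiers.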

jggatter commented 4 months ago

Thanks for the quick reply @matentzn! Sorry I am slow to respond. In the HANCESTRO issue I opened, https://github.com/EBISPOT/hancestro/issues/58, I informed them of your response.

Just curious, when could UTF-8 support be expected in pronto/fastobo? I'm not blocked by this issue, so it's no longer urgent to me. I'll continue using an older version of the HANCESTRO ontology.

matentzn commented 4 months ago

> when could UTF-8 support be expected in pronto/fastobo? I'm not blocked by this issue, so it's no longer urgent to me

This is an @althonos question!

althonos commented 4 months ago

I transferred this issue to the fastobo repo, since this is a syntax issue. Either I fucked up the RFC 3987 syntax implementation for IRIs, or there is a bug that causes the URL to be parsed as a prefixed identifier instead of an IRI ...
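For context on the first hypothesis: the RFC 3987 grammar extends the unreserved character set with `ucschar`, so é (U+00E9) should be acceptable in an IRI path segment without escaping. The sketch below transcribes the `ucschar` ranges from the RFC's ABNF into Python; the function names are made up for illustration, and it says nothing about where fastobo's parser actually goes wrong.

```python
# ucschar ranges from the RFC 3987 ABNF (inclusive code point bounds).
UCSCHAR_RANGES = [
    (0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF),
    # Planes 1 through 13: %x10000-1FFFD ... %xD0000-DFFFD
    *[(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 14)],
    (0xE1000, 0xEFFFD),
]

def is_ucschar(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)

def is_iunreserved(ch: str) -> bool:
    # iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
    return (ch.isascii() and (ch.isalnum() or ch in "-._~")) or is_ucschar(ch)

print(is_ucschar("é"))      # True
print(is_iunreserved("é"))  # True
```

If the grammar check passes for every character in the URL, the second hypothesis (the string being routed to the prefixed-identifier rule) becomes the more likely culprit.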

fastobo / fastobo

fastobo.id.is_valid returns False for URLs containing non-ASCII characters #68