althonos / pronto

A Python frontend to (Open Biomedical) Ontologies.
https://pronto.readthedocs.io
MIT License

is_a: DANGLING:123 causes parse error #226

Open matentzn opened 6 months ago

matentzn commented 6 months ago

We have the following problem in the latest Mondo release:

[Term]
id: MONDO:0021125
name: disease characteristic
def: "An attribute of a disease." [https://orcid.org/0000-0002-6601-2165]
synonym: "disease qualifier" EXACT []
synonym: "modifier" EXACT [NCIT:C41009]
synonym: "qualifier" EXACT [NCIT:C41009]
xref: NCIT:C41009 {source="MONDO:equivalentTo"}
is_a: PATO:0000001
property_value: exactMatch NCIT:C41009

Running:

runoak --input pronto:$< info MONDO:0000001

raises:

KeyError: 'PATO:0000001'

When running:

fastobo-validator mondo.obo
     Parsing `mondo.obo`
    Finished parsing `mondo.obo` in 0.73s
   Completed validation of `mondo.obo`

no errors are reported.

After removing the is_a statement above:

[Term]
id: MONDO:0021125
name: disease characteristic
def: "An attribute of a disease." [https://orcid.org/0000-0002-6601-2165]
synonym: "disease qualifier" EXACT []
synonym: "modifier" EXACT [NCIT:C41009]
synonym: "qualifier" EXACT [NCIT:C41009]
xref: NCIT:C41009 {source="MONDO:equivalentTo"}
property_value: exactMatch NCIT:C41009

the command succeeds as well:

runoak --input pronto:mondo.obo info MONDO:0000001 
MONDO:0000001 ! disease

Given that there are thousands of dangling classes in mondo.obo, what seems to be the problem here?
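To see which is_a targets in a file are actually dangling, a rough line-based scan can help. This is only a sketch: `find_dangling_is_a` is a hypothetical helper, not part of pronto or fastobo, and it assumes plain `[Term]` stanzas with one `id:` line each rather than implementing a full OBO parser.

```python
import re
from typing import Dict, Iterable, Set


def find_dangling_is_a(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each is_a target that has no [Term] stanza to the terms referencing it."""
    defined: Set[str] = set()
    references: Dict[str, Set[str]] = {}
    in_term = False
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are inspected here.
            in_term = line == "[Term]"
            current = None
        elif in_term and line.startswith("id:"):
            current = line[3:].strip()
            defined.add(current)
        elif in_term and current and line.startswith("is_a:"):
            # Drop trailing qualifiers ({...}) and comments (! ...).
            target = re.split(r"\s*[{!]", line[5:].strip(), maxsplit=1)[0].strip()
            references.setdefault(target, set()).add(current)
    return {t: subs for t, subs in references.items() if t not in defined}


sample = """\
[Term]
id: MONDO:0021125
name: disease characteristic
is_a: PATO:0000001

[Term]
id: MONDO:0000001
name: disease
""".splitlines()

print(find_dangling_is_a(sample))
```

With a real file, the same function could be fed an open file handle, for example `find_dangling_is_a(open("mondo.obo", encoding="utf-8"))`.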

gouttegd commented 6 months ago

The KeyError is raised by the symmetrize_lineage method of the pronto.parsers.base.BaseParser class:

def symmetrize_lineage(self):
    for getter in self._entities.values():
        entities, graphdata = getter(self.ont)
        for entity in entities():
            graphdata.lineage.setdefault(entity.id, Lineage())
        for subentity, lineage in graphdata.lineage.items():
            for superentity in lineage.sup:
                graphdata.lineage[superentity].sub.add(subentity)

which is itself called at the end of the OBO parser parse_from method:

def parse_from(self, handle, threads=None):
    […]
    # Update lineage cache with symmetric of `SubClassOf`
    self.symmetrize_lineage()

Overall, the code appears to assume that whenever a class is a subclass of another, the parent class exists somewhere in the graph. This ignores the possibility of dangling is_a references, which the OBO specification explicitly acknowledges (§6.1.2) and which the OBO Flat File Format Guide recommends (§S.3.4) be silently accepted without yielding an error.
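The failure mode can be reproduced outside pronto with a small stand-in. In the sketch below, `Lineage` is a simplified substitute for pronto's own class, the `EX:` identifiers are made up, and `symmetrize_tolerant` is just one possible way to skip dangling parents, not necessarily the fix the maintainers would choose.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Lineage:
    """Simplified stand-in for pronto's lineage record (assumption)."""
    sub: Set[str] = field(default_factory=set)
    sup: Set[str] = field(default_factory=set)


def symmetrize_strict(lineage: Dict[str, Lineage]) -> None:
    """Mirrors the quoted loop: indexes the parent directly."""
    for subentity, lin in lineage.items():
        for superentity in lin.sup:
            lineage[superentity].sub.add(subentity)  # KeyError if parent is missing


def symmetrize_tolerant(lineage: Dict[str, Lineage]) -> None:
    """Same loop, but silently skips parents that are not in the graph."""
    for subentity, lin in lineage.items():
        for superentity in lin.sup:
            parent = lineage.get(superentity)
            if parent is None:
                continue  # dangling is_a reference: nothing to update
            parent.sub.add(subentity)


def make_graph() -> Dict[str, Lineage]:
    return {
        "MONDO:0021125": Lineage(sup={"PATO:0000001"}),  # parent not defined
        "EX:child": Lineage(sup={"EX:parent"}),          # hypothetical IDs
        "EX:parent": Lineage(),
    }


try:
    symmetrize_strict(make_graph())
except KeyError as err:
    print("strict version failed with KeyError:", err)

graph = make_graph()
symmetrize_tolerant(graph)
print(graph["EX:parent"].sub)
```

Skipping the missing parent avoids mutating the dictionary while it is being iterated, which inserting placeholder entries inside the loop would do.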

cmungall commented 6 months ago

Potential duplicate of #225