ebi-chebi / ChEBI

Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on ‘small’ chemical compounds.
https://www.ebi.ac.uk/chebi
Creative Commons Attribution 4.0 International

Current OBO file fails to parse via OWL API (Protege, ROBOT) #4273

Open balhoff opened 2 years ago

balhoff commented 2 years ago

There is a term in the ChEBI OBO file with a label containing an opening brace:

[Term]
id: CHEBI:187876
name: N(4)-{beta-D-GlcNAc-(1->2)-alpha-D-Man-(1->3)-[alpha-D-Man-(1->6)]-beta-D-Man-(1->4)-beta-D-GlcNAc-(1->4)-[alpha-D-Fuc-(1->6)]-beta-D-GlcNAc}-Asn residue
is_obsolete: true

In OBO syntax the opening brace starts a trailing qualifier, leading to a parse error. The brace needs to be escaped like this:

name: N(4)-\{beta-D-GlcNAc-(1->2)-alpha-D-Man-(1->3)-[alpha-D-Man-(1->6)]-beta-D-Man-(1->4)-beta-D-GlcNAc-(1->4)-[alpha-D-Fuc-(1->6)]-beta-D-GlcNAc}-Asn residue
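A minimal sketch of how such lines could be located and patched before parsing. It is a heuristic that only looks at `name:` clauses and treats any `{` not preceded by a backslash as unescaped, so it would also flag a genuine trailing qualifier on a name line. The function names `find_unescaped_braces` and `escape_open_braces` are hypothetical and not part of any OBO tooling.

```python
import re

# An opening brace that is not preceded by a backslash.
# Heuristic only: an escaped backslash directly before the brace is not handled.
UNESCAPED_OPEN_BRACE = re.compile(r"(?<!\\)\{")


def find_unescaped_braces(lines):
    """Yield (line_number, line) for each name clause with an unescaped '{'."""
    for lineno, line in enumerate(lines, start=1):
        if line.startswith("name:") and UNESCAPED_OPEN_BRACE.search(line):
            yield lineno, line.rstrip("\n")


def escape_open_braces(line):
    """Return the line with every unescaped '{' replaced by a backslash plus '{'."""
    return UNESCAPED_OPEN_BRACE.sub(lambda _match: "\\{", line)
```

Passing an open file handle for `chebi.obo` to `find_unescaped_braces` would list candidate lines. `escape_open_braces` leaves braces that are already escaped untouched, so it can be applied more than once without changing the result.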

Here is the stack trace output from ROBOT:

Parser: org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser@60e5272
    Stack trace:
LINENO: 2923981 - Missing '=' in trailing qualifier block. This might happen for not properly escaped '{', '}' chars in comments.
LINE: name: N(4)-{beta-D-GlcNAc-(1->2)-alpha-D-Man-(1->3)-[alpha-D-Man-(1->6)]-beta-D-Man-(1->4)-beta-D-GlcNAc-(1->4)-[alpha-D-Fuc-(1->6)]-beta-D-GlcNAc}-Asn residue
        org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser.parse(OBOFormatOWLAPIParser.java:60)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyFactoryImpl.loadOWLOntology(OWLOntologyFactoryImpl.java:220)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.actualParse(OWLOntologyManagerImpl.java:1254)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntology(OWLOntologyManagerImpl.java:1208)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntologyFromOntologyDocument(OWLOntologyManagerImpl.java:1165)
        org.obolibrary.robot.IOHelper.loadOntology(IOHelper.java:539)
        org.obolibrary.robot.IOHelper.loadOntology(IOHelper.java:425)
        org.obolibrary.robot.IOHelper.loadOntology(IOHelper.java:306)
        org.obolibrary.robot.CommandLineHelper.getInputOntology(CommandLineHelper.java:483)
        org.obolibrary.robot.CommandLineHelper.updateInputOntology(CommandLineHelper.java:581)
LINENO: 2923981 - Missing '=' in trailing qualifier block. This might happen for not properly escaped '{', '}' chars in comments.
LINE: name: N(4)-{beta-D-GlcNAc-(1->2)-alpha-D-Man-(1->3)-[alpha-D-Man-(1->6)]-beta-D-Man-(1->4)-beta-D-GlcNAc-(1->4)-[alpha-D-Fuc-(1->6)]-beta-D-GlcNAc}-Asn residue
        org.obolibrary.oboformat.parser.OBOFormatParser.error(OBOFormatParser.java:1465)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseQual(OBOFormatParser.java:1207)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseZeroOrMoreQuals(OBOFormatParser.java:1196)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseQualifierBlock(OBOFormatParser.java:1186)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseUnquotedString(OBOFormatParser.java:1288)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseTermFrameClause(OBOFormatParser.java:622)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseTermFrameClauseEOL(OBOFormatParser.java:598)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseTermFrame(OBOFormatParser.java:572)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseEntityFrame(OBOFormatParser.java:539)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseOBODoc(OBOFormatParser.java:349)

cc @kltm

amalik01 commented 2 years ago

Thanks for pointing out the issue. We will try to get this fixed in ChEBI 2.0 (a project we are currently working on to redevelop ChEBI's ageing infrastructure). In the meantime, I have replaced the curly brackets in the ChEBI name with square brackets, which should temporarily fix the issue in next month's release.

matentzn commented 2 years ago

Can we get some information about what exactly ChEBI 2.0 is?

I am also forced to do quite a bit of debugging due to CHEBI having OBO format issues. I recommend adding some minimal CI, like a roundtrip through ROBOT, and a fastobo-validator (https://github.com/fastobo/fastobo-validator).

Parsing `mirror/chebi.owl.tmp.obo`
      Failed parsing `mirror/chebi.owl.tmp.obo`
              --> mirror/chebi.owl.tmp.obo:863257:14
               |
        863257 | xref: KEGG:C 2339 ␊
               |              ^---
               |
               = expected EOL or QuotedString

While you are working on ChEBI 2.0, would it be possible to patch the release files so they can be parsed?

cmungall commented 2 years ago

I strongly second @matentzn's suggestion. It would be very easy to add a fastobo validation check to the existing ChEBI infrastructure: just run the check prior to a release or during a snapshot.
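One way such a pre-release gate could look, as a rough sketch rather than a description of ChEBI's actual pipeline. It assumes `fastobo-validator` and `robot` are on the PATH, that `fastobo-validator` takes the OBO file as a positional argument, and that a round trip can be approximated with `robot convert --input ... --output ...`. The names `run_check`, `validate_release`, and `CHECKS` are hypothetical.

```python
import subprocess
import sys


def run_check(command):
    """Run one validation command; return True only if it exits with status 0."""
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        # The tool is not installed; treat that as a failed check.
        sys.stderr.write(f"command not found: {command[0]}\n")
        return False
    if result.returncode != 0:
        # Surface the tool's own diagnostics so the failing line is visible.
        sys.stderr.write(result.stdout + result.stderr)
    return result.returncode == 0


def validate_release(obo_path, checks):
    """Run every check against obo_path; return the names of the checks that failed."""
    return [name for name, build in checks.items() if not run_check(build(obo_path))]


# Assumed invocations; confirm them against each tool's --help output.
CHECKS = {
    "fastobo-validator": lambda path: ["fastobo-validator", path],
    "robot-roundtrip": lambda path: [
        "robot", "convert", "--input", path, "--output", path + ".roundtrip.owl",
    ],
}
```

Calling `validate_release("chebi.obo", CHECKS)` before publishing and aborting the release when the returned list is non-empty would catch parse failures like the ones reported in this thread.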

amalik01 commented 2 years ago

@matentzn The issue has now been fixed in the nightly OBO file (https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/nightly/chebi.obo). Hopefully when the new release is completed in the next few days, it will also be fixed in the monthly chebi.obo file (https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.obo).

amalik01 commented 2 years ago

We have just been awarded a 3-year BBSRC grant to redevelop ChEBI, so the back-end and front-end infrastructure will be redeveloped. A new annotation tool and submission tool will be built, and searching and ontology visualization will be improved. The current SOAP-based web services will be replaced by REST. We also plan to move away from commercial software such as Oracle to PostgreSQL.

matentzn commented 2 years ago

Great, thank you for addressing this! It would be great if you could work with the OBO community to introduce a CI testing system for the ontology, along the lines of what other ontologies implement. We would be happy to assist on OBO Slack!