`obonet` is unable to parse the `definition` terms correctly in obo files

erikyao commented 3 years ago

Version

obonet==0.3.0

Related To

Priority

Low. Currently it's not an issue. Maybe an issue in the future.

Problem

The def field of an obo ontology has the format <def_string> [<dbxref>]. See GO.format.obo-1_4.html#S.2.2.

obonet does not split such a field into its two parts; it reads the whole value as a single string. E.g.

'"A ribonucleoprotein complex that contains an RNA molecule ..." [GOC:sgd_curators, PMID:10690410, PMID:14729943, PMID:7510714]'

However, the def fields within the current ChEBI obo file all have empty <dbxref> lists. Our current implementation trims them from the string values of def fields. E.g.

 '"A macrocyclic lactone with a ring of twelve or more members derived from a polyketide." []'

will be trimmed to

'A macrocyclic lactone with a ring of twelve or more members derived from a polyketide.'

Note that the double quotes around the definition text are also removed.

Our current implementation cannot handle any def field with a non-empty <dbxref> list.
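
For illustration, here is a minimal sketch of how a def value could be split into its text and its <dbxref> list. This is not our current implementation and not part of obonet; the name `parse_def` is hypothetical. It assumes the dbxrefs are comma-separated, with no commas inside any one of them and no trailing modifiers after the closing bracket.

```python
import re

# Matches: "<def_string>" [<dbxref>, <dbxref>, ...]
# The quoted part may contain backslash-escaped characters such as \".
_DEF_PATTERN = re.compile(
    r'^\s*"(?P<text>(?:[^"\\]|\\.)*)"\s*\[(?P<xrefs>.*?)\]\s*$'
)


def parse_def(raw):
    """Split a raw def value into (definition_text, list_of_dbxrefs).

    Hypothetical helper. Assumes no dbxref contains a comma.
    """
    match = _DEF_PATTERN.match(raw)
    if match is None:
        # Unexpected shape: return the raw value untouched, with no xrefs.
        return raw, []
    # Unescape only \" and \\ ; any other escape sequence is left as-is.
    text = re.sub(r'\\(["\\])', r'\1', match.group("text"))
    xrefs = [x.strip() for x in match.group("xrefs").split(",") if x.strip()]
    return text, xrefs
```

With the two examples above, the GO-style value would yield the definition text plus a list of four dbxrefs, and the ChEBI value would yield the definition text plus an empty list.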

Solution

pronto is another library for reading obo files. It is more heavyweight than obonet, yet lower-level. It has a clear class hierarchy but is not well documented. ProntoOntologyReader.py is an alternative implementation to the OntologyReader in chebi_parser.py.

Performance-wise:

- pronto is about 4 times slower than obonet
  - E.g. with rel201/chebi_lite.obo, ProntoOntologyReader uses ~150 seconds to load the file and generate all 146,183 documents, while our implementation with obonet uses only ~30 seconds.
- pronto uses slightly more memory than obonet
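
A comparison like this can be reproduced with a small harness. The sketch below is an assumption about how one might measure it, not the script used for the numbers above; `measure` is a hypothetical helper that takes any loader callable (for example, a function that wraps one of the two readers).

```python
import time
import tracemalloc


def measure(loader, *args):
    """Run loader(*args); return (result, elapsed_seconds, peak_bytes).

    Hypothetical helper. tracemalloc only sees memory allocated through
    Python's allocator and adds overhead, so treat the timings as relative.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = loader(*args)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak
```

Running both readers through the same helper on the same file keeps the two measurements comparable, though the tracing overhead means absolute times will be higher than in normal use.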

We can also watch for an obonet update that addresses this issue.

newgene commented 1 year ago

@DylanWelzel obonet is now on v1.0.0. Let's re-evaluate whether we still need to keep our local fix for this issue.

DylanWelzel commented 1 year ago

obonet v1.0.0 still does not correctly parse the def field. The local fix will stay, but I've updated the obonet version to v1.0.0 in the requirements.

biothings / mychem.info