biopragmatics / pyobo

📛 A Python package for using ontologies, terminologies, and biomedical nomenclatures
https://pyobo.readthedocs.io
MIT License

hgnc ingest fails with illegal prefix? #123

Closed matentzn closed 2 years ago

matentzn commented 2 years ago

Is this a good reason to fail the entire hgnc.py pipeline?

pyobo.identifier_utils.MissingPrefix: unhandled prefix ogms/OMRE found in curie ogms/OMRE:0000137

It could be, but then we would need some idea of how to fix it. What does OGMS have to do with HGNC?

@kevinschaper bcc

cthoyt commented 2 years ago

Until now I have had most of the PyOBO pipelines fail if there's anything unexpected, since this motivates either updating the code, curating the Bioregistry further, or adding more rewrite rules to PyOBO's config file.

cthoyt commented 2 years ago

I just re-ran the HGNC converter and did not get this error. What code did you use that resulted in this?

matentzn commented 2 years ago

(@kevinschaper reported this in the Monarch slack, so we will wait until he can answer)

cthoyt commented 2 years ago

Like I said, I wasn't able to reproduce this from the HGNC converter, but I did find in AERO that there was a reference like this. I added it to the Bioregistry.

@kevinschaper if this was popping up during a full database build, you can pass -x to turn off raising exceptions. It's very strict by default in order to prompt additional curation and promote 100% data integrity before shipping each new version of the database (btw, this is an incredibly time-intensive process; curating the mess is very hard).
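The strict-by-default behavior discussed in this thread can be sketched roughly as follows. This is a self-contained illustration, not PyOBO's actual code: `parse_curie`, `KNOWN_PREFIXES`, and `REWRITE_RULES` are hypothetical names, the local `MissingPrefix` class only stands in for the exception in the traceback, and the `ogms/OMRE` to `ogms` mapping is made up for the example rather than a curated fact.

```python
class MissingPrefix(ValueError):
    """Hypothetical stand-in for the exception shown in the traceback."""


# Hypothetical data: a real pipeline would consult a registry and a config file.
KNOWN_PREFIXES = {"hgnc", "ogms"}
REWRITE_RULES = {"ogms/OMRE": "ogms"}  # illustrative mapping only


def parse_curie(curie, *, strict=True, rewrites=None):
    """Split a CURIE into (prefix, identifier), validating the prefix.

    Returns None for an unknown prefix when strict is False.
    """
    prefix, sep, identifier = curie.partition(":")
    if not sep:
        raise ValueError(f"not a CURIE: {curie}")
    # Apply a rewrite rule first, if one matches the raw prefix.
    prefix = (rewrites or {}).get(prefix, prefix)
    if prefix.lower() not in KNOWN_PREFIXES:
        if strict:
            # Strict mode: stop the whole run so the gap gets curated.
            raise MissingPrefix(f"unhandled prefix {prefix} found in curie {curie}")
        # Lenient mode: skip the reference and keep going.
        return None
    return prefix.lower(), identifier


print(parse_curie("hgnc:5"))
print(parse_curie("ogms/OMRE:0000137", strict=False))
print(parse_curie("ogms/OMRE:0000137", rewrites=REWRITE_RULES))
```

In this sketch, strict mode turns a single unknown prefix into a hard failure, the lenient mode drops the reference silently, and a rewrite rule fixes it at the source, which is the trade-off between the three options mentioned above.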

cthoyt commented 2 years ago

I'm going to close this issue, since it's been a few weeks with no comment from Kevin. Feel free to re-open it if you have more to add.

kevinschaper commented 2 years ago

Sorry @cthoyt - I totally missed the GitHub notification on this the first time. I was doing a full database build; I'll try it again and re-open if I hit the same problem.

cthoyt commented 2 years ago

Alright, sounds good. I'm doing some major work to improve the database build in #129. Throughout the process, I've found all sorts of other issues that needed fixing along the way.

I'd love to know why you're doing this yourself (besides wanting a more up-to-date version) - the results of this build get sent to Zenodo and are listed on http://biolookup.io/downloads. The new version of Biolookup, which will be released after I finish #129, will also have the synonyms, xrefs, and relationships (see https://github.com/biopragmatics/biolookup/pull/10).

kevinschaper commented 2 years ago

I think I was wandering down a path of off-label usage - looking at whether we could take advantage of PyOBO's parsing as part of the Monarch Ingest, so I was poking around trying to export a whole source ("hgnc.obo"?), or maybe iterate over each entity returned by a source.

cthoyt commented 2 years ago

Not sure why you'd want to go through the whole database build to get a single ontology, but also you probably noticed that this package isn't exactly set up for end-users to access all of the low-level functionality (there specifically aren't even docs besides the README at the moment). Here's some example code for what you described, though:


import pyobo

ontology = pyobo.get_ontology("hgnc")

# Iterate over terms in the ontology
for term in ontology:
    print(term.prefix, term.identifier, term.name, term.synonyms)  # and more

# Dump it as OBO
ontology.write_obo("hgnc.obo")

# Convert OBO to OWL (requires ROBOT on PATH)
from pyobo.utils.misc import obo_to_owl
obo_to_owl("hgnc.obo", "hgnc.owl")