geneontology / neo

noctua entity ontology

NEO no longer building #69

Closed kltm closed 3 years ago

kltm commented 3 years ago

Currently, the NEO ontology build fails with errors like:

11:34:02  [Fatal Error] :1:1: Content is not allowed in prolog.
11:34:02  2021-05-24 11:34:02,183 ERROR (CommandRunner:4815) could not parse:target/neo-wb.obo
 11:34:02  org.semanticweb.owlapi.io.UnparsableOntologyException: Problem parsing file:/var/lib/jenkins/workspace/ology_pipeline_issue-35-neo-test/neo/target/neo-wb.obo

For examination, I've temporarily grabbed that neo-wb.obo file and made it available here: http://skyhook.berkeleybop.org/neo-wb.obo

It seems like there may be a related WormBase issue (an expansion to the WB GPI that happened in the right timeframe), but I've been unable to find it again; in my notes I have "WormBase/website/issues/8222", but this doesn't seem to correspond to anything. @vanaukenk, would you maybe know the correct public reference for this?

Tagging @balhoff @vanaukenk

kltm commented 3 years ago

WB GPI upstream: ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/PRJNA13758/annotation/gene_product_info/c_elegans.PRJNA13758.current.gene_product_info.gpi.gz

Unlikely to be utf-8 related:

sjcarbon@moiraine:/tmp$:) iconv -f utf-8 -t ascii//TRANSLIT neo-wb.obo > shin-wb.obo
sjcarbon@moiraine:/tmp$:) md5sum neo-wb.obo shin-wb.obo 
798a9647ea81b7916ba216bd160c0891  neo-wb.obo
798a9647ea81b7916ba216bd160c0891  shin-wb.obo
sjcarbon@moiraine:/tmp$:) iconv -f utf-8 -t ascii//TRANSLIT c_elegans.PRJNA13758.current.gene_product_info.gpi > shin.gpi
sjcarbon@moiraine:/tmp$:) md5sum c_elegans.PRJNA13758.current.gene_product_info.gpi shin.gpi 
238945d16c1fb05f858c47f98e0a7bf1  c_elegans.PRJNA13758.current.gene_product_info.gpi
238945d16c1fb05f858c47f98e0a7bf1  shin.gpi
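
Identical checksums after the `iconv` round trip suggest both files were already plain ASCII. As a cross-check, here is a minimal Python sketch that reports which lines, if any, contain non-ASCII bytes. The function name `non_ascii_lines` is made up for illustration and is not part of the NEO pipeline.

```python
def non_ascii_lines(path):
    """Return (line_number, byte_offset) for the first non-ASCII byte on each affected line."""
    hits = []
    # Read as bytes so that no decoding step can hide or alter the raw content.
    with open(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            for offset, byte in enumerate(raw):
                if byte > 0x7F:  # anything above 0x7F is outside ASCII
                    hits.append((line_number, offset))
                    break
    return hits
```

An empty list for both `neo-wb.obo` and the GPI would agree with the checksum comparison above.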

balhoff commented 3 years ago

@kltm it would help to see more of the exception if possible. That looks like a typical OWL API parsing error where the actual error is usually way up in the trace.

kltm commented 3 years ago

@balhoff NP! A fuller picture of the error is here: https://gist.github.com/kltm/c8b51517817b2758d48c7d566e4f1403

balhoff commented 3 years ago

The problem is line 104844 (relationship: in_taxon missing taxon):

[Term]
id: WB:WBGene00010290
name: WBGene00010290 Cele
synonym: "WBGene00010290" BROAD []
synonym: "gene" RELATED []
synonym: "WBGene00010290" RELATED []
is_a: CHEBI:33695 ! information biomacromolecule
relationship: in_taxon 
property_value: https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine
property_value: https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/GeneProduct
relationship: has_gene_template UniProtKB:G5EBT2
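
To look for other stanzas with the same problem, something like the following rough Python sketch could scan an OBO file for `relationship:` lines that name a relation but no target. The function name and the regular expression are illustrative assumptions, not how the OWL API parses OBO.

```python
import re

# A relationship line should carry a relation ID followed by a target ID.
RELATIONSHIP = re.compile(r"^relationship:\s*(\S+)\s*(\S*)")


def find_incomplete_relationships(lines):
    """Return (line_number, text) for relationship lines that have no target ID."""
    hits = []
    for line_number, raw in enumerate(lines, start=1):
        match = RELATIONSHIP.match(raw)
        if not match:
            continue
        target = match.group(2)
        # An empty target, or one that is really the start of a "!" comment,
        # means the relation has nothing to point at.
        if not target or target.startswith("!"):
            hits.append((line_number, raw.rstrip()))
    return hits
```

Running it over the lines of `target/neo-wb.obo` should flag the `in_taxon` line above and any others like it.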

vanaukenk commented 3 years ago

Hi @kltm @balhoff

I just checked the line entry for WBGene00010290 in our source gpi and can't find an obvious error there:

WB WBGene00010290 nrap-1|CELE_F58H1.7 gene taxon:6239 UniProtKB:G5EBT2

Anything else we can check?

kltm commented 3 years ago

@vanaukenk Okay, looking at the GPI a little, I think the error may be in there, rather than in the GPI->OBO parser. This may be a little hard to follow, but there are three lines in the GPI for WB:WBGene00010290. The first one is the one that seems to be problematic; the other two are just instructive for what I think the issue might be.

What I am doing is listing these three lines and converting [TAB] to ^ so that the tabs are easily visible:

sjcarbon@moiraine:/tmp$:( zgrep WBGene00010290 c_elegans.PRJNA13758.current.gene_product_info.gpi.gz | tr " " "\#" | tr "\t" "\^"
WB^WBGene00010290^^nrap-1|CELE_F58H1.7^gene^taxon:6239^^UniProtKB:G5EBT2^
WB^F58H1.7a^^nrap-1|CELE_F58H1.7^transcript^taxon:6239^WB:WBGene00010290^^
WB^F58H1.7b^^nrap-1|CELE_F58H1.7^transcript^taxon:6239^WB:WBGene00010290^^

It seems like the problematic line (the first one) may be shifted as far as tabbing goes, with two tabs after the taxon (instead of one?) and one tab after that (instead of two?). Does this seem anomalous to you?

vanaukenk commented 3 years ago

Thanks @kltm

On closer inspection, it looks like this may not be a tabbing error but rather that these three lines are missing a value for the required column 3 entry, DB_Object_Symbol.

Looking at the underlying data in WB, there may be an explanation for this, but I'm going to check with our Hinxton team to see if it's correct and how we might be able to fix it.

I'll post again as soon as I have an answer.

kltm commented 3 years ago

@vanaukenk Okay, no problem. We can also experiment with 1) yanking this line and/or 2) seeing if there are other lines like this in your file.
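
For option 2, a minimal Python sketch of such a scan might look like this. The required-column positions are an assumption based on the column 3 (DB_Object_Symbol) observation above and should be adjusted to the GPI version in use; the function name is made up for illustration.

```python
import gzip
import sys

# Assumed 1-based positions of required columns; adjust for the GPI version in use.
REQUIRED_COLUMNS = {1: "DB", 2: "DB_Object_ID", 3: "DB_Object_Symbol"}


def find_missing_required(lines, required=REQUIRED_COLUMNS):
    """Yield (line_number, column_name, line) for each empty required field."""
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        # Skip blank lines and "!" header/comment lines.
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        for position, name in sorted(required.items()):
            if position > len(fields) or not fields[position - 1].strip():
                yield line_number, name, line


if __name__ == "__main__" and len(sys.argv) > 1:
    path = sys.argv[1]
    # Handle both gzipped and plain-text GPI files.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, name, line in find_missing_required(handle):
            print(f"{line_number}\t{name}\t{line}")
```

Pointing it at the downloaded `.gpi.gz` prints one row per empty required field, which would show whether WBGene00010290 is an isolated case.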

vanaukenk commented 3 years ago

@kltm We have a new, corrected file that just needs to be synced to our ftp site (should be coming very soon). If you can wait a bit to try again with that file, we can see how things go. We're also implementing a QC check on our end to make sure no required fields are missing. This was a puzzling bug; we're not entirely certain why that field value went missing for those entries....

kltm commented 3 years ago

This seems to be good now.