Closed kltm closed 3 years ago
WB GPI upstream: ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/PRJNA13758/annotation/gene_product_info/c_elegans.PRJNA13758.current.gene_product_info.gpi.gz
Unlikely to be utf-8 related:
sjcarbon@moiraine:/tmp$:) iconv -f utf-8 -t ascii//TRANSLIT neo-wb.obo > shin-wb.obo
sjcarbon@moiraine:/tmp$:) md5sum neo-wb.obo shin-wb.obo
798a9647ea81b7916ba216bd160c0891 neo-wb.obo
798a9647ea81b7916ba216bd160c0891 shin-wb.obo
sjcarbon@moiraine:/tmp$:) iconv -f utf-8 -t ascii//TRANSLIT c_elegans.PRJNA13758.current.gene_product_info.gpi > shin.gpi
sjcarbon@moiraine:/tmp$:) md5sum c
c_elegans.PRJNA13758.current.gene_product_info.gpi
config-err-CDbXVN
sjcarbon@moiraine:/tmp$:) md5sum c_elegans.PRJNA13758.current.gene_product_info.gpi shin.gpi
238945d16c1fb05f858c47f98e0a7bf1 c_elegans.PRJNA13758.current.gene_product_info.gpi
238945d16c1fb05f858c47f98e0a7bf1 shin.gpi
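The identical checksums above suggest both files were already plain ASCII, since transliterating them changed nothing. A more direct check, which would also report where any non-ASCII bytes sit, might look like the minimal Python sketch below. The function name `non_ascii_lines` is made up for illustration, and the sample data is hypothetical.

```python
def non_ascii_lines(lines):
    """Return (line_number, raw_line) for each bytes line holding a non-ASCII byte."""
    hits = []
    for number, raw in enumerate(lines, start=1):
        # bytes.isascii() is available from Python 3.7 onwards.
        if not raw.isascii():
            hits.append((number, raw))
    return hits


# Hypothetical sample; in practice pass a file opened in binary mode,
# e.g. open("neo-wb.obo", "rb"), which iterates line by line.
sample = [b"id: WB:WBGene00010290\n", b"name: caf\xc3\xa9\n"]
print(non_ascii_lines(sample))
```

An empty result would agree with the md5sum comparison: nothing in the file needs transliterating.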
@kltm it would help to see more of the exception if possible. That looks like a typical OWL API parsing error where the actual error is usually way up in the trace.
@balhoff NP! A more full picture of the error here: https://gist.github.com/kltm/c8b51517817b2758d48c7d566e4f1403
The problem is line 104844, where relationship: in_taxon is missing its taxon:
[Term]
id: WB:WBGene00010290
name: WBGene00010290 Cele
synonym: "WBGene00010290" BROAD []
synonym: "gene" RELATED []
synonym: "WBGene00010290" RELATED []
is_a: CHEBI:33695 ! information biomacromolecule
relationship: in_taxon
property_value: https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine
property_value: https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/GeneProduct
relationship: has_gene_template UniProtKB:G5EBT2
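To see whether other stanzas have the same problem, one could scan the OBO file for `relationship:` tags that name a relation but no target. The following is a rough Python sketch, not part of the NEO build; the function name `incomplete_relationships` is hypothetical, and it assumes a well-formed relationship tag always has at least two whitespace-separated tokens (relation and target) before any `!` comment.

```python
def incomplete_relationships(lines):
    """Return (line_number, text) for 'relationship:' tags lacking a target ID."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text.startswith("relationship:"):
            continue
        # Drop any trailing "! comment" before counting tokens.
        value = text[len("relationship:"):].split("!", 1)[0]
        # ASSUMPTION: a complete tag has a relation followed by a target.
        if len(value.split()) < 2:
            problems.append((number, text))
    return problems


# Part of the stanza quoted above, used as in-memory test data.
stanza = [
    "[Term]",
    "id: WB:WBGene00010290",
    "is_a: CHEBI:33695 ! information biomacromolecule",
    "relationship: in_taxon",
    "relationship: has_gene_template UniProtKB:G5EBT2",
]
print(incomplete_relationships(stanza))
```

Passing an open text file handle for neo-wb.obo instead of the list would report line numbers within the real file.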
Hi @kltm @balhoff
I just checked the line entry for WBGene00010290 in our source gpi and can't find an obvious error there:
WB WBGene00010290 nrap-1|CELE_F58H1.7 gene taxon:6239 UniProtKB:G5EBT2
Anything else we can check?
@vanaukenk
Okay, looking at the GPI a little, I think the error may be in there, rather than in the GPI->OBO parser. This may be a little hard to follow, but there are three lines in the GPI for WB:WBGene00010290. The first one is the one that seems to be problematic; the other two are just instructive for what I think the issue might be.
What I am doing is listing these three lines and converting [TAB] to ^ so that they are easily visible:
sjcarbon@moiraine:/tmp$:( zgrep WBGene00010290 c_elegans.PRJNA13758.current.gene_product_info.gpi.gz | tr " " "\#" | tr "\t" "\^"
WB^WBGene00010290^^nrap-1|CELE_F58H1.7^gene^taxon:6239^^UniProtKB:G5EBT2^
WB^F58H1.7a^^nrap-1|CELE_F58H1.7^transcript^taxon:6239^WB:WBGene00010290^^
WB^F58H1.7b^^nrap-1|CELE_F58H1.7^transcript^taxon:6239^WB:WBGene00010290^^
It seems like the problematic line (the first one) may be shifted as far as tabbing goes, with two tabs after the taxon (instead of one?) and one tab after that (instead of two?). Does this seem anomalous to you?
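The same comparison can be made column by column rather than by eye. The Python sketch below renders tabs as `^` and lists which column positions are empty. The helper names are hypothetical, and the two sample lines are reconstructed from the `^` output above on the assumption that each `^` stands for exactly one tab.

```python
def show_tabs(line, marker="^"):
    """Render a tab-separated line with each tab replaced by a visible marker."""
    return line.rstrip("\n").replace("\t", marker)


def empty_columns(line):
    """Return the 1-based positions of empty fields in a tab-separated line."""
    fields = line.rstrip("\n").split("\t")
    return [position for position, value in enumerate(fields, start=1) if value == ""]


# ASSUMPTION: reconstructed from the output above, one tab per "^".
GENE_LINE = "WB\tWBGene00010290\t\tnrap-1|CELE_F58H1.7\tgene\ttaxon:6239\t\tUniProtKB:G5EBT2\t\n"
TRANSCRIPT_LINE = "WB\tF58H1.7a\t\tnrap-1|CELE_F58H1.7\ttranscript\ttaxon:6239\tWB:WBGene00010290\t\t\n"

for line in (GENE_LINE, TRANSCRIPT_LINE):
    print(show_tabs(line), empty_columns(line))
```

Comparing the lists of empty positions across lines makes it easier to tell a shifted column from a value that is simply absent.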
Thanks @kltm
On closer inspection, it looks like this may not be a tabbing error but rather that these three lines are missing a value for the required column 3 entry, DB_Object_Symbol.
Looking at the underlying data in WB, there may be an explanation for this, but I'm going to check with our Hinxton team to see if it's correct and how we might be able to fix it.
I'll post again as soon as I have an answer.
@vanaukenk Okay, no problem. We can also experiment with 1) yanking this line and/or 2) seeing if there are other lines like this in your file.
@kltm We have a new, corrected file that just needs to be synced to our FTP site (should be coming very soon). If you can wait a bit to try again with that file, we can see how things go. We're also implementing a QC check on our end to make sure no required fields are missing. This was a puzzling bug; we're not entirely certain why that field value went missing for those entries.
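A required-field check of the kind described could be sketched roughly as follows. This is not WormBase's actual QC code. The set of required columns is an assumption based on one reading of the GPI 1.2 column layout and should be checked against the spec version the file declares; the function name and the sample lines are hypothetical apart from the gene line reconstructed from earlier in this thread.

```python
# ASSUMPTION: required columns (1-based) under one reading of GPI 1.2.
# Adjust this mapping to match the GPI version the file actually declares.
REQUIRED_COLUMNS = {
    1: "DB",
    2: "DB_Object_ID",
    3: "DB_Object_Symbol",
    6: "DB_Object_Type",
    7: "Taxon",
}


def missing_required(lines, required=REQUIRED_COLUMNS):
    """Return (line_number, object_id, [missing column names]) per offending line."""
    report = []
    for number, line in enumerate(lines, start=1):
        # Skip header/comment lines (leading "!") and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        missing = [
            name
            for position, name in sorted(required.items())
            if position > len(fields) or fields[position - 1].strip() == ""
        ]
        if missing:
            object_id = fields[1] if len(fields) > 1 else ""
            report.append((number, object_id, missing))
    return report


sample = [
    "!hypothetical header line\n",
    # Gene line reconstructed from the thread (one tab per "^").
    "WB\tWBGene00010290\t\tnrap-1|CELE_F58H1.7\tgene\ttaxon:6239\t\tUniProtKB:G5EBT2\t\n",
    # Hypothetical well-formed line with every assumed-required column filled.
    "WB\tWBGene99999999\txyz-1\t\tXYZ1\tgene\ttaxon:6239\t\t\t\n",
]
print(missing_required(sample))
```

Failing the export when the report is non-empty would stop a file like this one from reaching downstream consumers.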
This seems to be good now.
Currently, the NEO ontology build fails with errors like:
For examination, I've temporarily grabbed that neo-wb.obo file and made it available here: http://skyhook.berkeleybop.org/neo-wb.obo
It seems like there may be a related WormBase issue (an expansion to the WB GPI that happened in the right timeframe), but I've been unable to find it again; in my notes I have "WormBase/website/issues/8222", but this doesn't seem to correspond to anything. @vanaukenk, would you maybe know the correct public reference for this?
Tagging @balhoff @vanaukenk