geneontology / neo

noctua entity ontology

In some cases bad identifiers are getting into the load #112

Open kltm opened 1 year ago

kltm commented 1 year ago

In the most recent successful load, I noticed the following warning going by:

    20:35:45  2023-01-12 04:35:45,757 WARN  (OWLGraphWrapperExtended:936) Unable to retrieve the value of oboInOw#id as the identifier for http://identifiers.org/wormbase/T10C6.13%7CWB%3AF45F2.13%7CWB%3AZK131.3%7CWB%3AZK131.7%7CWB%3AK06C4.5%7CWB%3AZK131.2%7CWB%3AK06C4.13%7CWB%3AF17E9.10%7CWB%3AK03A1.1%7CWB%3AF08G2.3%7CWB%3AB0035.10%7CWB%3AF07B7.5%7CWB%3AF54E12.1%7CWB%3AF55G1.2%7CWB%3AF22B3.2; we will use an original iri as the identifier.

Nothing like this seems to be in the WB GPI. In fact, no GPI seems to have this, so it may be coming from a parsed GAF? Weird. Before digging in more, does this ring any bells @vanaukenk ?
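Percent-decoding the local part of that IRI makes its structure easier to see. Below is a minimal sketch using only the Python standard library. The IRI is abbreviated to the first three identifiers from the warning above, and the function name `split_packed_iri` is hypothetical, not something from the loader.

```python
from urllib.parse import unquote

PREFIX = "http://identifiers.org/wormbase/"

def split_packed_iri(iri, prefix=PREFIX):
    """Percent-decode the local part of an IRI and split it on '|'."""
    if not iri.startswith(prefix):
        raise ValueError("unexpected IRI prefix: " + iri)
    local = unquote(iri[len(prefix):])  # %7C -> '|', %3A -> ':'
    return local.split("|")

# Abbreviated to the first three identifiers from the warning.
iri = PREFIX + "T10C6.13%7CWB%3AF45F2.13%7CWB%3AZK131.3"
print(split_packed_iri(iri))
# ['T10C6.13', 'WB:F45F2.13', 'WB:ZK131.3']
```

Decoded this way, the "identifier" is really a pipe-separated list of WB identifiers that was treated as a single value.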

kltm commented 1 year ago

Okay, I take that back: I've found the source in the wb.gpi:

    bbop@wok:/home/skyhook/release/products/annotations$ zcat wb-src.gpi.gz | grep "F07B7.5"
    WB  WBGene00001923  his-49  HIStone CELE_F07B7.5    gene    taxon:6239  UniProtKB:P08898    
    WB  F07B7.5 his-49  HIStone CELE_F07B7.5    transcript  taxon:6239  WB:WBGene00001923       
    WB  CE03253 HIS-2   HIStone CELE_T10C6.13   protein taxon:6239  WB:T10C6.13|WB:F45F2.13|WB:ZK131.3|WB:ZK131.7|WB:K06C4.5|WB:ZK131.2|WB:K06C4.13|WB:F17E9.10|WB:K03A1.1|WB:F08G2.3|WB:B0035.10|WB:F07B7.5|WB:F54E12.1|WB:F55G1.2|WB:F22B3.2  UniProtKB:P08898|UniProtKB:K7ZUH9   

This is ringing a bell; I'm going to dig around to see if I can find a previous instance of this.
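To find other rows like the CE03253 one, something along these lines could scan a GPI file for entries whose parent column packs several identifiers together. This is only a sketch under assumptions: the file is taken to be tab-separated with the parent identifiers in the eighth column (as in the row above), and `naive_iri` is a guess at how a pipe-joined field could end up as the IRI in the warning, not the loader's actual code. Both function names are hypothetical.

```python
from urllib.parse import quote

PARENT_COL = 7  # 0-based index of the parent column; an assumption about the layout

def multi_valued_parents(lines, col=PARENT_COL):
    """Yield (object_id, parent_ids) for rows whose parent column holds several '|'-separated IDs."""
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) > col and "|" in fields[col]:
            yield fields[1], fields[col].split("|")

def naive_iri(parent_field, base="http://identifiers.org/wormbase/"):
    """Treat the whole field as one CURIE: drop the first prefix, percent-encode the rest."""
    _prefix, _, local = parent_field.partition(":")
    return base + quote(local, safe="")

# A shortened version of the CE03253 row, with tabs assumed as separators.
row = ("WB\tCE03253\tHIS-2\tHIStone\tCELE_T10C6.13\tprotein\ttaxon:6239\t"
       "WB:T10C6.13|WB:F45F2.13\tUniProtKB:P08898|UniProtKB:K7ZUH9\t\n")

for obj_id, parents in multi_valued_parents([row]):
    print(obj_id, parents)
    print(naive_iri("|".join(parents)))

# Against the real file, the same generator could be fed from
# gzip.open("wb-src.gpi.gz", "rt") instead of the one-row list.
```

If the guess in `naive_iri` is right, splitting the parent field on `|` before building IRIs would avoid the bad identifier, though whether a protein should have multiple parents at all is a separate question.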

vanaukenk commented 1 year ago

Interesting. This didn't ring any bells, but there are WB sequence identifiers buried in that string and when I check a few of them, I see that they correspond to genes that produce the exact same protein.

kltm commented 1 year ago

Hm, it looks like we've asked similar questions in the past and felt that it didn't matter much in the grand scheme of things: https://github.com/geneontology/neo/issues/88#issuecomment-1093598908 (note the WB identifier).

vanaukenk commented 1 year ago

Okay. The way C. elegans protein identifiers are assigned in WB right now, genes that ultimately produce a protein with the same amino acid sequence share a single protein id rather than each getting a unique one. If you think we need a better way of handling this, we can discuss some more.

kltm commented 1 year ago

@vanaukenk @pgaudet As we come up on a few months on this issue (and about a year since closing the variant https://github.com/geneontology/neo/issues/111), I was wondering whether we're just documenting this (as we did previously with https://github.com/geneontology/neo/issues/88#issuecomment-1105607070) or whether we're going to take the time to try to fix it this time around. I'm not sure how much of a problem this is in this case, or whether it's considered worth fixing right now.