Closed kltm closed 2 years ago
AGI_LocusCode
gene_association.tair.gz contains AGI_LocusCode. Like a lot.
UniProtKB
At least some of the anomalous UniProtKBs seem to be exclusively in col 8 in uniprot_reviewed.gpi.gz. Not sure why being in a different column would throw this off...possibly due to "has_gene_template" in gpi2obo.pl?
MGI
MGI spoken for at https://github.com/geneontology/go-annotation/issues/4105
WB
Traced anomaly back to c_elegans.PRJNA13758.current.gene_product_info.gpi.gz:
WB CE05165 HIS-48 HIStone CELE_B0035.8 protein taxon:6239 WB:B0035.8|WB:F54E12.4|WB:F55G1.3|WB:H02I12.6 UniProtKB:Q27876
It appears that a parser is taking "WB:B0035.8|WB:F54E12.4|WB:F55G1.3|WB:H02I12.6" and trying to turn the "B0035.8|WB:F54E12.4|WB:F55G1.3|WB:H02I12.6" part into an identifier. That's pretty wild. How/why is GPI column 8 getting parsed? Looks like gpi2obo.pl, and it would go into parent and then OBO as relationship: has_gene_template $parent. The code seems wrong there, but I'm not familiar enough with the OBO format and the intention here to make a call on whether that should be dropped or split.
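To make the "split" option concrete, here is a minimal Python sketch (not the actual gpi2obo.pl logic, which is Perl and is not reproduced here). It assumes a tab-separated GPI row in which column 8 holds pipe-separated parent IDs, as in the WB line above. The function names `naive_local_id` and `split_parent_ids` are hypothetical, introduced only for illustration.

```python
from typing import List

# Example row modelled on the WB line quoted above; tab-separated here,
# which is an assumption about the original file (the quote shows spaces).
GPI_ROW = "\t".join([
    "WB", "CE05165", "HIS-48", "HIStone", "CELE_B0035.8", "protein",
    "taxon:6239",
    "WB:B0035.8|WB:F54E12.4|WB:F55G1.3|WB:H02I12.6",
    "UniProtKB:Q27876",
])


def naive_local_id(gpi_row: str) -> str:
    """Mimic the suspected failure: treat all of column 8 as one CURIE."""
    parent = gpi_row.rstrip("\n").split("\t")[7]
    # Everything after the first colon becomes the "local ID",
    # including the pipes and the remaining prefixed IDs.
    _prefix, _sep, local = parent.partition(":")
    return local


def split_parent_ids(gpi_row: str) -> List[str]:
    """Split column 8 on '|' so each parent ID is handled separately."""
    cols = gpi_row.rstrip("\n").split("\t")
    if len(cols) < 8:
        return []
    return [p for p in cols[7].split("|") if p]


if __name__ == "__main__":
    print(naive_local_id(GPI_ROW))
    for parent in split_parent_ids(GPI_ROW):
        # One relationship line per parent, if splitting is the intent.
        print(f"relationship: has_gene_template {parent}")
```

Whether one has_gene_template relationship per parent is the right output, or the column should be dropped, is still the open question above; the sketch only shows what splitting would look like.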
From managers' discussion, important things traced/docced--this is now closed.
Recently (https://github.com/geneontology/neo/issues/82#issuecomment-1090933309), we noticed a number of oddities in NEO.
In the newest NEO load (and maybe some of these are in the older one), some entities were not correctly converted to CURIEs: 1350337 in total. Some of those are probably not practically important, as nobody would be curating to them, but some seem important.
We would like to trace these back to their source files and try and figure out what is going on.
Important-seeming anomalies:
Samples of complete list:
One spin-off from this for MGI is here: https://github.com/geneontology/go-annotation/issues/4105
This is not considered a blocking issue for moving forward with this project and closing it. If closed before completing this project, we can bump this over to the QC one.