geneontology / neo

noctua entity ontology

Trace entities that do not create nice URIs (i.e. ones that compact well to CURIEs in our pipeline) in NEO #88

Closed kltm closed 2 years ago

kltm commented 2 years ago

Recently (https://github.com/geneontology/neo/issues/82#issuecomment-1090933309), we noticed a number of oddities in NEO.

In the newest NEO load (and maybe some of these are in the older one too), some entities were not correctly converted to CURIEs: 1350337 in total. Some of those are probably not practically important, as nobody would be curating to them, but some seem important.

We would like to trace these back to their source files and try to figure out what is going on.

Important-seeming anomalies (URI pattern : count):

http://purl.obolibrary.org/obo/AGI_LocusCode_XYZ : 28986
http://identifiers.org/wormbase/XYZ : 152
http://identifiers.org/uniprot/XYZ : 49
http://purl.bioontology.org/ontology/provisional/XYZ : 17
http://identifiers.org/mgi/MGI:XYZ : 4
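
For anyone wanting to reproduce a tally like the one above, here is a minimal Python sketch. It is not the pipeline's actual code: the one-entry prefix map is a hypothetical stand-in for whatever map the loader really uses, and `compact`, `pattern_of`, and `tally_uncompacted` are names made up for illustration.

```python
import re
from collections import Counter

# Hypothetical prefix map; the real pipeline's map is not shown in this thread.
PREFIX_MAP = {"GO": "http://purl.obolibrary.org/obo/GO_"}


def compact(uri, prefix_map=PREFIX_MAP):
    """Return a CURIE if some base URI in the map matches, else None."""
    # Try longer bases first so the most specific prefix wins.
    for prefix, base in sorted(prefix_map.items(), key=lambda kv: -len(kv[1])):
        if uri.startswith(base) and len(uri) > len(base):
            return f"{prefix}:{uri[len(base):]}"
    return None


def pattern_of(uri):
    """Collapse the trailing local identifier to XYZ so similar URIs group together."""
    if not uri.startswith("http"):
        return uri  # bare labels such as trunk_part_of are left untouched
    return re.sub(r"[^/#_:]+$", "XYZ", uri)


def tally_uncompacted(uris):
    """Count, per pattern, the URIs that could not be compacted."""
    counts = Counter()
    for uri in uris:
        if compact(uri) is None:
            counts[pattern_of(uri)] += 1
    return counts
```

Feeding this every entity IRI in the load and calling `tally_uncompacted(...).most_common()` would give a pattern/count listing in the same shape as the one above, assuming the prefix map matches what the pipeline uses.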

Samples of complete list:

alters_location_of
anastomoses_with
anteriorly_connected_to
attached_to
channel_for
channels_from
...
synapsed_by
Tmp_new_group
transitively_anteriorly_connected_to
...
transitively_proximally_connected_to
trunk_part_of
TS01
...
TS28
xunion_of
http://identifiers.org/mgi/MGI:106910
http://identifiers.org/uniprot/A0A5F9CQZ0
http://identifiers.org/wormbase/B0035.8%7CWB%3AF54E12.4%7CWB%3AF55G1.3%7CWB%3AH02I12.6
http://purl.bioontology.org/ontology/provisional/1ddd2e2d-2ace-4c87-8ec6-d3b5730b3e7c
http://purl.obolibrary.org/obo/D96882F1-8709-49AB-BCA9-772A67EA6C33
http://semanticscience.org/resource/SIO_000658
http://www.geneontology.org/formats/oboInOwl#Subset
http://www.w3.org/2002/07/owl#topObjectProperty
http://xmlns.com/foaf/0.1/image
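
The wormbase sample in that list is percent-encoded, which hides what the local identifier actually is. A small Python sketch for decoding it (the helper name `decode_local_id` is made up here, and splitting on the last `/` is an assumption that happens to fit the identifiers.org samples):

```python
from urllib.parse import unquote


def decode_local_id(uri):
    """Return the percent-decoded part of the URI after the last '/'."""
    return unquote(uri.rsplit("/", 1)[1])


uri = ("http://identifiers.org/wormbase/"
       "B0035.8%7CWB%3AF54E12.4%7CWB%3AF55G1.3%7CWB%3AH02I12.6")
print(decode_local_id(uri))
# prints: B0035.8|WB:F54E12.4|WB:F55G1.3|WB:H02I12.6
```

So `%7C` is a pipe and `%3A` is a colon: the "identifier" is really a pipe-separated list of WB IDs with the first prefix stripped.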

One spin-off from this for MGI is here: https://github.com/geneontology/go-annotation/issues/4105

This is not considered a blocking issue for moving forward with this project or for closing it. If the project is closed before this is completed, we can bump this issue over to the QC one.

kltm commented 2 years ago

AGI_LocusCode

gene_association.tair.gz contains AGI_LocusCode. Like a lot.

UniProtKB

At least some of the anomalous UniProtKBs seem to be exclusively in col 8 in uniprot_reviewed.gpi.gz. Not sure why being in a different column would throw this off...possibly due to "has_gene_template" in gpi2obo.pl?
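
One way to check the "exclusively in col 8" hunch is sketched below in Python. It assumes a tab-separated GPI 1.x layout where col 1 is the DB, col 2 the object ID, and col 8 the parent object ID, with `!` marking header lines; the function names are made up for illustration and this is not part of gpi2obo.pl.

```python
import gzip


def parent_only_ids(lines):
    """Return IDs seen in col 8 (parent) that never occur as a col 1 + col 2 object."""
    objects, parents = set(), set()
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # malformed or truncated row
        objects.add(f"{cols[0]}:{cols[1]}")
        # Col 8 may hold several pipe-separated parents.
        parents.update(p for p in cols[7].split("|") if p)
    return parents - objects


def parent_only_ids_in_file(path):
    """Same check, reading a gzipped GPI file such as uniprot_reviewed.gpi.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return parent_only_ids(handle)
```

If the anomalous UniProtKB IDs show up in the result of `parent_only_ids_in_file("uniprot_reviewed.gpi.gz")`, that would support the idea that they only enter NEO through the parent column.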

MGI

MGI spoken for at https://github.com/geneontology/go-annotation/issues/4105

WB

Traced anomaly back to c_elegans.PRJNA13758.current.gene_product_info.gpi.gz:

WB CE05165 HIS-48 HIStone CELE_B0035.8 protein taxon:6239 WB:B0035.8|WB:F54E12.4|WB:F55G1.3|WB:H02I12.6 UniProtKB:Q27876

It appears that a parser is taking "WB:B0035.8|WB:F54E12.4|WB:F55G1.3|WB:H02I12.6" and trying to turn the "B0035.8|WB:F54E12.4|WB:F55G1.3|WB:H02I12.6" part into a single identifier. That's pretty wild. How/why is GPI column 8 getting parsed? It looks like gpi2obo.pl reads it into parent and then writes it to OBO as relationship: has_gene_template $parent. The code seems wrong there, but I'm not familiar enough with the OBO format and the intention here to make a call on whether the extra values should be dropped or split.
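
For illustration only, here is what the "split" option could look like, as a Python sketch rather than a patch to gpi2obo.pl (which is Perl). The function name `parent_relationships` is made up, and whether splitting or dropping is the right behavior is not settled in this thread.

```python
def parent_relationships(parent_field, relation="has_gene_template"):
    """Emit one OBO relationship line per pipe-separated parent ID."""
    parents = [p.strip() for p in parent_field.split("|") if p.strip()]
    return [f"relationship: {relation} {p}" for p in parents]


col8 = "WB:B0035.8|WB:F54E12.4|WB:F55G1.3|WB:H02I12.6"
for line in parent_relationships(col8):
    print(line)
```

With the col 8 value from the row above, this yields four separate `relationship: has_gene_template WB:...` lines instead of one line whose target is the whole pipe-joined string.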

kltm commented 2 years ago

From the managers' discussion: the important things have been traced and documented, so this is now closed.