geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

SGD IBA influx at source causes false positive(?) in sanity checks #2371

Open kltm opened 1 week ago

kltm commented 1 week ago

During the most recent snapshot run, we failed on an SGD sanity check.

Essentially, the SGD source file had 152076 annotations and the final file had 65609, a reduction of 86467 (about 57%). This large reduction triggered a failsafe (good!).

Looking into it, I currently believe the issue is with IBAs.

The line count of filtered incoming IBAs is about 100477; the line count of injected IBAs is about 15330; that net loss of roughly 85k lines would account for the bulk of the drop.

As GOC is the canonical source, we're doing the right thing here and we can (and temporarily will) suppress the SGD sanity check, but the IBA noise does limit the use of this primitive check.
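To make the arithmetic concrete, here is a rough sketch using only the counts quoted in this comment. The variable names are made up for illustration, and the IBA counts are approximate, so treat the remainder as an estimate rather than an exact figure.

```python
# Counts quoted in this thread (the IBA figures are approximate).
source_total = 152076   # annotations in the SGD source file
final_total = 65609     # annotations in the final file
iba_filtered = 100477   # upstream IBA lines filtered out
iba_injected = 15330    # IBA lines injected from the canonical source

# Overall drop between source and final files.
total_drop = source_total - final_total

# Net change attributable to swapping upstream IBAs for injected ones.
net_iba_drop = iba_filtered - iba_injected

# Whatever is left over is the drop not explained by the IBA swap.
non_iba_drop = total_drop - net_iba_drop

# Share of the overall drop explained by the IBA swap.
iba_share = net_iba_drop / total_drop

print(f"total drop:      {total_drop}")
print(f"net IBA drop:    {net_iba_drop}")
print(f"non-IBA drop:    {non_iba_drop}")
print(f"IBA share of drop: {iba_share:.1%}")
```

On these numbers, nearly all of the reduction comes from the IBA swap, which is why the check fires even though the filtering and injection are behaving as intended.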

Tagging @pgaudet @suzialeksander

kltm commented 1 week ago

Looking around, this has been an "issue" since around the last release; I'm guessing it is related to new code in one way or another. There has also been a reduction in the SGD upstream size. I'm honestly not sure how the sanity checks have not been triggered before this. I'm going to pause the current snapshot attempt for the moment, waiting for feedback.

kltm commented 1 week ago

The reason it specifically seems to have ticked over into failure is that it crossed over the 50% reduction mark.
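For reference, a minimal sketch of what a 50% reduction check could look like. This is not the pipeline's actual code: the function names, the default threshold argument, and the choice of a strict `>` comparison are all assumptions for illustration.

```python
def reduction_fraction(source_count: int, final_count: int) -> float:
    """Fraction of source annotations that are missing from the final file."""
    if source_count <= 0:
        raise ValueError("source_count must be positive")
    return (source_count - final_count) / source_count


def fails_sanity_check(source_count: int, final_count: int,
                       threshold: float = 0.5) -> bool:
    """Hypothetical check: fail when the reduction exceeds the threshold.

    Whether the real check uses > or >= is not stated in this thread.
    """
    return reduction_fraction(source_count, final_count) > threshold


# SGD counts from this issue: the reduction is a bit under 57%, so this fails.
sgd_fails = fails_sanity_check(152076, 65609)
print(sgd_fails)
```

A check like this only sees the two totals, so it cannot tell an expected IBA swap apart from a genuine loss of annotations.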

dustine32 commented 1 week ago

Checking the SGD report, this could be due to recent changes in ID checking code:

WARNING - Invalid identifier: GORULE:0000027: 2144215 does not match any id_syntax patterns for MGI in dbxrefs (MGI:MGI:2144215) -- SGD S000005027 SAL1 enables GO:0005347 GO_REF:0000033 IBA MGI:MGI:2144215 F ADP/ATP transporter YNL083W|Ca(2+)-binding ATP:ADP antiporter SAL1 protein taxon:559292 20231109 GO_Central UniProtKB:D6W196

The warning message points to matching 2144215 against the MGI regex pattern MGI:[0-9]{5,}, which wouldn't be valid: https://github.com/geneontology/go-site/blob/7869016e9303fd5e7840cb1d7a2d09272aaf4d36/metadata/db-xrefs.yaml#L1616 That said, the report lists only 8361 lines, and they are just WARNINGs, so not necessarily dropped lines. I also haven't really confirmed this in a debugger. @mugitty would you be able to debug these SGD IBA lines? This may not be the cause; it's mainly a hunch. What do you think?
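A quick way to see the mismatch outside the pipeline is the sketch below. It assumes the validator matches the whole local ID against the pattern quoted above, and it guesses at how the ID might have been split; neither is confirmed against the actual checking code.

```python
import re

# Pattern as quoted in the comment above (from db-xrefs.yaml for MGI).
MGI_PATTERN = re.compile(r"MGI:[0-9]{5,}")


def matches_mgi_syntax(local_id: str) -> bool:
    """True if the whole local ID matches the MGI id_syntax pattern.

    Using fullmatch is an assumption; the real validator may anchor differently.
    """
    return MGI_PATTERN.fullmatch(local_id) is not None


curie = "MGI:MGI:2144215"

# Splitting on the first colon keeps the inner "MGI:" prefix in the local ID.
local_first_split = curie.split(":", 1)[1]    # "MGI:2144215"

# Splitting on the last colon leaves only the digits, as seen in the warning.
local_last_split = curie.rsplit(":", 1)[1]    # "2144215"

print(local_first_split, matches_mgi_syntax(local_first_split))
print(local_last_split, matches_mgi_syntax(local_last_split))
```

If the checker ends up with the bare digits, the pattern cannot match, which would be consistent with the warning text. That is still just a hypothesis until someone steps through the real code.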

kltm commented 1 week ago

@dustine32 It's not so much the warnings (which aren't great), but the fact that there are soooo many upstream IBAs that we get significantly closer to the sanity check trigger just by filtering them and injecting our own, as desired.