Open kltm opened 1 month ago
Looking around, this has been an "issue" since around the last release; I'm guessing it is related to new code in one way or another. There has also been a reduction in the SGD upstream size.
I'm honestly not sure how the sanity checks have not been triggered before this. I'm going to pause the current snapshot attempt for the moment, waiting for feedback.
The reason it specifically seems to have ticked over into failure is that it crossed over the 50% reduction mark.
Checking the SGD report, this could be due to recent changes in ID checking code:
WARNING - Invalid identifier: GORULE:0000027: 2144215 does not match any id_syntax patterns for MGI in dbxrefs (MGI:MGI:2144215) -- SGD S000005027 SAL1 enables GO:0005347 GO_REF:0000033 IBA MGI:MGI:2144215 F ADP/ATP transporter YNL083W|Ca(2+)-binding ATP:ADP antiporter SAL1 protein taxon:559292 20231109 GO_Central UniProtKB:D6W196
The warning message points to matching 2144215 against the MGI regex pattern MGI:[0-9]{5,}, which wouldn't be valid:
https://github.com/geneontology/go-site/blob/7869016e9303fd5e7840cb1d7a2d09272aaf4d36/metadata/db-xrefs.yaml#L1616
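To make the hunch concrete, here is a minimal sketch of how a bare `2144215` could end up being tested against that pattern. It is not the actual checking code: the two splitting helpers (`split_first`, `split_last`) are hypothetical names made up for illustration, and the only thing taken from the thread is the pattern `MGI:[0-9]{5,}` and the identifier `MGI:MGI:2144215`.

```python
import re

# Pattern quoted in the thread for MGI local identifiers.
MGI_LOCAL_ID = re.compile(r"MGI:[0-9]{5,}")


def split_first(curie):
    """Hypothetical helper: split on the FIRST colon (prefix, local id)."""
    prefix, _, local = curie.partition(":")
    return prefix, local


def split_last(curie):
    """Hypothetical helper: split on the LAST colon (prefix, local id)."""
    prefix, _, local = curie.rpartition(":")
    return prefix, local


curie = "MGI:MGI:2144215"
for splitter in (split_first, split_last):
    prefix, local = splitter(curie)
    matches = MGI_LOCAL_ID.fullmatch(local) is not None
    print(f"{splitter.__name__}: prefix={prefix!r} local={local!r} matches={matches}")
```

If the local part is taken after the first colon, `MGI:2144215` satisfies the pattern; if everything up to the last colon is stripped, the bare digits do not. Whether the real code does the latter is exactly what would need confirming in a debugger.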
That said, the report lists only 8361 such lines, and they are just WARNINGs, so the lines are not necessarily dropped. I also haven't really confirmed this in a debugger. @mugitty would you be able to debug these SGD IBA lines? This may not be the cause, mainly a hunch. What do you think?
@dustine32 It's not so much the warnings (which aren't great), but the fact that there are soooo many upstream IBAs that we get significantly closer to the sanity check trigger just by filtering them and injecting our own, as desired.
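For anyone wanting to check how much of an upstream file is IBA, a rough sketch along these lines might help. It assumes a tab-separated GAF file with the evidence code in column 7; the function names are made up here, the first sample row is the SAL1 line from the warning truncated to nine columns, and the second row is an invented placeholder, not real data.

```python
from collections import Counter

EVIDENCE_COL = 6  # GAF column 7 (0-based index 6) holds the evidence code


def count_evidence(gaf_lines):
    """Count annotation lines per evidence code, skipping '!' header lines."""
    counts = Counter()
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) > EVIDENCE_COL:
            counts[cols[EVIDENCE_COL]] += 1
    return counts


def drop_iba(gaf_lines):
    """Return the lines with IBA annotations removed; headers are kept."""
    kept = []
    for line in gaf_lines:
        cols = line.rstrip("\n").split("\t")
        is_header = line.startswith("!")
        if not is_header and len(cols) > EVIDENCE_COL and cols[EVIDENCE_COL] == "IBA":
            continue
        kept.append(line)
    return kept


sample = [
    "!gaf-version: 2.2",
    # SAL1 line from the warning, truncated to the first nine columns.
    "SGD\tS000005027\tSAL1\tenables\tGO:0005347\tGO_REF:0000033\tIBA\tMGI:MGI:2144215\tF",
    # Invented placeholder row, for illustration only.
    "SGD\tS000000000\tPLACEHOLDER\tenables\tGO:0003674\tPMID:0\tIDA\t\tF",
]

print(count_evidence(sample))
print(len(drop_iba(sample)), "lines kept after dropping IBAs")
```

Running `count_evidence` on the upstream file and on the final file would show directly how much of the difference is down to IBAs rather than other filtering.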
During the most recent snapshot run, we failed on an SGD sanity check. Essentially, the SGD source file had 152076 annotations and the final file had 65609, a reduction of about 86k (roughly 57%). This large reduction triggered a failsafe (good!).
Looking into it, I currently believe the issue is with IBAs.
The line count of filtered incoming IBAs is about 100477; the line count of injected IBAs is about 15330. That net loss of about 85k lines would account for the bulk of the drop.
As GOC is the canonical source, we're doing the right thing here and we can (and temporarily will) suppress the SGD sanity check, but the IBA noise does limit the use of this primitive check.
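The arithmetic behind this can be laid out in a few lines. This is only a sketch: the counts are the ones quoted in this thread, the 50% threshold is the mark mentioned in the comments, and `reduction_fraction` is a hypothetical helper, not the pipeline's actual check.

```python
def reduction_fraction(source_count, final_count):
    """Fraction of source annotations missing from the final file."""
    return (source_count - final_count) / source_count


# Counts quoted in the thread.
source, final = 152076, 65609
filtered_iba, injected_iba = 100477, 15330

THRESHOLD = 0.5  # assumed from the "50% reduction mark" mentioned in the thread

drop = source - final                   # total annotations lost
net_iba = filtered_iba - injected_iba   # net IBA lines removed
frac = reduction_fraction(source, final)

print(f"total drop: {drop}")
print(f"net IBA removal: {net_iba} ({net_iba / drop:.1%} of the drop)")
print(f"reduction: {frac:.1%}, check trips: {frac > THRESHOLD}")
```

On these numbers the IBA swap alone accounts for nearly all of the loss, which is why the check trips even though nothing is actually wrong with the filtering.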
Tagging @pgaudet @suzialeksander