geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

Severe annotation drop reported in release pipeline #282

Closed by kltm 2 years ago

kltm commented 2 years ago

Reported by @pgaudet:

"It looks like we lost more than half of the annotations: (data from http://skyhook.berkeleybop.org/release/release_stats/go-annotation-changes.tsv)"

SUMMARY: DIFF BETWEEN RELEASES

annotated bioentities:    -534851

Also, that go-annotation-changes.tsv has data that never used to be there, that starts with “WARNINGS”

kltm commented 2 years ago

@dustine32 "eyeballed the annotation file sizes real quick and they 'looked normal'".

A current working theory from @dustine32 is that it may be "an upstream data issue (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy) or if something else (on our side?) is breaking the go-stats tax".

dustine32 commented 2 years ago

Short story: This is on our side but it's just a bug with the reporting, not the actual product data. To fix, I think we just need to remove a hard-coded aspgd.gaf reference from a script run from docker golr-autoindex.

Long story: I noticed a drastic reduction in the number of taxa returned from the API call to https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy - 129 on 2022-04-19 vs. 5168 on 2022-03-22. This call is made from the go-stats code. It turns out the reduction is due to the lower number of taxa we send in the params, which come from the all_annotations DS retrieved from the Golr instance running locally in the pipeline.
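A drop like the one quoted above (5168 taxa on 2022-03-22 vs. 129 on 2022-04-19) is the kind of change a simple guard could flag before the stats are published. Here is a minimal sketch in Python. It assumes a hypothetical `check_taxa_count` helper and an arbitrary 50% threshold, and neither is part of go-stats:

```python
def check_taxa_count(previous: int, current: int, max_drop: float = 0.5) -> bool:
    """Return True if the taxa count looks sane relative to the previous run.

    max_drop is the largest tolerated fractional decrease; the 0.5 default
    is a hypothetical value chosen for illustration only.
    """
    if previous <= 0:
        # No usable baseline to compare against, so there is nothing to flag.
        return True
    drop = (previous - current) / previous
    return drop <= max_drop


# Counts quoted in this thread: 5168 taxa previously, 129 in the bad run.
if not check_taxa_count(previous=5168, current=129):
    print("WARNING: taxa count fell sharply; check the Golr load upstream")
```

A check of this kind would only point at the symptom. The cause still has to be traced back through the pipeline logs.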

Backtracking to the logs for the in-pipeline Golr loading step, I see a new FileNotFoundException for http://skyhook.berkeleybop.org/release/annotations/aspgd.gaf.gz. This makes sense, as aspgd was recently dropped as a product, so the file wouldn't be available. We now need to remove the hard-coded reference in the run-indexer.sh script.
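One way to surface a stale input like this earlier would be a pre-flight check that lists unavailable files before the indexer starts. The following is a rough sketch in Python, not what run-indexer.sh does. The helper names `url_exists` and `missing_inputs` are hypothetical, and it assumes the server answers HEAD requests:

```python
import urllib.request
from typing import Callable, Iterable, List


def url_exists(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to url succeeds.

    Assumption: the server supports HEAD. HTTP errors (such as 404),
    connection failures, and timeouts are all treated as "not available".
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except OSError:
        # urllib.error.URLError and HTTPError are both subclasses of OSError.
        return False


def missing_inputs(
    urls: Iterable[str],
    exists: Callable[[str], bool] = url_exists,
) -> List[str]:
    """Return the URLs that the exists check reports as unavailable."""
    return [url for url in urls if not exists(url)]
```

Calling `missing_inputs([...])` with the configured annotation URLs and failing the step when the result is non-empty would turn a buried FileNotFoundException into an explicit error. The `exists` parameter is there so the check can be exercised without network access.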

@kltm I think the action items are now:

kltm commented 2 years ago

@dustine32 Great--thank you for getting to the bottom of this. The fix is actually pretty simple (striking out your TODO list): when I propagated the removal of aspgd, I did not (forgot to) apply it to release. The variable in the docker image is a default and not used, as it's supplied from outside. I believe the chain of events is this:

I've put the fix in place and release is ready to go again. I think there is a little GeneDB work to do before triggering.

kltm commented 2 years ago

Noting https://github.com/geneontology/go-site/issues/1820