geneontology / minerva

Annotations missing from noctua.mgi.gpad on snapshot #335

Closed (ukemi closed 3 years ago)

ukemi commented 4 years ago

The MGI GPAD file on snapshot appears to be missing annotations: http://snapshot.geneontology.org/products/annotations/noctua_mgi.gpad.gz

This model checks out, but I can't find any of its annotations in the snapshot from 7/19/2020. http://noctua.geneontology.org/editor/graph/gomodel:5ee8120100001244?model_id=gomodel:5ee8120100001244

kltm commented 3 years ago

@ukemi Can you confirm that the correct (latest) version of your model is at: https://github.com/geneontology/noctua-models/blob/master/models/5ee8120100001244.ttl ? According to GH and the model metadata, there have been no changes since 2020-07-06. (I just want to make sure we're at least starting from the assumption that the issue is in minerva and not in saving, model push, etc.)

ukemi commented 3 years ago

Hi @kltm,

Yes, this is the model.

goodb commented 3 years ago

@ukemi when you look at the model in Noctua, are the annotations missing, e.g. at http://noctua.geneontology.org/workbench/annpreview/?model_id=gomodel:5ee8120100001244 ? If they show up there but not in the GPAD, that would indicate either a pipeline error or a minerva GPAD generation error. If so, could you give one example that should be in the GPAD but is not?

ukemi commented 3 years ago

The annotations are there.

ukemi commented 3 years ago

Could this also be related to #328? Also now questioning whether we really want to implement #269

hdrabkin commented 3 years ago

Alka-Seltzer moment: I discovered today that between 6/17 and 6/18 we lost over 50% of our Noctua annotations:

6/17 NOCTUA Annotations:
Total Number of Genes Annotated to: 1027
Total Number of Annotations: 6716

6/18 NOCTUA Annotations:
Total Number of Genes Annotated to: 551 <<<<<<<<<< 476 loss
Total Number of Annotations: 3111 <<<<<<<<<< 3605 loss
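
For reference, the differences and percentage losses can be recomputed from the four totals quoted in this comment. A minimal sketch using awk; the helper name `loss_report` is made up for this illustration:

```shell
# Recompute the loss between the 6/17 and 6/18 totals quoted in this comment.
# loss_report is a hypothetical helper name, used only for this illustration.
loss_report() {
    # $1 = label, $2 = count before, $3 = count after
    awk -v label="$1" -v before="$2" -v after="$3" 'BEGIN {
        lost = before - after
        printf "%s: lost %d of %d (%.1f%%)\n", label, lost, before, 100 * lost / before
    }'
}

loss_report "genes" 1027 551
loss_report "annotations" 6716 3111
```

By these numbers the annotation loss is a little over half and the gene loss a little under half.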

kltm commented 3 years ago

TL;DR: So, what we seem to have here is the model-state getting dropped for some reason somewhere in the minerva steps (below); it seems to be there in GH: https://github.com/geneontology/noctua-models/blob/291a0a75bc7a890800da2e13f1953cea6a42aa21/models/5ee8120100001244.ttl#L16 and does not appear in the GPAD.

Any ideas @balhoff or @goodb ?


To spell out how to reproduce this:

From @ukemi's comment https://github.com/geneontology/minerva/issues/335#issuecomment-661816653 , we know that these have at least gotten into GH. This would seem to leave two error points: 1) pipeline mechanics (in feeding or handling) or 2) minerva error.

Grabbing the log from the last successful snapshot, the model is mentioned six times:

[2020-08-03T07:34:02.861Z] 2020-08-03 00:34:02,780 INFO  (CommandLineInterface:442) Loading models/5ee8120100001244.ttl
[2020-08-03T07:55:43.643Z] 2020-08-03 00:55:43,550 INFO  (BlazegraphMolecularModelManager:594) Load model abox: http://model.geneontology.org/5ee8120100001244 from database
[2020-08-03T08:01:07.825Z] + perl ./util/collate-gpads.pl [A LOT OF STUFF] legacy/gpad/5ee8120100001244.gpad [A LOT OF STUFF]
[2020-08-03T18:58:33.198Z] 2020-08-03 18:58:32,969 INFO org.renci.blazegraph.Load$ - Loading target/noctua-models/models/5ee8120100001244.ttl
[2020-08-03T19:01:36.437Z] http://model.geneontology.org/5ee8120100001244
[2020-08-03T19:02:20.871Z] 2020-08-03 19:02:20,841 INFO org.renci.blazegraph.Reason$ - 1253 changes in Some(http://model.geneontology.org/5ee8120100001244_inferred)

Poking around the stage logs a bit, this seems mechanically what I'd expect.

Trying to simulate locally:

git clone https://github.com/geneontology/noctua-models.git
mkdir models
mv noctua-models/models/5ee8120100001244.ttl ./models/
~/local/src/git/minerva/minerva-cli/bin/minerva-cli.sh --import-owl-models -f models -j blazegraph.jnl
mkdir -p legacy/gpad
~/local/src/git/minerva/minerva-cli/bin/minerva-cli.sh --lego-to-gpad-sparql --ontology http://skyhook.berkeleybop.org/snapshot/ontology/extensions/go-lego.owl -i blazegraph.jnl --gpad-output legacy/gpad
grep -c "5ee8120100001244" legacy/gpad/5ee8120100001244.gpad 

14

Which, I believe, means that our annotations have gotten this far. The final step is:

perl noctua-models/util/collate-gpads.pl legacy/gpad/*.gpad
No production models for MGI
No production models for MGI
No production models for MGI
No production models for MGI
No production models for MGI
No production models for MGI
No production models for MGI
No production models for MGI
No production models for MGI
No production models for MGI
No production models for MGI
No production models for MGI
No production models for MGI
No production models for MGI

and there is no further output...which I think is a problem?

In the script the following seems to be triggered:

    if (!grep {$_ eq 'model-state=production'} @props) {

Ah!

grep -c "state" legacy/gpad/5ee8120100001244.gpad 

0
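
The same check can be run across every per-model file to see how widespread the missing property is. A minimal sketch, assuming a `grep` that supports `-L` (list files with no match, as GNU and BSD grep do); the helper name and the two sample files are made up for illustration and stand in for `legacy/gpad/*.gpad`:

```shell
# List GPAD files that contain no model-state property at all.
# files_without_state is a hypothetical helper name; grep -L prints the
# names of files in which the pattern does not occur.
files_without_state() {
    grep -L 'model-state=' "$@"
}

# Made-up stand-ins for legacy/gpad/*.gpad, for illustration only.
demo_dir=$(mktemp -d)
printf 'row-a\tcontributor=x|model-state=production\n' > "$demo_dir/has_state.gpad"
printf 'row-b\tcontributor=x\n' > "$demo_dir/no_state.gpad"

files_without_state "$demo_dir"/*.gpad
```

Against the real output directory the call would be `files_without_state legacy/gpad/*.gpad`; any file it prints would trip the `model-state=production` test in the collate script.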

goodb commented 3 years ago

@kltm I think I have a solution and a cause. Want to run it by @balhoff but I suspect this will do it. I seem to have introduced this in an earlier quest to fix some other problem.

goodb commented 3 years ago

@kltm I think it would be straightforward to add a parameter to the minerva client that would apply the 'production-only' filter at the time the GPAD was generated. Do you want me to do that? Having that perl script that you discovered in the middle of the gpad assembly process for the pipeline seems maybe not so good from the standpoint of testing and stability. LMK.
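
As a rough illustration of what a 'production-only' filter has to do, wherever it ends up living, the sketch below keeps only rows whose properties column lists `model-state=production`. This is not the minerva implementation and does not show any minerva-cli parameter. It assumes the properties are the 12th tab-separated column and are pipe-delimited, as in the GPAD layout; the function name and the sample rows are hypothetical.

```shell
# Keep header lines and rows whose properties column contains model-state=production.
# Assumptions: properties are the 12th tab-separated column, delimited by '|'.
# production_only is a hypothetical helper name, not part of minerva.
production_only() {
    awk -F '\t' '
        /^!/ { print; next }               # pass header/comment lines through
        {
            n = split($12, props, "|")
            for (i = 1; i <= n; i++) {
                if (props[i] == "model-state=production") { print; break }
            }
        }
    ' "$@"
}

# Made-up sample rows, for illustration only.
sample=$(mktemp)
{
    printf '!gpa-version: 1.1\n'
    printf 'MGI\tMGI:0000001\tenables\tGO:0000001\tPMID:1\tECO:0000314\t\t\t20200706\tMGI\t\tcontributor=x|model-state=production\n'
    printf 'MGI\tMGI:0000002\tenables\tGO:0000001\tPMID:1\tECO:0000314\t\t\t20200706\tMGI\t\tcontributor=x|model-state=development\n'
    printf 'MGI\tMGI:0000003\tenables\tGO:0000001\tPMID:1\tECO:0000314\t\t\t20200706\tMGI\t\tcontributor=x\n'
} > "$sample"

production_only "$sample"
```

Note that a row with no `model-state` property at all (the third sample row) is dropped just like a non-production one, which is the behaviour that hid the MGI annotations here.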

hdrabkin commented 3 years ago

Just wondering: we are still missing about 50% in the download. Any progress?

kltm commented 3 years ago

There is likely an incoming fix with #341 , pending review from @balhoff .

kltm commented 3 years ago

@hdrabkin @ukemi We should hopefully get some results from the new code on Friday.

pgaudet commented 3 years ago

In the release candidate we are missing several annotations coming from SynGO via Noctua: for example:

I think this is blocking for the Sept 2020 release.

hdrabkin commented 3 years ago

We did get ours back last week (we were missing 50%, a mix of both SynGO and MGI).

pgaudet commented 3 years ago

@kltm could the files we loaded be out of date ?

hdrabkin commented 3 years ago

The current snapshot file appears to have 6916 lines attributed to SynGO. The file header date is 8/30/2020.

pgaudet commented 3 years ago

Thanks @hdrabkin. We need this data in the Sept release (the release candidate has 4980 SynGO annotations). I will stop the release process.

kltm commented 3 years ago

@pgaudet I think that if this is an issue, it would be a new issue, not related to the "production" tag issue we had here. Are you looking at the output GPAD products, such as noctua_mgi.gpad.gz?

kltm commented 3 years ago

Talking to @pgaudet earlier, this may just be an "echo" of this issue as it passes through various external pipelines that are on different schedules.