geneontology / minerva

add a CLI command to dump all models in JSON response format #500

Closed (balhoff closed 2 years ago)

balhoff commented 2 years ago

See https://github.com/geneontology/api-gorest-2021/issues/6#issuecomment-1181018713

@kltm is this something we should go ahead and start on?

kltm commented 2 years ago

@balhoff If it's not too hard: yes. This is part of a discretionary project, so we are clear to move ahead. The idea would be to get a dump of separate JSON files into a directory, which look like the JSON model contents of returned responses. Alternatively, if a mega-file was somehow much easier, we could maybe feed that into jq or something and work it out.
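For the "mega-file" fallback mentioned above, a small script could stand in for jq and split one large dump into per-model files. This is only a sketch under stated assumptions: it supposes the mega-file is a top-level JSON array and that every model object carries an `id` field (for example `gomodel:...`). Neither detail is confirmed in this thread, and `split_models` is a hypothetical helper, not part of Minerva.

```python
import json
from pathlib import Path


def split_models(mega_file: str, out_dir: str) -> int:
    """Write each model in a JSON array to its own file; return the count."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(mega_file, encoding="utf-8") as fh:
        models = json.load(fh)  # assumption: a top-level JSON array of models
    for model in models:
        # Assumption: each model has an "id"; ':' is replaced so the id is
        # safe to use as a file name.
        name = model["id"].replace(":", "_")
        (out / f"{name}.json").write_text(
            json.dumps(model, indent=2), encoding="utf-8"
        )
    return len(models)
```

Loading the whole array into memory is the simplest approach; a dump of tens of thousands of models might need a streaming parser instead.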

kltm commented 2 years ago

Considering how we're going to use this, a tarballed version may eventually find its way into releases.

kltm commented 2 years ago

@balhoff Looking at the output, I think that it should be CURIEs rather than URIs to match what we're currently doing. (Re: https://github.com/geneontology/minerva/pull/501) Otherwise, I think this may be it.
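To illustrate the URI versus CURIE distinction, here is a minimal sketch of prefix-based compaction. The prefix map is a hypothetical stand-in; the real mappings live in Minerva's CURIE handler configuration, and `to_curie` is an illustrative name rather than a Minerva function.

```python
# Hypothetical prefix map for illustration only; the authoritative mappings
# come from the CURIE handler configuration, not from this sketch.
PREFIXES = {
    "gomodel": "http://model.geneontology.org/",
    "GO": "http://purl.obolibrary.org/obo/GO_",
}


def to_curie(iri: str, prefixes: dict = PREFIXES) -> str:
    """Return 'prefix:local' for the longest matching namespace, else the IRI."""
    best = None
    for prefix, namespace in prefixes.items():
        # Prefer the longest namespace so nested namespaces resolve correctly.
        if iri.startswith(namespace) and (best is None or len(namespace) > len(best[1])):
            best = (prefix, namespace)
    if best is None:
        return iri  # no known namespace: leave the IRI untouched
    return f"{best[0]}:{iri[len(best[1]):]}"
```

Under this assumed map, `to_curie("http://purl.obolibrary.org/obo/GO_0008150")` gives `GO:0008150`, while an IRI with no matching namespace is returned unchanged.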

kltm commented 2 years ago

@balhoff Apologies to tag you here again, but I think my message above crossed when you were out. I just wanted to let you know that I tested, but we're currently using CURIEs instead of URIs for this use case (just like communicating with noctua).

balhoff commented 2 years ago

@kltm thanks, this was a mistake in setting up the CURIE handler. I fixed it: https://github.com/geneontology/minerva/pull/501/commits/a16995b44aeb74e7bed8f2f799736bf8c7383d9c

kltm commented 2 years ago

Great--thank you! I'm trying it out now.

kltm commented 2 years ago

@balhoff I attempted to run the full command, but it seemed to error out towards the end, after about an hour and a half runtime, with:

2022-08-11 15:16:27,327 WARN  (com.bigdata.rdf.ServiceProviderHook:171) Running.
2022-08-11 15:16:27,327 WARN  (com.bigdata.rdf.ServiceProviderHook:171) Running.

java.lang.IllegalStateException: Manager on ontology OntologyID(OntologyIRI(<http://model.geneontology.org/MGI_MGI_1924374>) VersionIRI(<http://model.geneontology.org/MGI_MGI_1924374>)) is null; the ontology is no longer associated to a manager. Ensure the ontology is not being used after being removed from its manager.
    at uk.ac.manchester.cs.owl.owlapi.OWLImmutableOntologyImpl.getOWLOntologyManager(OWLImmutableOntologyImpl.java:202)
    at uk.ac.manchester.cs.owl.owlapi.concurrent.ConcurrentOWLOntologyImpl.withReadLock(ConcurrentOWLOntologyImpl.java:162)
    at uk.ac.manchester.cs.owl.owlapi.concurrent.ConcurrentOWLOntologyImpl.getOWLOntologyManager(ConcurrentOWLOntologyImpl.java:238)
    at org.geneontology.minerva.ModelContainer.getOWLOntologyManager(ModelContainer.java:95)
    at org.geneontology.minerva.ModelContainer.dispose(ModelContainer.java:103)
    at org.geneontology.minerva.CoreMolecularModelManager.unlinkModel(CoreMolecularModelManager.java:805)
    at org.geneontology.minerva.CoreMolecularModelManager.dispose(CoreMolecularModelManager.java:822)
    at org.geneontology.minerva.BlazegraphMolecularModelManager.dispose(BlazegraphMolecularModelManager.java:826)
    at org.geneontology.minerva.cli.CommandLineInterface.modelsToJSON(CommandLineInterface.java:530)
    at org.geneontology.minerva.cli.CommandLineInterface.main(CommandLineInterface.java:240)

It also seemed to be a few models short of the full set:

sjcarbon@moiraine:/tmp/jsonout$:) ls -AlF | wc -l
41940
sjcarbon@moiraine:~/local/src/git/noctua-models[master]$:( ls -AlF models/ | wc -l
42130

But maybe that's due to some not being capable of producing JSON for some reason? (Although it could be due to an increase in models while I was doing the periodic update flush--I may have to double check that.)

Any thoughts on this error?
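The two counts above differ by 190 (each `ls -l | wc -l` figure also includes the `total` line, so the difference is unaffected). To see which models are missing rather than just how many, the two directories could be compared by file stem. The sketch below assumes the source models are `*.ttl` files and the dump writes `*.json` files with matching stems; the thread does not confirm the output naming, and `missing_models` is a hypothetical helper.

```python
from pathlib import Path


def missing_models(source_dir: str, json_dir: str) -> list:
    """Return stems present in source_dir that have no JSON counterpart."""
    # Assumption: sources are *.ttl and dumped files are *.json, same stem.
    sources = {p.stem for p in Path(source_dir).glob("*.ttl")}
    dumped = {p.stem for p in Path(json_dir).glob("*.json")}
    return sorted(sources - dumped)
```

A call such as `missing_models("models/", "/tmp/jsonout")` would then list the stems to investigate, which helps separate models that failed to serialize from models added during the run.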

balhoff commented 2 years ago

Thanks for testing. I added parallelism to the output, since it was so slow. I'll disable this and see if it fixes the error. I suspect Minerva is not as robust to multithreading as we would like.
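As a general illustration of the trade-off described here (not Minerva's code, which is Java), one common pattern is to serialize access to the component that is not thread-safe while leaving the independent rendering step parallel. `LockedStore`, `render`, and `dump_all` are hypothetical names invented for this sketch.

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedStore:
    """Hypothetical stand-in for a model manager that is not thread-safe."""

    def __init__(self, models: dict):
        self._models = dict(models)
        self._lock = threading.Lock()

    def load(self, model_id: str) -> dict:
        # Only one thread at a time may touch the shared store.
        with self._lock:
            return dict(self._models[model_id])


def render(model: dict) -> str:
    """Pure function: safe to run concurrently."""
    return json.dumps(model, sort_keys=True)


def dump_all(store: LockedStore, model_ids: list, workers: int = 4) -> dict:
    """Load under a lock, render in parallel; return {model_id: json_text}."""

    def one(model_id: str):
        return model_id, render(store.load(model_id))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(one, model_ids))
```

If most of the time is spent inside the locked section, this gains little over a sequential run, so disabling parallelism outright is a reasonable first step.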

balhoff commented 2 years ago

@kltm I updated the branch without the parallelism. The job ran to completion on my laptop.

kltm commented 2 years ago

Currently testing on pipeline build machine.

kltm commented 2 years ago

@balhoff The output now seems in line with what I'd expect--thank you!

Noting that this took two hours on a pretty peppy machine--it might be good to explore what's going on with this as we are likely having the same issue writ small all the time in responses.

From here, we can look at adding this to the pipeline, and then onto supporting the GO-CAM API...

kltm commented 2 years ago

This ticket can be closed once added to the pipeline and the products are located in an accessible location for an API. https://github.com/geneontology/api-gorest-2021/issues/6

kltm commented 2 years ago

@balhoff IIRC, you were thinking that there might be an optimization that could be done to help accelerate the JSON dump process?

balhoff commented 2 years ago

So far I've just done some profiling; it looks like about half the time is reading models out of the database, and the other half is doing the queries to categorize nodes by high level terms.
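A per-phase breakdown like the one described can be gathered with simple wall-clock accumulators. The sketch below is a generic illustration, not the profiling actually used; `read` and `categorize` are hypothetical callables standing in for the database read and the node-categorization queries.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated seconds per phase name.
timings = defaultdict(float)


@contextmanager
def timed(phase: str):
    """Add the elapsed wall-clock time of the enclosed block to timings[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] += time.perf_counter() - start


def dump_model(model_id, read, categorize):
    """Run the two phases for one model, timing each separately."""
    with timed("read"):
        model = read(model_id)
    with timed("categorize"):
        return categorize(model)
```

After a full run, comparing `timings["read"]` with `timings["categorize"]` shows which phase dominates and is the better optimization target.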

kltm commented 2 years ago

Okay, I think I've found a place where it parallelizes fairly well, so maybe we don't have to worry too much about speed for the moment (although this might not scale). With that, I think we're done here--further issues can be a new ticket. Thank you @balhoff !