geneontology / go-shapes

Schema for Gene Ontology Causal Activity Models defined using RDF Shapes

investigate performance issue #199

Closed goodb closed 4 years ago

goodb commented 4 years ago

See https://github.com/geneontology/pathways2GO/issues/69#issuecomment-578879485

The expansion of the shex schema appears to be severely slowing down the shex validator as currently implemented.

The Reactome models are very large compared to the vast majority of other models we are dealing with right now (thinking of the genome imports), but it's still concerningly slow...

ericprud commented 4 years ago

Is this shex.js? If so, swapping regex engines will probably speed things up. It uses two regex engines, threaded-val-nerr and nfax-val-1err.

If the set of possible errors grows large, threaded-val-nerr will be much slower than nfax-val-1err.

You can switch like:

const ShExCore = require('@shexjs/core')
...
const validator = ShExCore.Validator.construct(schema, ShExCore['nfax-val-1err'])
goodb commented 4 years ago

@ericprud our validator code lives in https://github.com/geneontology/minerva which is a Java project. It uses https://github.com/iovka/shex-java which, up until now, has served the purpose quite well... I'd be curious to run a comparison, but it's hard to see us switching this to the JS version in the near future.

ericprud commented 4 years ago

It seems unlikely to me that ShEx.js would out-perform ShEx-java. Should I pester Iovka?

goodb commented 4 years ago

@ericprud sure that would be great. I can give her some of the slow ones if she wants to play. I did some things to extract explanations that might also be slow, I'm sure there are plenty of opportunities to optimize.

goodb commented 4 years ago

Update here. It looks like the slowdown has nothing to do with ShEx. The thing that is slowing us down now is referring to an external Web service to access high-level rdf:types for gene nodes.
This https://github.com/geneontology/minerva/pull/265#issuecomment-564267925 was done for this ticket https://github.com/geneontology/minerva/issues/260

goodb commented 4 years ago

@kltm does the noctua GOLR server support a batch query mode? That could speed this up a lot.

kltm commented 4 years ago

It depends on what you might mean by batch. You can get a large number of docs with a single query and sort them out yourself. Nothing much more structured than that, however.
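The "single query" kltm describes can be sketched as one Solr query that ORs several ids together. This is an illustration only: the field name "id" and the parameter set are assumptions about the GOlr schema, not verified against the real server.

```python
# Sketch: collapse several per-id GOLR lookups into one Solr query.
# The "id" field name and response handling are assumptions.

def build_or_query(ids):
    """Build a single Solr q parameter that ORs all ids, quoted for safety."""
    quoted = " OR ".join('"%s"' % i for i in ids)
    return "id:(%s)" % quoted

params = {
    "q": build_or_query(["UniProtKB:P04637", "UniProtKB:P53_HUMAN"]),
    "wt": "json",
    "rows": "1000",  # make sure the row cap covers every requested id
}
```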

kltm commented 4 years ago

Depending on where and when this is to run, running one's own GOlr in a Docker image with a particular data set is pretty easy.

goodb commented 4 years ago

I was imagining sending a set of curies and having a response that mapped each to the result I would get from sending one at a time. Basically Minerva (both when used from command line for validation and when used as a server) is currently hitting GOLR with all the class ids in a model each time reasoning/validation is requested.

I just added a cache that will make repeated identical requests less of a problem. Still, if I could get e.g. 20 curies at once instead of making 20 requests, that seems better.
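Minerva's actual cache is in Java; purely as an illustration of the idea, memoizing the lookup means a repeated identical request never touches the network twice. The fetch function here is a stand-in, not the real GOLR call.

```python
import functools

calls = {"n": 0}  # counts how many times the "network" is actually hit

def fetch_types_from_golr(curie):
    # Stand-in for the real GOLR request; only tracks invocations.
    calls["n"] += 1
    return ("GO:0003674",)

@functools.lru_cache(maxsize=10000)
def lookup_types(curie):
    """Cached lookup: identical curies resolve from memory after the first call."""
    return fetch_types_from_golr(curie)
```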

When GOLR is sitting on the same machine as Minerva, as I assume would be typical, it ought to be pretty fast. But as we continue to expand the use of the ShEx validator, that won't always be the case unless we work on having local GOLRs wherever that code is executed.

I'd default to making the main one faster first. If that fails, then on to the next solution.

kltm commented 4 years ago

Without making a wrapper yourself, where you save the query and make a hash with the results yourself, the answer would be "no"--Solr deals in returning a list (set) of documents. One could POST a large number of ORed ids and get the desired results, but you would still be left to remap on your own.
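The client-side remapping kltm describes is straightforward: Solr hands back a flat list of documents, and the caller rebuilds a per-id map. A minimal sketch, assuming each document carries an "id" field (an assumption about the GOlr document schema):

```python
def remap_by_id(requested_ids, docs):
    """Map each requested id to its Solr document, or None if nothing came back."""
    by_id = {doc["id"]: doc for doc in docs}
    return {rid: by_id.get(rid) for rid in requested_ids}
```

Ids that Solr did not return map to None, so the caller can distinguish "missing" from "present".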

For the pipeline, there will be the option of having minerva and a GOlr co-located. Currently, in the editorial production environment, they are separate machines.

balhoff commented 4 years ago

It would be simple to do this with a SPARQL endpoint. Should we consider it?

kltm commented 4 years ago

@balhoff A new endpoint built up early in the pipeline?

balhoff commented 4 years ago

This is something that would need to be running all the time for Minerva. Alternatively, we could redesign Minerva a little bit to load NEO into its triplestore in a special graph and do the queries locally. This would not incur the large memory usage that loading into the OWL API would.
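The batched lookup balhoff has in mind maps naturally onto a single SPARQL query with a VALUES clause. A hypothetical sketch, built in Python for illustration; the IRIs and graph layout are assumptions, not the actual NEO setup:

```python
# Sketch: one SPARQL query returning rdf:type for many gene nodes at once,
# instead of one Web-service call per node. IRIs below are placeholders.

RDF_TYPE_QUERY = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
    "SELECT ?gene ?type WHERE {\n"
    "  VALUES ?gene { %s }\n"
    "  ?gene rdf:type ?type .\n"
    "}"
)

def build_type_query(iris):
    """Inline every gene IRI into a single VALUES clause."""
    values = " ".join("<%s>" % iri for iri in iris)
    return RDF_TYPE_QUERY % values
```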

goodb commented 4 years ago

@kltm has also suggested running a copy of the GOLR service alongside Minerva with the same basic effect. I guess fundamentally it's a question of whether we want to push Minerva forward as something that consumes centralized content or as something more at the center of its own content universe.

My premise in support of the former was that we need a good centralized public service anyway, so we might as well leverage that in the Minerva context and avoid duplicating things. But @kltm would have a better notion of whether that premise is justified. Another route along those lines is to build up and open Minerva as that one-stop-shop service for GO content.

Architects?

kltm commented 4 years ago

@goodb I would suggest a quick spin-out call to go over this--I want to make sure we all understand what's going on and which targets we want to hit for different uses.

kltm commented 4 years ago

Notes from quick call: we'll look to explore Solr a little bit more before starting the exploration of other possibilities. Minerva can be set up to batch requests to GOlr, getting the minimal information it needs. This is hopefully sufficient for our "live" use cases for now; in a pipeline context, this would be expected to saturate a Solr instance, so a private image should be provided.

goodb commented 4 years ago

@kltm I have some example 'batch' style requests working from Minerva but rapidly ran into a 414 error (URL too long). I'm assuming the solution will be to convert from GET to POST. Let me know if there is any problem POSTing queries to GOLR; I will try to set it up tomorrow.

If you could share an example OR query over a few genes, that would be useful for me to double-check what I have come up with so far.

kltm commented 4 years ago

Yes, for anything but fairly trivial queries (i.e. you'd expect to copy/paste/publish the link) one should go ahead and use POST.
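Moving the query into a POST body sidesteps the 414 limit, since the ids no longer live in the URL. A minimal sketch using form encoding; the field name "id" and endpoint path are assumptions about the GOlr setup:

```python
from urllib.parse import urlencode

def build_post_body(ids):
    """Form-encoded Solr query body; POST this instead of putting ids in the URL."""
    q = "id:(%s)" % " OR ".join('"%s"' % i for i in ids)
    return urlencode({"q": q, "wt": "json", "rows": str(len(ids))})

# Send with Content-Type: application/x-www-form-urlencoded to the
# Solr select endpoint (path depends on the local GOlr deployment).
```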

kltm commented 4 years ago

@goodb From earlier, you can run a local solr instance easily from one of the produced pipeline solr dumps with: https://github.com/geneontology/operations/tree/master/docker/amigo-standalone#grab-remote-pipeline-contents-and-use

goodb commented 4 years ago

@kltm I ran the prebuilt GOLR image as per instructions. (You might add that the server will appear at http://127.0.0.1:8080/solr/ as that wasn't totally obvious).

It is indeed faster. Running the 15 models currently being used as positive tests for the schema development validation (OWL and ShEx) completed in 3.2, 5.1, and 5.1 seconds on three different runs with the local server. Not sure why the variance. Using noctua.golr I got 6.4, 7, and 6.5 seconds for the same models. So it's not overwhelmingly different, and both are fast enough for a UI, but over a million models it would matter.

I added a parameter to the minerva command line validator that makes it possible to set the golr location.

kltm commented 4 years ago

@goodb Great! That last sentence there is what we will eventually want anyway--all pipeline functions need to be able to be rewired to look inward.

goodb commented 4 years ago

Investigation completed for the moment. The Java ShEx validator was not the problem this time; rather, the reasoning (and reasoning-like tricks) used to generate the full models for validation was the culprit.