geneontology / minerva

BSD 3-Clause "New" or "Revised" License

Add GO-CAM Validation Service to Minerva #212

Closed kltm closed 4 years ago

kltm commented 5 years ago

To support moving the logic of error checking, etc., from the client to the server, we want to build a new flow through Minerva that:

As an initial trial, the inconsistent return item will be implemented with this pattern.

cmungall commented 5 years ago

empty list?

@dougli1sqrd @balhoff I assume we will run the rules per-model. Do we need to include an explicit graph parameter into each query template or can this be handled at the API level?

kltm commented 5 years ago

@cmungall Wanted to have a placeholder to come back to as we're out of the room. Will be filled in ASAP

kltm commented 5 years ago

Possible JSON return value format, with key go-rules:

[
 {
  "id": "ruleid",
  "violating-entities": [
   {
    "id": "entity:id",
    "commentary": []
   }
  ],
  "commentary": []
 }
]
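
For discussion, a minimal Java sketch of how a server could assemble the proposed payload. Only the JSON keys (id, violating-entities, commentary) come from the proposal above; the class and method names are hypothetical and this is not Minerva code. The hand-rolled serialization keeps the sketch free of dependencies; a real service would presumably use a JSON library.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical model of the proposed "go-rules" return value.
public class GoRulesPayload {

    // One entity that tripped a rule, with optional free-text commentary.
    static final class ViolatingEntity {
        final String id;
        final List<String> commentary;

        ViolatingEntity(String id, List<String> commentary) {
            this.id = id;
            this.commentary = commentary;
        }

        String toJson() {
            return "{\"id\":" + quote(id) + ",\"commentary\":" + quoteAll(commentary) + "}";
        }
    }

    // The outcome of one rule: the rule id plus the entities that violated it.
    static final class RuleResult {
        final String id;
        final List<ViolatingEntity> violatingEntities;
        final List<String> commentary;

        RuleResult(String id, List<ViolatingEntity> violatingEntities, List<String> commentary) {
            this.id = id;
            this.violatingEntities = violatingEntities;
            this.commentary = commentary;
        }

        String toJson() {
            String entities = violatingEntities.stream()
                .map(ViolatingEntity::toJson)
                .collect(Collectors.joining(",", "[", "]"));
            return "{\"id\":" + quote(id)
                + ",\"violating-entities\":" + entities
                + ",\"commentary\":" + quoteAll(commentary) + "}";
        }
    }

    // Minimal escaping (backslash and double quote only); enough for a sketch.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    static String quoteAll(List<String> values) {
        return values.stream().map(GoRulesPayload::quote).collect(Collectors.joining(",", "[", "]"));
    }

    // Serializes the list that would sit under the "go-rules" key.
    static String toJson(List<RuleResult> results) {
        return results.stream().map(RuleResult::toJson).collect(Collectors.joining(",", "[", "]"));
    }

    public static void main(String[] args) {
        RuleResult result = new RuleResult(
            "ruleid",
            List.of(new ViolatingEntity("entity:id", List.of())),
            List.of());
        System.out.println(toJson(List.of(result)));
    }
}
```

A rule with no violations would simply carry an empty violating-entities list, so clients can treat "no entries" as a pass.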

kltm commented 5 years ago

If this format doesn't work, let's split it into another ticket and refer this and geneontology/noctua#598

dougli1sqrd commented 5 years ago

@cmungall we shouldn't need a specific graph parameter. @balhoff explained yesterday that when we load and modify a model we go from the blazegraph database, load a model into an OWL API object, and then with the reasoner on, the owl model gets converted into a jena model (triples) so that arachne can run. This Jena model is what the go rules sparql/shex will run on.

goodb commented 5 years ago

Noting to @balhoff this has been put on my plate. Would like to touch base with you before digging in.

kltm commented 5 years ago

Also noting existence of https://github.com/geneontology/noctua/issues/567 ; perhaps we should close that one for this.

goodb commented 5 years ago

@kltm agreed - we should close one or the other of these (or both and make a new one). I think it's appropriate for the issue to live in the minerva repo, though I like the somewhat higher-level description in the noctua issue better, as I'm not sure a bunch of SPARQL queries will be the form this ends up taking in the long run.

dougli1sqrd commented 5 years ago

@goodb Which rule do we want to encode in SPARQL first to test this?

goodb commented 5 years ago

@dougli1sqrd Regarding priority for implementing rules, I don't know - doesn't really matter a whole lot at this point as I think most of the immediate work is getting the communication protocol settled. Which rules are already implemented with SPARQL on the triplestore? Maybe we could start with these? Another approach is simply to step down in order from the top as that will inform us of what we can and can't accomplish in SPARQL versus other methods.

I started documenting thought processes around this issue at the bottom of @balhoff's Minerva Notes doc. Also, FWIW I got something working today inside Minerva that will run SPARQL on a specific (reasoned) model from Noctua and return results.

goodb commented 5 years ago

I pushed a branch with a hack at implementing a go rule service. It is working, but clearly needs some more thought. https://github.com/geneontology/minerva/commit/25d4f3b5950220be37ce7b0a4c325bb473afacac

dougli1sqrd commented 5 years ago

This is cool! I had a plan back in the day to use https://github.com/geneontology/sparqlr as a place to put queries for this kind of thing. Does something like that work?

Currently, this is run during certain minerva requests from Noctua?

So, when the rules are run, and the report is generated, it gets saved off. Does Minerva know yet if the rule failed (by seeing results of the query)? Like how does it show up as having failed?

goodb commented 5 years ago

> This is cool! I had a plan back in the day to use https://github.com/geneontology/sparqlr as a place to put queries for this kind of thing. Does something like that work?

It reads the rules from a directory, so wherever that is shouldn't really matter. You can see the rules here now: https://github.com/geneontology/minerva/commit/25d4f3b5950220be37ce7b0a4c325bb473afacac#diff-f0f888bedad707d1e31cee8cce8788b6
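
As a rough illustration of reading rules from a directory, here is a dependency-free Java sketch. It is not the code in the linked commit; the class name, the method name, and the choice of filtering by a file extension such as ".yml" are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical loader: the rule directory can live anywhere on disk.
public class RuleDirectoryLoader {

    // Returns file name -> raw rule text for each regular file with the given extension.
    static Map<String, String> loadRules(Path dir, String extension) throws IOException {
        List<Path> ruleFiles;
        try (Stream<Path> entries = Files.list(dir)) {
            ruleFiles = entries
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(extension))
                .collect(Collectors.toList());
        }
        // TreeMap keeps the rules in a stable, name-sorted order.
        Map<String, String> rules = new TreeMap<>();
        for (Path file : ruleFiles) {
            rules.put(file.getFileName().toString(), Files.readString(file));
        }
        return rules;
    }

    public static void main(String[] args) throws IOException {
        // Demo against a throwaway directory; the file names are made up.
        Path dir = Files.createTempDirectory("go-rules");
        Files.writeString(dir.resolve("rule-a.yml"), "sparql: >\n  SELECT ?s WHERE { ?s ?p ?o }\n");
        Files.writeString(dir.resolve("notes.txt"), "not a rule");
        System.out.println(loadRules(dir, ".yml").keySet());
    }
}
```

Because the loader only deals in file names and text, the same shape would work whether the rules are kept in the minerva repo or pulled from somewhere like sparqlr.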

> Currently, this is run during certain minerva requests from Noctua?

Correct. I stole the hook to request a gpad file for the current model as a test. Shouldn't be hard to move it around when we decide where it should go.

> So, when the rules are run, and the report is generated, it gets saved off. Does Minerva know yet if the rule failed (by seeing results of the query)? Like how does it show up as having failed?

Right now it just displays the report as text; nothing gets saved anywhere (and the report isn't really much yet except proof that the sparql queries executed). I haven't yet created a protocol for deciding on/displaying rule failure. Need to talk/think/document that out. I think Seth's json snippet above seems like a good start. The Noctua client could pick that up and then give the user direct feedback.

Another thing to consider is how this might work in batch mode. This is really one at a time and it's not super fast.

cmungall commented 5 years ago

This is great.

How does the interplay of inference and reasoning work here? There is a prefix declared for arachne inferred_type, but it looks like you are querying for rdf:type with the assumption this is already inferred. We have also been discussing patterns for storing inference in the user triplestore, e.g. separate graphs vs separate predicates. I assume we don't want the patterns to diverge from the minerva triplestore.

goodb commented 5 years ago

@cmungall as it stands, the SPARQL is executed over the post-Arachne-reasoned model in the same way that @balhoff does the GPAD generation. All inferred types and relations are there. I had the arachne prefix in that query as I was starting to think about discriminating between inferences and direct assertions for the rule I was starting with (the one that came up in the Reactome discussion last week). It's not actually using that (or working entirely) right now, but both rdf:type and inferred_type edges are present in the graph to be used.

This implementation is just based on what is there for the taking in Minerva right now. Just so everyone is on the same page with how this works, all the important bits to know about are in this one-liner: response.data.exportModel = m3.getGo_rules_validator().executeRules(m3.createInferredModel(model.getModelId()));

I hooked the rule validator into the m3 (molecular model manager) class which has the 'create inferred model' method on it. When executed, that will pull the relevant go-cam out of the in-memory model collection, apply the pre-computed Arachne rules to infer new edges, and return an Arachne-generated 'Working Memory' object that contains all the asserted and inferred triples needed for SPARQLing plus some extras like explanations. The rule validator then just uses Jena to run the SPARQL extracted from each rule. This does not use the Minerva Blazegraph instance at all.

For on-demand user-facing operations, I don't think this is a bad pattern. For pipeline work, it will probably make more sense to do all the inference up front, store it in a new graph, and then execute the queries. It might be possible to write queries in such a way that they operate in both contexts without changes? Maybe by specifying the name of the graph as part of the query.
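
To make the "graph name as part of the query" idea concrete, here is a small Java sketch that wraps the same pattern in a GRAPH clause only when a graph IRI is supplied. The class name, method name, and example IRI are hypothetical, and the IRI is inserted without validation, which a real implementation would want to handle.

```java
// Hypothetical helper: one rule pattern, usable against an in-memory model
// (default graph) or against a named graph in a triplestore.
public class GraphScopedQuery {

    // Builds a SELECT query; a null or empty graphIri means "query the default graph".
    static String buildSelect(String projection, String pattern, String graphIri) {
        String body = (graphIri == null || graphIri.isEmpty())
            ? pattern
            : "GRAPH <" + graphIri + "> { " + pattern + " }";
        return "SELECT " + projection + " WHERE { " + body + " }";
    }

    public static void main(String[] args) {
        String pattern = "?s ?p ?o .";
        // Interactive mode: the reasoned model is the whole dataset.
        System.out.println(buildSelect("?s", pattern, null));
        // Pipeline mode: inferences were stored up front in a named graph (IRI made up).
        System.out.println(buildSelect("?s", pattern, "http://example.org/inferred/model-1"));
    }
}
```

This keeps the rule text itself context-free, with the caller deciding whether a graph scope applies.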

Before optimizing for batch mode though, let's nail down interactive mode. Assuming SPARQL, we need to decide what values can be returned by all queries. I guess we can start with a list of 'violating entities' as above, which would be composed of the URIs for the entities that tripped up the validation checks. Anything else that we would want and could be consistently structured? Are there other kinds of rules apart from statement-not-allowed that we need to consider?

FWIW I think I like the .yml pattern now. Easy to provide metadata for each rule that can be delivered back in the app context. Also easy to extend to other approaches (e.g. shex, rule languages) as long as the rules can be represented with just strings of text.

dougli1sqrd commented 5 years ago

The sparql yaml could also explain what it thinks is a violation by naming sparql variables it expects for easy reporting. For example:

variables:
  - symbol: "?gp"
    label: "Gene Product Instance"
    ??: ??
sparql: >
    ?gp ?r ?foo .

and then when the query results come back, minerva would know what each sparql variable means in a report, or something like that. Seems like we could do lots of stuff with something like that.
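
A possible shape for that reporting step, sketched in Java: given the symbol-to-label mapping from the rule's variables section and one row of query bindings, produce a readable line. The class and method names are hypothetical, the row is modelled as a plain map rather than a Jena result, and the example URIs are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical report helper driven by the "variables" metadata of a rule.
public class VariableLabelReport {

    // Renders one result row, using the declared label where one exists
    // and falling back to the raw variable symbol otherwise.
    static String describe(Map<String, String> labels, Map<String, String> row) {
        StringJoiner out = new StringJoiner("; ");
        for (Map.Entry<String, String> binding : row.entrySet()) {
            String label = labels.getOrDefault(binding.getKey(), binding.getKey());
            out.add(label + " = " + binding.getValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> labels = Map.of("?gp", "Gene Product Instance");
        // LinkedHashMap preserves the order in which variables were bound.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("?gp", "http://example.org/gp1");
        row.put("?r", "http://example.org/rel");
        System.out.println(describe(labels, row));
    }
}
```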

goodb commented 5 years ago

Yes, something like that makes sense. Might add a Literal/URI type parameter for each symbol as well. On the other hand the more we push for things to be generic, the easier it will be to implement.

dougli1sqrd commented 5 years ago

I think also it'll depend a lot on what you need when implementing it for what Noctua wants to see I guess. But yes! Types would be cool too!

goodb commented 5 years ago

Spurred on by Wikidata's official adoption of shex, I added a shex validation pattern into the Minerva GO rules framework: https://github.com/geneontology/minerva/commit/712d4ce625520b8b6f4dd844380f33f5fec58342
One thing that this brought up is the concept of sending a 'focus node' or list of nodes of importance along with the request (or querying them out of the graph on the server if we want the logic there). This allows for node-specific validation. Could be relevant to both sparql and shex implementations.

Next steps on this ticket could be:

  1. define the program interface for service I/O more precisely
  2. implement a first cut at a selection of useful rules
  3. move testing into a workbench
  4. embed rule access directly within Noctua graph and form interfaces.

If anyone is interested in testing out some shex rules, this could be useful: http://shexjava.lille.inria.fr/demonstrator

goodb commented 5 years ago

Some team member feedback on the shex pattern would be useful. Just want to see if others think this is worth continuing. To give you an idea, here is a shex file that is a start at implementing GO_RULE2 (with some other stuff as well): example_shex.txt

And here is a ttl file for a go-cam that includes all of the additional triples Arachne reasoning provides: expanded_reactome-homosapiens-A_tetrasaccharide_linker_sequence_is_required_for_GAG_synthesis-ttl.txt

You can test these by pasting them into the shex demonstrator of your choice - e.g. http://shexjava.lille.inria.fr/demonstrator

goodb commented 5 years ago

Informational. A lot of work relevant to this ticket has happened over at https://github.com/geneontology/go-shapes

It looks like we are ready to use the patterns developed in the go-shapes repo to implement this service. Resolution for the moment is to use shex and use the java-shex implementation.

goodb commented 5 years ago

@balhoff @kltm would like to talk about how to add the shex-based validation service into minerva. From here I see two main potential paths:

  1. Extend the existing InferenceProvider code constellation with shex power. e.g. In the main handler JsonOrJsonpBatchHandler around line 200 we would have another boolean like conformstoSchema and we'd extend the CachingInferenceProviderCreatorImpl with a method that ran the shex validation after the reasoner was updated.
  2. Implement as a separate service with a new handler like search API.

Option 1 conforms more to the minerva architecture, while option 2 is simpler and perhaps more flexible for other uses - e.g. by clients that don't like dealing with the minerva patterns. For option 2 we would need to think a little about the reasoning step to avoid redundancy.

??

kltm commented 5 years ago

@goodb The current roadmap (as described earlier in this thread and discussed at the January hackathon) is that minerva returns the results on a subset of shapes with every action, leaving it to the client to deal with as part of the current workflow. This means that clients have a trivial upgrade path. This is essentially your "1" here. We have previously talked about "2", but when we talked about exactly how that would shape up, it seemed to be agreed that it would be more effort at this stage. (I have no separate documentation for this, but I remember it from the hackathon, and it was what caused the creation of this ticket as formulated.)

goodb commented 5 years ago

Okay roger that. I will follow the main minerva protocol then. Actually wouldn't be hard to do both if the need arises.

kltm commented 5 years ago

Touched base with @goodb a little bit about format; it looks like what we had in https://github.com/geneontology/minerva/issues/212#issuecomment-458822427 doesn't quite work and he is experimenting with a pattern that can better capture the useful information from ShEx.

goodb commented 4 years ago

Informational update. (Noting that a lot of relevant discussion has happened over on https://github.com/geneontology/go-shapes/issues/197 ).

The dev branch of the Noctua/Minerva code base now contains a shex (and OWL) validation service. This is currently running live on noctua-dev. The input to the service is a go-cam owl model (e.g. .ttl file) and the output is a report indicating whether the model is logically consistent (owl) and whether or not it is in agreement with the shex schema and query map. For the shex validation, an informative response about why a given shape did not match the expected node in the submitted owl model is provided. This report comes back as a JSON object when accessed via the minerva web service (integrated into the main service that does everything). In addition, all of the shapes matching all of the nodes in the model are returned in the payload. When executed from the command line, 2 report files are created. One is a simple list of all the models tested and true/false for owl valid and shex valid. The other file is a more detailed tab-delimited reporting of the reasons for any errors (containing much the same information as the JSON object).
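
To give a feel for the simple per-model summary, here is a Java sketch that renders one line per model with the two true/false flags. The class name, the header names, and the choice of tabs as the separator are assumptions for illustration and may not match the files the command line tool actually writes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical rendering of the per-model summary report.
public class ValidationSummary {

    // Outcome for one model: logically consistent (OWL) and conformant (ShEx).
    static final class ModelResult {
        final String modelId;
        final boolean owlValid;
        final boolean shexValid;

        ModelResult(String modelId, boolean owlValid, boolean shexValid) {
            this.modelId = modelId;
            this.owlValid = owlValid;
            this.shexValid = shexValid;
        }
    }

    // One header line followed by one tab-separated line per model.
    static List<String> toLines(List<ModelResult> results) {
        List<String> lines = new ArrayList<>();
        lines.add("model\towl_valid\tshex_valid");
        for (ModelResult r : results) {
            lines.add(r.modelId + "\t" + r.owlValid + "\t" + r.shexValid);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<ModelResult> results = List.of(
            new ModelResult("model-a", true, false),
            new ModelResult("model-b", true, true));
        toLines(results).forEach(System.out::println);
    }
}
```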

Potential extensions:

  1. Add explanations for the OWL errors detected. (Leverage existing code that produces the inference explanations view but focus on causes for errors and format into the Violations object structure used for the shex.)
  2. Find ways to increase speed to support large batch runs.
    • A key slow point right now is looking up the main type information for genes in the models using a GOLR server. This is related to the intention to eliminate the neo owl file from the minerva box environment (aka go_lego.owl). See https://github.com/geneontology/minerva/issues/260 . Recent improvements (batch requests) to GOLR interaction have greatly sped this up, but it's still a place to look for optimization using caching.
    • There may be ways to increase parallelism for the command line interface.
    • Integrate the execution of non-shex-based GO Rules into this framework ???

goodb commented 4 years ago

There will be refinements, but shex validation is now operational both on the command line and in minerva service. Move specific change requests to new issues.

tmushayahama commented 4 years ago

@vanaukenk any plans for validation in NF? Just like the error messages displayed on NF-1.

vanaukenk commented 4 years ago

@tmushayahama

Yes, we need to discuss how best to incorporate the validation reports into the NF.

Ideally, we will have set up the NF such that curators can't make invalid models, but we still might have the issue of reading into NF invalid models originating from the graph that haven't been fixed yet.

One possibility might be to not try to display invalid models, and instead provide a link to the validation report to prompt the curator to fix the model in the graph.