SciGraph / golr-loader

Convert SciGraph queries into json that can be loaded by Golr
Apache License 2.0

[#17] Configure paths for anatomy queries #34

Closed benwbooth closed 7 years ago

benwbooth commented 7 years ago

This pull request adds some extra syntax to the monarch-cypher-queries yaml that allows specifying closure patterns for subject, object, relation, and evidence as a resolved cypher query.

The main change adds a resolveRelationships function to GolrLoader, which calls cypherUtil.resolveRelationships and parses the resolved types out of the returned string. This approach allows use of the ! entailment operator.

I also added the fields subject_closure, object_closure, relation_closure, and evidence_closure to GolrCypherQuery, which is used to parse the yaml files.

I had to add curie-util 0.0.2 as an explicit dependency, otherwise version 0.0.1 would be brought in as a transitive dependency, and golr-loader seems to be coded for 0.0.2.

I wrote a test case in GolrLoaderTest which should test that object_closure is working. I hand-wrote a test graph which is set up in GolrLoadSetup. The entire GolrLoaderTest module was marked as @Ignore, so I removed that module-level annotation, added @Ignore to each individual test, and then added my new test. I'm not sure why all these tests are being ignored, but it could be that the fixtures simply need to be updated and that hasn't been done yet.

I'm using the gene-anatomy query from monarch-cypher-queries as the test query. The resulting object_closure value I get after running the query on my test graph contains:

["http://purl.obolibrary.org/obo/UBERON_0001890","http://purl.obolibrary.org/obo/UBERON_0000955","http://purl.obolibrary.org/obo/UBERON_0000033","http://x.org/body_part","http://purl.obolibrary.org/obo/UBERON_0001062"]

so it looks like the closure query is working.

There is a weird behavior in GolrLoader.serializerRow that I don't quite understand. If the cypher query returns a relation, the code gets the iri property of the relation, then attempts to find a node in the graph with a String ID that matches the relation's iri value. I'm not sure why it's doing this. As a workaround, I had to alter the gene-anatomy query used in the test case so that it does not return the matched relation. Why would a node have an ID that matches the iri of one of its relations? Maybe someone else can shed some light on this. Here is the code from GolrLoader.serializerRow:

      } else if (value instanceof Relationship) {
        String objectPropertyIri =
            GraphUtil.getProperty((Relationship) value, CommonProperties.IRI, String.class).get();
        Node objectProperty = graphDb.getNodeById(graph.getNode(objectPropertyIri).get());
        serializer.serialize(key, objectProperty);
      }

Fixes #17.

kshefchek commented 7 years ago

+1

Since relations have structure (class-subclass), they are in the graph as both nodes and edges. In order to generate the relation subclass closure we need to first return the node with the same IRI and traverse its parents.
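As a rough illustration of the traversal described above, here is a minimal sketch that computes a subclass closure by walking parents from a starting IRI. It uses a plain in-memory map rather than Neo4j, and the class name, method name, and `ex:` relation IRIs are all hypothetical; this is not the golr-loader implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the graph is modeled as a map from an IRI to its direct parent IRIs.
final class RelationClosureSketch {

  // Returns the start IRI plus every ancestor reachable through the parent map.
  static Set<String> closure(String startIri, Map<String, List<String>> parents) {
    Set<String> seen = new LinkedHashSet<>();
    Deque<String> todo = new ArrayDeque<>();
    todo.add(startIri);
    while (!todo.isEmpty()) {
      String iri = todo.poll();
      // Set.add returns false for an IRI that was already visited, which guards against cycles.
      if (seen.add(iri)) {
        todo.addAll(parents.getOrDefault(iri, List.of()));
      }
    }
    return seen;
  }

  public static void main(String[] args) {
    // Made-up relation hierarchy, for illustration only.
    Map<String, List<String>> parents = Map.of(
        "ex:orthologous_to", List.of("ex:homologous_to"),
        "ex:homologous_to", List.of("ex:related_to"));
    System.out.println(closure("ex:orthologous_to", parents));
  }
}
```

The point of the sketch is that the edge alone carries no hierarchy; the closure can only be built once the relation is looked up as a node and its parents are followed.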

benwbooth commented 7 years ago

@kshefchek That makes sense. Thanks for the explanation!

kshefchek commented 7 years ago

I'm also thrown off by the node ID part, though. The node ID is numeric, so I'm not sure how this works unless the method is referencing the IRI property.

benwbooth commented 7 years ago

The GraphTransactionalImpl module in SciGraph defines an idMap, which is a hashmap from String IDs to numeric node IDs. So every time you add a node to a graph, it adds a String ID mapping for it. The String ID is not actually stored in the Neo4j database.
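A minimal sketch of that idea, assuming nothing about SciGraph's actual code beyond what is described here: a side map from String IDs to numeric IDs that is filled in whenever a node is created and consulted on lookup. The class and method names below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a String-ID-to-numeric-ID side map; not SciGraph's implementation.
final class IdMapSketch {
  private final Map<String, Long> idMap = new HashMap<>();
  private long nextId = 0;

  // Creating a node registers its String ID; an IRI that was seen before keeps its numeric ID.
  long createNode(String iri) {
    return idMap.computeIfAbsent(iri, k -> nextId++);
  }

  // Lookup goes through the side map, so the String ID never needs to be stored on the node.
  Optional<Long> getNode(String iri) {
    return Optional.ofNullable(idMap.get(iri));
  }

  public static void main(String[] args) {
    IdMapSketch graph = new IdMapSketch();
    graph.createNode("http://purl.obolibrary.org/obo/UBERON_0000955");
    System.out.println(graph.getNode("http://purl.obolibrary.org/obo/UBERON_0000955").isPresent());
  }
}
```

This matches the shape of the call in serializerRow, where the numeric ID returned for an IRI is then passed to getNodeById.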

kshefchek commented 7 years ago

Regardless, we do want to enable this to work on relations since we index the relation subclass closure, for example:

https://solr.monarchinitiative.org/solr/golr/select/?defType=edismax&qt=standard&indent=on&wt=json&rows=10&start=0&fl=subject,subject_label,object,object_label,relation,relation_label,relation_closure,evidence,evidence_label,source,is_defined_by,qualifier&facet=true&facet.mincount=1&facet.sort=count&json.nl=arrarr&facet.limit=25&facet.method=enum&fq=object_category:%22phenotype%22&fq=subject_category:%22disease%22&fq=subject_closure:%22OMIM:209850%22&q=:

This isn't really a good example, but we use it when querying homology relations.

kshefchek commented 7 years ago

I vaguely recall now that there is an external map file containing IRI to node/edge ID mappings. This is likely what is in the SciGraphIdMap file.

Edit: yes this seems to be the case:

import java.io.File;
import java.util.Map;
import org.mapdb.DB;
import org.mapdb.DBMaker;

// Open the SciGraphIdMap file with MapDB and fetch the ID map by name
File dbLocation = new File("/home/kshefchek", "SciGraphIdMap");
DB db = DBMaker.newFileDB(dbLocation).closeOnJvmShutdown().transactionDisable().mmapFileEnable().make();
Map<String, Object> map = db.getHashMap("io.scigraph.neo4j.IdMap");

// Print each key together with the value it maps to
for (Map.Entry<String, Object> entry : map.entrySet()) {
  String key = entry.getKey();
  Object value = entry.getValue();
  System.out.println(key + ": " + value);
}

Outputs:

http://purl.obolibrary.org/obo/GO_0000578: 319175
http://purl.obolibrary.org/obo/GO_0000502: 575893
http://purl.obolibrary.org/obo/GO_0000503: 575891

... etc

kshefchek commented 7 years ago

This is going to sound hypocritical given the current state of the code, but I would like to eventually move away from the hard coding of subject, object, etc. These refer to a specific solr schema, and we can foresee cases where we will want to reuse this code with an entirely different schema. This is probably not needed for this PR and can wait, @cmungall your thoughts?

Eventually I think it would be nice to make much of this configurable, for example something like this:

query: |
    MATCH path=(foo)<-[bar:Owl:relationship]-(baz)
    RETURN DISTINCT path,
    foo, bar, baz

expandedFields:
    foo_closure:
        relations:
            - OWL:subClassOf
            - BFO:part_of
            - Some:transitiveProperty
        type: closure
        label: foo_label
        map: foo_map
    foo_gene:
        relations:
            - RO:has_gene
        type: direct
        label: foo_gene_label
        map: foo_gene_map

kshefchek commented 7 years ago

@benwbooth could you post an example config? I will test it on a sample graph with our queries.

benwbooth commented 7 years ago

@kshefchek Here is the example config I used for testing:

query: |
        MATCH path=(subject:gene)-[relation:RO:0002206]->(object:`anatomical entity`)
        RETURN DISTINCT path,
        subject, object, relation,
        'gene' AS subject_category,
        'anatomy' AS object_category,
        'direct' AS qualifier
object_closure: "rdfs:subClassOf|BFO:0000050"

kshefchek commented 7 years ago

Thanks! Is the default behavior to fall back to the hardcoded relations?

benwbooth commented 7 years ago

Yes, and the hardcoded relations are automatically added to any custom relations you specify as well.
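To make the fallback behavior concrete, here is a hedged sketch of how defaults and custom relations could be combined. It assumes the custom value is a plain pipe-separated list such as "rdfs:subClassOf|BFO:0000050", which ignores the cypher resolution step and the ! operator from the PR. The class name, method name, and the single default relation are hypothetical stand-ins, not the actual hardcoded list in golr-loader.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of "defaults are always included, custom relations are added on top".
final class ClosureRelationsSketch {
  // Stand-in default; the real hardcoded relations are not reproduced here.
  static final List<String> DEFAULTS = List.of("rdfs:subClassOf");

  static Set<String> effectiveRelations(String custom) {
    // Start from the defaults so they apply even when nothing is configured.
    Set<String> out = new LinkedHashSet<>(DEFAULTS);
    if (custom != null && !custom.trim().isEmpty()) {
      for (String relation : custom.split("\\|")) {
        String trimmed = relation.trim();
        if (!trimmed.isEmpty()) {
          out.add(trimmed); // relations already present in the defaults are not duplicated
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(effectiveRelations(null));
    System.out.println(effectiveRelations("rdfs:subClassOf|BFO:0000050"));
  }
}
```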

kshefchek commented 7 years ago

Ran locally, all looks good to me!