geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

Materialize subClassOf closures for each GO term for Blazegraph load #708

Open dougli1sqrd opened 6 years ago

dougli1sqrd commented 6 years ago

The ontology loaded into Blazegraph should materialize rdfs:subClassOf links for each class in the ontology.

If we have:

:A rdfs:subClassOf :B .
:B rdfs:subClassOf :C .
:C rdfs:subClassOf :D .

We should materialize the following triples for :A:

:A rdfs:subClassOf :B .
:A rdfs:subClassOf :C .
:A rdfs:subClassOf :D .

This way, when querying terms in SPARQL, we would not have to use costly property paths. Currently we have to write:

?A rdfs:subClassOf* :D .

If we had the extra subclass links, we could do:

?A rdfs:subClassOf :D .

That would be much faster.
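The materialization described above amounts to a transitive closure over the direct rdfs:subClassOf edges. A minimal sketch in plain Python, assuming the edges are available as (subclass, superclass) pairs; the function name `subclass_closure` is hypothetical and this is not how the GO pipeline or Blazegraph does it:

```python
from collections import defaultdict

def subclass_closure(direct_edges):
    """Return every (sub, ancestor) pair reachable via one or more direct edges."""
    # Index direct superclasses by subclass.
    parents = defaultdict(set)
    for sub, sup in direct_edges:
        parents[sub].add(sup)

    closure = set()
    for start in list(parents):
        # Depth-first walk up the hierarchy; `seen` guards against cycles.
        stack = list(parents[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(parents.get(node, ()))
    return closure

# The example hierarchy from this issue.
direct = [(":A", ":B"), (":B", ":C"), (":C", ":D")]
closure = subclass_closure(direct)
for sub, sup in sorted(closure):
    print(f"{sub} rdfs:subClassOf {sup} .")
```

One caveat: in SPARQL 1.1, `rdfs:subClassOf*` is a zero-or-more path, so it also matches :D itself. The sketch above computes the one-or-more closure, so a plain `?A rdfs:subClassOf :D` query over the materialized triples would not return :D unless reflexive triples are added as well.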

@cmungall What do you think? Is this something we should do? I think many useful queries will be prohibitively slow at the moment.

balhoff commented 6 years ago

I think this should go in another graph. You could create this using a SPARQL Update with blazegraph-runner, or using Arachne via the reasoner command and a custom rule.

dougli1sqrd commented 6 years ago

Oh you think so? Why should this go in another graph? @cmungall thinks this should be part of the simplified rdf annotations.

balhoff commented 6 years ago

Only if you want to easily distinguish the triples that came from the published ontologies from the ones that were added. If there's never a reason to do that, then I guess it doesn't matter. But you might have a client application that wants to run progressive queries down or up the hierarchy. That would be hard if all the redundant subClassOf links are mixed in.
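To illustrate the point about progressive queries, here is a small sketch that models two named graphs as Python sets. The graph names `asserted` and `inferred` and the helper `parents` are hypothetical, chosen only for this example:

```python
# Hypothetical graph names; not taken from the GO pipeline.
ASSERTED = "asserted"
INFERRED = "inferred"

# (subclass, superclass) pairs, split by where they came from.
graphs = {
    ASSERTED: {(":A", ":B"), (":B", ":C"), (":C", ":D")},
    INFERRED: {(":A", ":C"), (":A", ":D"), (":B", ":D")},
}

def parents(term, *graph_names):
    """Superclasses of `term`, looking only in the named graphs."""
    return {
        sup
        for name in graph_names
        for sub, sup in graphs[name]
        if sub == term
    }

# One step up the hierarchy: only the published (asserted) edges.
direct_parents = parents(":A", ASSERTED)

# All ancestors in a single lookup: asserted plus materialized edges.
all_ancestors = parents(":A", ASSERTED, INFERRED)

print(direct_parents)
print(sorted(all_ancestors))
```

If both sets of edges lived in one graph, the one-step lookup would return :B, :C, and :D for :A, and the client could no longer tell which of them is the immediate parent.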

dougli1sqrd commented 6 years ago

oh yeah that's a good point.

dougli1sqrd commented 6 years ago

For example, the SPARQL query for gorule 6 takes a long time:

      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      PREFIX owl: <http://www.w3.org/2002/07/owl#>
      PREFIX RO: <http://purl.obolibrary.org/obo/RO_>
      PREFIX has_evidence: <http://purl.obolibrary.org/obo/RO_0002612>
      PREFIX enabled_by: <http://purl.obolibrary.org/obo/RO_0002333>
      PREFIX occurs_in: <http://purl.obolibrary.org/obo/BFO_0000066>
      PREFIX ECO: <http://purl.obolibrary.org/obo/ECO_>
      PREFIX IEP: <http://purl.obolibrary.org/obo/ECO_0000270>
      PREFIX GO: <http://purl.obolibrary.org/obo/GO_>
      PREFIX biological_process: <http://purl.obolibrary.org/obo/GO_0008150>
      PREFIX molecular_function: <http://purl.obolibrary.org/obo/GO_0003674>
      PREFIX namespace: <http://www.geneontology.org/formats/oboInOwl#hasOBONamespace>
      PREFIX metago: <http://model.geneontology.org/>

      SELECT DISTINCT ?ecotype ?GoTerm ?namespace
      WHERE {

        GRAPH ?g {
          ?g metago:graphType metago:gafCam .

          ?b has_evidence: ?evidence .
          ?evidence a IEP: .
          ?evidence a ?ecotype .

          # Main Triples
          # enabled_by will find inappropriate IEP in Molecular Function
          # occurs_in will find inappropriate IEP in Cellular Component
          {
            # ?s is a MF, ?o is a GP
            ?s enabled_by: ?o .
            ?s a ?GoTerm .
          } UNION {
            # ?s is a MF, ?o is a Go Term
            ?s occurs_in: ?o .
            ?o a ?GoTerm .
          }

          ?b owl:annotatedSource ?s .
          ?b owl:annotatedTarget ?o .

          FILTER (?GoTerm != owl:NamedIndividual)
          FILTER (?ecotype != owl:NamedIndividual)
        }

        ?GoTerm namespace: ?namespace .
      }
      LIMIT 100