Swirrl / drafter

A clojure service and a client to it for exposing data management operations to PMD

IDEA: use from / from named to restrict queries (for caching) #236

Open ricroberts opened 6 years ago

ricroberts commented 6 years ago

If you have a query like this:

SELECT *
WHERE {
  GRAPH <http://dataset-graph> {
    # observation clauses
    ?s a qb:Observation .
    ?s ?p ?o .
  }

  ?o rdfs:label ?lbl
}

We don't know what graph the labels might come from, so with the new caching approach, this query would need to use the modified time of the whole endpoint in the cache key. But we could do a pre-query to find a list of all the graphs which contain vocabs or geography data.

One option would be to use a VALUES clause:

SELECT *
WHERE {
  GRAPH <http://dataset-graph> {
    # observation clauses
    ?s a qb:Observation .
    ?s ?p ?o .
  }

  GRAPH ?g {
    ?o rdfs:label ?lbl
  }

  # populate values clause with results of pre-query:
  VALUES ?g { <graph-1> <graph-2> <graph-3> }
}

... but we know Stardog isn't very good at optimising these queries.

A better alternative might be to use FROM or FROM NAMED:

SELECT *
FROM <graph-1>
FROM <graph-2>
FROM <graph-3>
FROM NAMED <http://dataset-graph> # OPTIONAL
WHERE {
  GRAPH <http://dataset-graph> {
    # observation clauses
    ?s a qb:Observation .
    ?s ?p ?o .
  }

  ?o rdfs:label ?lbl
}

Drafter should:

  1. parse the query for any literal graphs used in GRAPH clauses
  2. parse the query for FROM NAMED graphs
  3. parse the query for FROM graphs

It should then use the latest modified time across (1 intersect 2) and 3 as the modified time in the cache key.
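A rough Python sketch of those three steps, just to make the set logic concrete. It assumes "(1 intersect 2) and 3" means the union of the intersection with the FROM graphs, it uses naive regexes where drafter would need a real SPARQL parser, and the names cache_graphs, cache_modified_time and the modified-time lookup are all hypothetical:

```python
import re
from datetime import datetime

# Naive patterns for illustration only: a real implementation needs a SPARQL
# parser, since these would also match inside comments or string literals.
GRAPH_RE = re.compile(r"\bGRAPH\s+<([^>]*)>", re.IGNORECASE)
FROM_NAMED_RE = re.compile(r"\bFROM\s+NAMED\s+<([^>]*)>", re.IGNORECASE)
FROM_RE = re.compile(r"\bFROM\s+<([^>]*)>", re.IGNORECASE)


def cache_graphs(query):
    """Return the graphs whose modified times should feed the cache key."""
    graph_literals = set(GRAPH_RE.findall(query))    # step 1
    from_named = set(FROM_NAMED_RE.findall(query))   # step 2
    from_graphs = set(FROM_RE.findall(query))        # step 3
    # Assumed reading of "(1 intersect 2) and 3": intersection, then union.
    return (graph_literals & from_named) | from_graphs


def cache_modified_time(query, modified_times):
    """Latest modified time over the relevant graphs, or None if there are none.

    None is a hypothetical signal to fall back to the whole-endpoint time.
    """
    graphs = cache_graphs(query)
    return max((modified_times[g] for g in graphs), default=None)


QUERY = """
SELECT *
FROM <graph-1>
FROM <graph-2>
FROM <graph-3>
FROM NAMED <http://dataset-graph>
WHERE {
  GRAPH <http://dataset-graph> {
    ?s a qb:Observation .
    ?s ?p ?o .
  }
  ?o rdfs:label ?lbl
}
"""

# Made-up modified times, keyed by graph URI.
MODIFIED = {
    "graph-1": datetime(2019, 1, 1),
    "graph-2": datetime(2019, 2, 1),
    "graph-3": datetime(2019, 1, 15),
    "http://dataset-graph": datetime(2019, 3, 1),
}

latest = cache_modified_time(QUERY, MODIFIED)
```

With this reading, a change to any other graph in the endpoint leaves the cache key untouched.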

We should probably also support passing FROM and FROM NAMED on the request as parameters (which would mean that we don't need to parse the query for getting the modified date for caching purposes)

RickMoynihan commented 6 years ago

> We should probably also support passing FROM and FROM NAMED on the request as parameters (which would mean that we don't need to parse the query for getting the modified date for caching purposes)

We should definitely do this, using the SPARQL 1.1 Protocol parameters named-graph-uri (maps to FROM NAMED) and default-graph-uri (maps to FROM). These need to be supported on all SPARQL query endpoints (draftsets, live etc).
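For reference, a client would pass these by repeating the parameter once per graph. A small Python sketch, where the endpoint URL and the helper name sparql_query_url are made up for illustration (only the parameter names come from the SPARQL 1.1 Protocol):

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def sparql_query_url(endpoint, query, default_graphs=(), named_graphs=()):
    """Build a GET URL for a SPARQL 1.1 Protocol query request."""
    params = [("query", query)]
    # Each parameter may appear several times, once per graph URI.
    params += [("default-graph-uri", g) for g in default_graphs]   # FROM
    params += [("named-graph-uri", g) for g in named_graphs]       # FROM NAMED
    return endpoint + "?" + urlencode(params)


# Hypothetical endpoint and graph URIs.
url = sparql_query_url(
    "https://example.org/v1/sparql/live",
    "SELECT * WHERE { ?o rdfs:label ?lbl }",
    default_graphs=["http://graph-1", "http://graph-2"],
    named_graphs=["http://dataset-graph"],
)

# Decoding the query string again shows what the server would receive.
received = parse_qs(urlsplit(url).query)
```

The server can then read the graph lists straight off the request without parsing the query text.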

Additionally we should support the common case of providing better caching/hinting on the default graph via queries that touch vocab graphs, e.g. you might have a query like this (pseudo sparql):

SELECT * WHERE {
  GRAPH <http://my-dataset/graph> {
    ?ds a qb:DataSet ;
        qb:structure/qb:component/qb:codeList/skos:member ?scheme .
  }
  ?scheme rdfs:label ?lbl .
}

In that case it would be good to set a special hint on the request with the query params drafter-named-graph-uri and drafter-default-graph-uri. These would essentially be the same as the SPARQL 1.1 ones, except that the drafter variations will expand virtual URIs which have special meaning. For example, the URI <http://publishmydata.com/drafter/graph/all-vocabs> would be expanded into the set of all vocab graph URIs. Similarly we may support virtual URIs for drafter-graphs:all-datasets, drafter-graphs:all-ontologies etc.

The set of URIs for FROM and FROM NAMED would then be intersected with those allowed for the endpoint and used to calculate modified times for stasher query caching.

URIs supplied on the SPARQL 1.1 *-uri parameters would also be honored in a similar way, but not subject to expansion.
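To illustrate the expand-then-intersect idea, here is a minimal Python sketch. Everything in it is an assumption for the sake of the example: the function names, the vocab graph URIs, the allowed set, and the choice to treat the protocol URIs as a plain union with the expanded drafter ones.

```python
from datetime import datetime

ALL_VOCABS = "http://publishmydata.com/drafter/graph/all-vocabs"

# Hypothetical expansion table: virtual URI -> concrete graph URIs.
VIRTUAL_GRAPHS = {
    ALL_VOCABS: {"http://vocab/a", "http://vocab/b"},
}


def expand(uris, virtual_graphs):
    """Replace each virtual URI with its graphs; pass other URIs through."""
    expanded = set()
    for uri in uris:
        expanded |= virtual_graphs.get(uri, {uri})
    return expanded


def restricted_graphs(drafter_uris, protocol_uris, virtual_graphs, allowed):
    """Graphs to use for a query, limited to those the endpoint allows.

    drafter_uris come from the drafter-* params and are expanded;
    protocol_uris come from the SPARQL 1.1 *-uri params and are not.
    """
    requested = expand(drafter_uris, virtual_graphs) | set(protocol_uris)
    return requested & allowed


def latest_modified(graphs, modified_times):
    """Latest modified time over the graphs, or None when the set is empty."""
    return max((modified_times[g] for g in graphs), default=None)


# Made-up endpoint state.
ALLOWED = {"http://vocab/a", "http://vocab/b", "http://my-dataset/graph"}
MODIFIED = {
    "http://vocab/a": datetime(2018, 5, 1),
    "http://vocab/b": datetime(2018, 6, 1),
    "http://my-dataset/graph": datetime(2019, 3, 1),
}

graphs = restricted_graphs(
    drafter_uris=[ALL_VOCABS],
    protocol_uris=["http://not-in-this-endpoint"],
    virtual_graphs=VIRTUAL_GRAPHS,
    allowed=ALLOWED,
)
key_time = latest_modified(graphs, MODIFIED)
```

Here the dataset graph changing in 2019 does not affect key_time, because only the vocab graphs were requested for the default graph.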

The motivation for this is:

  1. Queries can opt in to better caching behaviour, as vocabs rarely change but the top-level default-graph/endpoint/database modified time does. Queries like the one above need to use the default graph.
  2. By using a drafter-* variant of the SPARQL parameters we can introduce our own special URI semantics and remain compatible with non-drafter endpoints used in dev, e.g. when running against a raw Stardog, as Stardog will ignore the extra parameters.
  3. At some point it would be good to parse queries for bound GRAPH restrictions too, but we think using a special query parameter is a better way to hint things, as the GRAPH approach won't work for setting the default graph and may result in sub-optimal query plans when using hacks like: VALUES ?g { ... }.

RickMoynihan commented 6 years ago

Also we need to fix this issue before doing this one:

https://github.com/Swirrl/drafter/issues/235