HeardLibrary / vandycite

test federated queries to Neptune #81

Open baskaufs opened 2 years ago

baskaufs commented 2 years ago

There are two possible important use cases for doing federated queries against the Neptune triplestore:

  1. Use an application like Fuseki, running on localhost or elsewhere, so that its GUI can be used to experiment with queries.
  2. Combine the contents of Neptune with those of some other endpoint, like the Wikidata Query Service, in a single query.

Note: in the second case, the federated query needs to be run from a third SPARQL endpoint, since for security reasons Neptune isn't able to make federated queries to endpoints outside its VPC.

baskaufs commented 2 years ago

Spun up a localhost Fuseki SPARQL interface (Fuseki supports federated queries, unlike Blazegraph) to run a test.

Note: my first query was

SELECT DISTINCT ?s ?p ?o
WHERE
{
  SERVICE <https://5j6diw4i0h.execute-api.us-east-1.amazonaws.com/sparql> {
    ?s ?p ?o
  }
}
LIMIT 5

which turned out to be a bad idea: the LIMIT sits outside the SERVICE block, so Fuseki apparently tried to pull all of the millions of triples from Neptune before imposing the limit of 5. It resulted in a 503 (or something similar) "service unavailable" error, which sounds like a pretty bad outcome.
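One possible workaround, not tested in this issue, is to move the limit into a subquery inside the SERVICE block so that it is part of what gets sent to the remote endpoint. A minimal sketch in Python that just builds such a query string; the helper name `limited_service_query` is made up for illustration, and whether Fuseki actually forwards the subquery intact to Neptune is an assumption that would need checking:

```python
# Hypothetical helper: builds a federated query whose LIMIT is inside
# the SERVICE block instead of on the outer query.
NEPTUNE_ENDPOINT = "https://5j6diw4i0h.execute-api.us-east-1.amazonaws.com/sparql"


def limited_service_query(endpoint: str, limit: int) -> str:
    """Return a SPARQL query string with a LIMITed subquery inside SERVICE."""
    # The subquery (including its LIMIT) is the whole SERVICE pattern, so the
    # remote endpoint should be the one applying the limit (assumption).
    return f"""SELECT ?s ?p ?o
WHERE {{
  SERVICE <{endpoint}> {{
    SELECT ?s ?p ?o
    WHERE {{ ?s ?p ?o }}
    LIMIT {limit}
  }}
}}"""


query = limited_service_query(NEPTUNE_ENDPOINT, 5)
print(query)
```

The printed query can be pasted into the Fuseki GUI. If the 503 goes away, that would support the guess that the outer LIMIT was only applied after all the triples had been transferred.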

Tried a second query:

PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?s ?label ?parent
WHERE
{
  SERVICE <https://5j6diw4i0h.execute-api.us-east-1.amazonaws.com/sparql> {
    ?s skos:prefLabel ?label.
    FILTER(lang(?label)='en')
    ?top skos:prefLabel 'Visual Arts'@en.
    ?s skos:broader+ ?top.
    ?s skos:broader ?parent.
  }
}

which worked and confirmed that federated queries work fine.

baskaufs commented 2 years ago

Tried using rdflib to perform a federated query in Python. See https://rdflib.readthedocs.io/en/stable/intro_to_sparql.html#querying-a-remote-service for an example.

# hack of example given in documentation
import rdflib

g = rdflib.Graph()
qres = g.query(
    """
    prefix wd: <http://www.wikidata.org/entity/>
    SELECT distinct ?p ?o
    WHERE {
      SERVICE <https://query.wikidata.org/sparql> {
        wd:Q42 ?p ?o .
      }
    }
    LIMIT 10
    """
)

for row in qres:
    print(row)

This query worked fine.

I then tried running some simple federated queries against Neptune, like

import rdflib

g = rdflib.Graph()
qres = g.query(
    """
    SELECT DISTINCT ?class
    WHERE {
      SERVICE <https://5j6diw4i0h.execute-api.us-east-1.amazonaws.com/sparql> {
        <http://rs.tdwg.org/dwc/terms/continent> a ?class.
      }
    }
    """
)

for row in qres:
    print(row)

But it failed with a 404 (not found). Fell back to running the same query directly in Fuseki:

SELECT DISTINCT ?class
WHERE {
  SERVICE <https://5j6diw4i0h.execute-api.us-east-1.amazonaws.com/sparql> {
    <http://rs.tdwg.org/dwc/terms/continent> a ?class.
  }
}

and got an infinitely spinning circle. However, when I restarted Fuseki, it worked. Tried restarting the kernel in the Jupyter notebook, but that didn't help. Still got a 404 from rdflib, which doesn't make sense.
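To narrow down where the 404 comes from, it might help to take rdflib out of the picture and send the inner query straight to the endpoint over plain HTTP. A rough sketch using only the standard library; the function names are made up for illustration, and it assumes the API Gateway endpoint accepts a standard SPARQL protocol GET request with a `query` parameter, which has not been verified here:

```python
import urllib.error
import urllib.parse
import urllib.request


def build_sparql_request(endpoint, query):
    """Build a SPARQL protocol GET request (hypothetical helper)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def probe(endpoint, query, timeout=30):
    """Send the query and return (HTTP status, first 500 characters of body)."""
    request = build_sparql_request(endpoint, query)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read(500).decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # Error statuses such as 404 or 503 arrive as exceptions; report them too.
        return err.code, err.read(500).decode("utf-8", "replace")


# Example call (needs network access, so it is left commented out):
# print(probe(
#     "https://5j6diw4i0h.execute-api.us-east-1.amazonaws.com/sparql",
#     "SELECT DISTINCT ?class WHERE "
#     "{ <http://rs.tdwg.org/dwc/terms/continent> a ?class . }",
# ))
```

If this direct request returns 200 while the rdflib version still returns 404, the problem is more likely in how rdflib sends the SERVICE request than in the endpoint itself.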