callahantiff / PheKnowLator

PheKnowLator: Heterogeneous Biomedical Knowledge Graphs and Benchmarks Constructed Under Alternative Semantic Models
https://github.com/callahantiff/PheKnowLator/wiki
Apache License 2.0

Set-up SPARQL Endpoint #69

Closed callahantiff closed 2 years ago

callahantiff commented 3 years ago

TASK

Task Type: PKT DATA DELIVERY

Select and set up a SPARQL endpoint for exploring KG build data

TODO

Questions:

callahantiff commented 3 years ago

Follow-up

callahantiff commented 3 years ago

@bill-baumgartner - I successfully brought down the endpoint tonight 😄. It's running again; I restarted the container and it came back. The query that took it down is shown below; it failed because I did not include a LIMIT clause. I wonder if we should add something to protect against others doing this, or if there is something we can add to help it restart itself in these situations. Just something for us to discuss tomorrow!

PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?p ?o 
WHERE { 
  VALUES ?p {
    obo:RO_0000087
    obo:RO_0002434
    rdfs:subClassOf
  }
  ?s ?p ?o 
} 

Note that when a LIMIT n clause is added, the query executes totally fine. This query format is the template RH provided me, which is why I was testing it out.

bill-baumgartner commented 3 years ago

Good to know. From grepping our input n-triples file, we would expect the following numbers of responses:

So, in total, this query would have eclipsed the 5M triple limit we had originally set. It may be the case that for queries with many results, users will need to request results in batches using ORDER BY + LIMIT + OFFSET.
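
As a sketch of that batching pattern (the page size of 100,000 is illustrative, not a limit we configured), each request pages through the full result set behind a stable sort order, incrementing OFFSET by the LIMIT value until a page comes back with fewer than LIMIT rows:

PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?p ?o
WHERE {
  VALUES ?p {
    obo:RO_0000087
    obo:RO_0002434
    rdfs:subClassOf
  }
  ?s ?p ?o
}
# ORDER BY makes the paging deterministic; this request fetches
# the second page of 100,000 rows.
ORDER BY ?s ?p ?o
LIMIT 100000
OFFSET 100000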

callahantiff commented 3 years ago

I agree. I did some experimenting with the SPARQL Proxy settings (i.e., ENABLE_QUERY_SPLITTING and MAX_CHUNK_LIMIT) in docker-compose.yml and have some interesting insight to share in our meeting this afternoon. In a nutshell, I can get it to retrieve all of the results, but it then generates a different error when trying to return them (with ENABLE_QUERY_SPLITTING enabled, results are returned as JSON). See below:

buffer.js:799
api_1      |     return this.utf8Slice(start, end);
api_1      |                 ^
api_1      | 
api_1      | Error: Cannot create a string longer than 0x1fffffe8 characters
api_1      |     at Buffer.toString (buffer.js:799:17)
api_1      |     at Request.<anonymous> (/app/node_modules/request/request.js:1128:39)
api_1      |     at Request.emit (events.js:315:20)
api_1      |     at IncomingMessage.<anonymous> (/app/node_modules/request/request.js:1076:12)
api_1      |     at Object.onceWrapper (events.js:421:28)
api_1      |     at IncomingMessage.emit (events.js:327:22)
api_1      |     at endReadableNT (internal/streams/readable.js:1327:12)
api_1      |     at processTicksAndRejections (internal/process/task_queues.js:80:21) {
api_1      |   code: 'ERR_STRING_TOO_LONG'
api_1      | }
api_1      | npm ERR! code ELIFECYCLE
api_1      | npm ERR! errno 1
api_1      | npm ERR! sparql-proxy@0.0.0 start: `node --experimental-modules src/server.mjs`
api_1      | npm ERR! Exit status 1
api_1      | npm ERR! 
api_1      | npm ERR! Failed at the sparql-proxy@0.0.0 start script.
api_1      | npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
api_1      | 
api_1      | npm ERR! A complete log of this run can be found in:
api_1      | npm ERR!     /app/.npm/_logs/2021-01-08T18_52_49_474Z-debug.log
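
For the record, the proxy settings mentioned above are environment variables on the SPARQL Proxy service in docker-compose.yml. A minimal sketch of the relevant fragment (the service name is inferred from the api_1 container in the log above; the image name, backend URL, and chunk size are illustrative, not our deployed values):

# docker-compose.yml fragment for the SPARQL Proxy service.
# ENABLE_QUERY_SPLITTING makes the proxy split large SELECT queries
# into chunked sub-queries; MAX_CHUNK_LIMIT caps the rows per chunk.
services:
  api:
    image: dbcls/sparql-proxy  # assumed image name
    environment:
      - SPARQL_BACKEND=http://blazegraph:9999/blazegraph/sparql  # illustrative
      - ENABLE_QUERY_SPLITTING=true
      - MAX_CHUNK_LIMIT=100000  # illustrative chunk size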
callahantiff commented 3 years ago

@bill-baumgartner so we have a record, here are the two queries we ran against the Endpoint via the command line:

wget Simple:

wget -qO- "http://35.233.212.30/blazegraph/sparql?query=select * where { ?s ?p ?o } " > filename.xml


wget Relation Template (the URL-encoded form of the relation-template query shown above):

wget -qO- "http://35.233.212.30/blazegraph/sparql?query=PREFIX%20obo%3A%20%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2F%3E%20PREFIX%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%20SELECT%20%3Fs%20%3Fp%20%3Fo%20WHERE%20%7B%20VALUES%20%3Fp%20%7B%20obo%3ARO_0000087%20obo%3ARO_0002434%20rdfs%3AsubClassOf%20%7D%20%3Fs%20%3Fp%20%3Fo%20%7D" > filename.xml
callahantiff commented 3 years ago

UPDATE

Need to do the following things to fully address this issue:

callahantiff commented 3 years ago

When the endpoint goes down, run the following from the directory shown below within the GCP instance:

~/PheKnowLator/builds/deploy/triple-store$ docker-compose up -d
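
To verify the restart, standard docker-compose commands can be run from the same directory (the api service name is inferred from the api_1 prefix in the log above):

docker-compose ps                    # all services should report State "Up"
docker-compose logs --tail=50 api    # inspect the proxy's recent output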
callahantiff commented 2 years ago

@bill-baumgartner - I am going to close this for now. I think the 99% automated approach we are using now is totally fine, as the endpoint is not something we plan to keep forever. Let me know if you disagree.