LDflex / LDflex-Comunica

Comunica query engine support for the LDflex language
MIT License
16 stars 5 forks source link

Queries involving blank nodes return Bad Request error. #22

Open jeswr opened 3 years ago

jeswr commented 3 years ago

Not exactly sure whether this best belongs here, in the comunica repo, or possibly even the LDflex repo, depending on how you want to go about solving the issue.

When using LDflex to query over a SHACL constraint in a repo we are hosting - I want to use the following pattern.

const { PathFactory } = require('ldflex');
const { default: ComunicaEngine } = require('@ldflex/comunica');
const { namedNode } = require('@rdfjs/data-model');

// The JSON-LD context for resolving properties
const context = {
  "@context": {
    "@vocab": "http://xmlns.com/foaf/0.1/",
    "friends": "knows",
    "label": "http://www.w3.org/2000/01/rdf-schema#label",
    "property": "http://www.w3.org/ns/shacl#property",
    "path": "http://www.w3.org/ns/shacl#path"
  }
};
// The query engine and its source
const queryEngine = new ComunicaEngine('http://rsmsrv01.nci.org.au:8890/sparql');
// The object that can create new paths
const path = new PathFactory({ context, queryEngine });

async function showShape(shape: any) {
  console.log(`This person is ${await shape.label}`);
    for await (const property of shape.property) {
        console.log("path associated to property", `${await property.label}`)
    }
}

const myShape = path.create({ subject: namedNode('http://example.org/humanWikidataShape') });
showShape(myShape);

Because each property in the shape is a blank node property is a blank node so calling await property.label cases the query

SELECT ?label WHERE {
  <urn:comunica_skolem:source_0:nodeID://b12796> <http://www.w3.org/2000/01/rdf-schema#label> ?label.
}

to be issued by LDflex.

With the current engine setup this query eventually gets passed through the replaceBlankNodes function within ActorRdfResolveQuadPatternSparqlJson and hence the query that actually gets sent by the engine is

SELECT ?nodeID://b12796 ?label WHERE { ?nodeID://b12796 <http://www.w3.org/2000/01/rdf-schema#label> ?label. }

This has invalid variable names so the SPARQL endpoint returns an invalid variable name error

Error: Invalid SPARQL endpoint (http://rsmsrv01.nci.org.au:8890/sparql) response: Bad Request (400)
    at AsyncIteratorJsonBindings.fetchBindingsStream (/home/jesse/Documents/github/on2ts-dev/on2ts/node_modules/@comunica/actor-rdf-resolve-quad-pattern-sparql-json/lib/AsyncIteratorJsonBindings.js:56:19

But more to the point; the blank nodes should not be changed to variables here. I'm guessing either the query engine config needs to be fixed here; there is some bug in comunica engine.

rubensworks commented 3 years ago

This may be a Comunica bug indeed. I suspect something is failing on the // in the blank node label. If so, then the problem probably lies here somewhere: https://github.com/comunica/comunica/blob/45b015588dce4723ac9259314723ca4e598f6417/packages/actor-rdf-resolve-quad-pattern-federated/lib/FederatedQuadSource.ts#L80-L132

jeswr commented 3 years ago

Part of the issue is that the configuration of the engine is converting blank nodes to variables (https://github.com/comunica/comunica/blob/bf5e6a2a0807188bd0bab224da1746b9f48c5d33/packages/actor-rdf-resolve-quad-pattern-sparql-json/lib/ActorRdfResolveQuadPatternSparqlJson.ts#L34-L68) which in this particular use case, and IMO most use cases for LDflex, is unexpected behavior (as a matter of fact I don't quite follow why this conversion is being done in the default configuration of the comunica engine either) so this needs to be changed somehow.

The second issue is that Virtuoso expects blank nodes to be of the form

SELECT ?label WHERE { <nodeID://b12796> <http://www.w3.org/2000/01/rdf-schema#label> ?label. }

I created a PR in SPARQL.js so that this doesn't get resolved as a relative IRI - but in addition comunica needs to convert the blank node from the skolemized format to this format.

I think the best way of doing this would be to create some actors for specific SPARQL endpoints (virtuoso, fuseki etc.) that can handle this as well as other components of the endpoints that are not following the SPARQL spec correctly. If you think this is the correct approach I make a fork to create these actors.

Regarding https://github.com/comunica/comunica/blob/45b015588dce4723ac9259314723ca4e598f6417/packages/actor-rdf-resolve-quad-pattern-federated/lib/FederatedQuadSource.ts#L80-L132 - one could strip off the nodeID:// in the skolemisation process - and then add this back in the Virtuoso actor (though I don't think this is necessary)

rubensworks commented 3 years ago

Part of the issue is that the configuration of the engine is converting blank nodes to variables

This is a standard operation in SPARQL query engines. A blank node occuring within a SPARQL query is basically a non-selectable variable. Instead of having checks all over the place for if var or bnode, engines typically normalize all to actual variables. (Note that this is not done for blank nodes occuring in the data)

The second issue is that Virtuoso expects blank nodes to be of the form

SELECT ?label WHERE { nodeID://b12796 http://www.w3.org/2000/01/rdf-schema#label ?label. }

<nodeID://b12796> is not a blank node, but a skolem IRI. So Virtuoso seems to be handling (internal) blank nodes in a safe way already, so Comunica should be able to work with this across different query executions. Since this is not a blank node, the skolemization logic inside Comunica's FederatedQuadSource will never be called.

AFAICS, the only problem then seems to be the SPARQL parsing issues, which you seem to have fixed already in https://github.com/RubenVerborgh/SPARQL.js/pull/124. If you use this patched version, does everything work for you?

jeswr commented 3 years ago

Thanks for the clarifications!

The patch allowed me to successfully run the following query with the skolem IRI (using the default setup for the comunica engine with nodejs)

SELECT ?label WHERE { <nodeID://b12796> <http://www.w3.org/2000/01/rdf-schema#label> ?label. }

the original query

SELECT ?label WHERE {
  <urn:comunica_skolem:source_0:nodeID://b12796> <http://www.w3.org/2000/01/rdf-schema#label> ?label.
}

still gets converted to

SELECT ?nodeID://b12796 ?label WHERE { ?nodeID://b12796 <http://www.w3.org/2000/01/rdf-schema#label> ?label. }

I'll make a PR later that strips off the nodeID:// in the federation stage so that this becomes a valid query - as you've pointed out this is expected behavior as in most cases the SPARQL query engine will just convert the blank nodes to variables anyway.

The problem then becomes - running the following code

async function showShape(shape: any) {
  console.log(`This person is ${await shape.label}`);
    for await (const property of shape.property) {
        **console.log("path associated to property", `${await property.label}`)**
    }
}

would produce fairly unintuitive results as calling await property.label will now return all labels in the graph as opposed to the label associated to a particular property in a shape (or it just returns the first one that arrives - I can't remember exactly what LDflex does here) .

In the short term I can just use Named Nodes for each property in my data. Long term it would be good to have this handled in some way - in cases where one is working with an endpoint that uses skolem IRI's, I guess a specialised comunica actor could test and ensure that the

SELECT ?label WHERE { <nodeID://b12796> <http://www.w3.org/2000/01/rdf-schema#label> ?label. }

query is issued. If this is not the case I guess something like

SELECT ?property ?label WHERE { 
<http://example.org/humanWikidataShape> <http://www.w3.org/2000/01/rdf-schema#property>?property .
?property <http://www.w3.org/2000/01/rdf-schema#label> ?label. 
}

would have to be issued by LDflex (similarly to what is sent if you queried await shape.property.label).

Has there been any kind of discussion surrounding how LDflex should handle these kinds of issues with the semantics of blank nodes?

rubensworks commented 3 years ago

the original query

SELECT ?label WHERE {
<urn:comunica_skolem:source_0:nodeID://b12796> <http://www.w3.org/2000/01/rdf-schema#label> ?label.
}

If this query is obtained, then Comunica must be receivingnodeID://b12796 as a blank node instead of an IRI (so _:nodeID://b12796 instead of <nodeID://b12796>). Not sure why this might be happening, but something may be going wrong in one (or more) of:

Has there been any kind of discussion surrounding how LDflex should handle these kinds of issues with the semantics of blank nodes?

The history of this feature can be read here: https://github.com/solid/query-ldflex/issues/34

jeswr commented 3 years ago

I will investigate what is going on and read up on https://github.com/solid/query-ldflex/issues/34 - thanks!

jeswr commented 3 years ago

If this query is obtained, then Comunica must be receiving nodeID://b12796 as a blank node instead of an IRI (so _:nodeID://b12796 instead of <nodeID://b12796>).

Indeed - these are the raw bindings that Virtuoso will send to comunica. That is, Virtuoso identifies them as Blank Nodes.

Hence this piece of code https://github.com/comunica/comunica/blob/bf5e6a2a0807188bd0bab224da1746b9f48c5d33/packages/actor-rdf-resolve-quad-pattern-federated/lib/FederatedQuadSource.ts#L87-L93 ingests the terms as such

Based on this - to me it seems the options to resolve are a) Create a new actor that is capable of overriding this termType component of the binding so that it is returned from Comunica as a named node. b) Still take it as a blank node and strip of the node:ID// the component so that the federeated bnode becomes <urn:comunica_skolem:source_0:b12796> and the offending query

SELECT ?nodeID://b12796 ?label WHERE { ?nodeID://b12796 <http://www.w3.org/2000/01/rdf-schema#label> ?label. }

becomes the valid query

SELECT ?b12796 ?label WHERE { ?b12796 <http://www.w3.org/2000/01/rdf-schema#label> ?label. }

in addition - create an actor that uses the context being passed around to identify that the SPARQL endpoint is from an instance of Virtuoso and sends use the priority mediator to obtain a custom actor (actor-query-operation-quadpattern-virtuoso maybe? - I haven't looked to deeply into which actors/buses are handling this) so that when LDflex issues

SELECT ?label WHERE {
  <urn:comunica_skolem:source_0:b12796> <http://www.w3.org/2000/01/rdf-schema#label> ?label.
}

to an endpoint hosted by Virtuoso it will becomes

SELECT ?label WHERE { <nodeID://b12796> <http://www.w3.org/2000/01/rdf-schema#label> ?label. }
rubensworks commented 3 years ago

Let me just summarize this issue for the sake of future reference:

Summary

LDflex allows data lookups via separate chained SPARQL query executions. However, when this data is connected via blank nodes, things become problematic, as bnodes can only be matched with each other in the scope of one document/result set. For this, skolemization can be applied to convert bnodes to IRIs.

Comunica already takes care of this for document-based sources, but this is not possible to apply for SPARQL endpoints in a generic manner.

For Virtuoso specifically, it may be possible to support this, as Virtuoso allows you to pass blank node labels as IRIs in a determitistic manner (e.g. _:nodeID://b12796 obtained in a result can be passed as <nodeID://b12796> in a next query)


And here's my reply to the comment above:

a) Create a new actor that is capable of overriding this termType component of the binding so that it is returned from Comunica as a named node.

If I understand correctly, you would basically intercept the response from here, and rewrite each bnode result so that it becomes a skolem IRI. If so, this would probably be the least invasive solution, which seems like a good route to me.

We could for example hook into bus-http for this (actor-http-sparql-virtuoso-skolemizer? (a more generic name may be possible as well, in case this approach is followed by other vendors as well)). This actor could work on all requests for 'application/sparql-results+json', and act as a passthrough actor that is always executed first, and just throws the action on the same bus again (a context entry will have to be set to avoid infinite recursion, see memento http actor as an example). If the HTTP response contains Virtuoso in the Server HTTP header, then the bnode's can be transformed into uri.