eclipse-rdf4j / rdf4j

Eclipse RDF4J: scalable RDF for Java
https://rdf4j.org/
BSD 3-Clause "New" or "Revised" License

Support type promotion for supported datatypes in BGP matching #3815

Open abrokenjester opened 2 years ago

abrokenjester commented 2 years ago

Problem description

Given the following data

:foo :bar "5"^^xsd:int .

and the following sparql query:

select * where { ?s ?p 5 }

We will get an empty result (the cause being that the '5' in the query is a literal of type xsd:integer, not xsd:int, so it is not a direct match).

However, when we instead query:

select * where { ?s ?p ?v . filter(?v = 5) }

We do get a result, because the comparison operator applies type promotion as defined in https://www.w3.org/TR/xpath-functions/#comp.numeric .
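To make the difference between the two queries concrete, here is a minimal, self-contained sketch. It does not use the RDF4J API: `TypedLiteral`, `sameTerm` and `valueEquals` are hypothetical names for illustration only, and the list of numeric datatypes is a deliberately small subset (no xsd:float or xsd:double). `sameTerm` mimics what BGP matching does (lexical form and datatype must both be identical), while `valueEquals` roughly mimics the `=` operator in a filter by comparing numeric values. Real SPARQL raises a type error for some incomparable literals, which this sketch simplifies away.

```java
import java.math.BigDecimal;
import java.util.Set;

public class LiteralMatchSketch {

    static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    // Assumption: a small subset of numeric datatypes, enough for the example.
    static final Set<String> NUMERIC_TYPES = Set.of(
            XSD + "integer", XSD + "long", XSD + "int",
            XSD + "short", XSD + "byte", XSD + "decimal");

    /** Toy typed literal (hypothetical; not the RDF4J Literal interface). */
    static final class TypedLiteral {
        final String lexical;
        final String datatype;

        TypedLiteral(String lexical, String datatype) {
            this.lexical = lexical;
            this.datatype = datatype;
        }
    }

    /** Term equality, as in BGP matching: lexical form and datatype must both match. */
    static boolean sameTerm(TypedLiteral a, TypedLiteral b) {
        return a.lexical.equals(b.lexical) && a.datatype.equals(b.datatype);
    }

    /** Value equality, roughly what '=' does in a FILTER for numeric literals. */
    static boolean valueEquals(TypedLiteral a, TypedLiteral b) {
        if (NUMERIC_TYPES.contains(a.datatype) && NUMERIC_TYPES.contains(b.datatype)) {
            // Promote both operands to a common numeric representation and compare values.
            return new BigDecimal(a.lexical).compareTo(new BigDecimal(b.lexical)) == 0;
        }
        // Simplification: fall back to term equality for everything else.
        return sameTerm(a, b);
    }

    public static void main(String[] args) {
        TypedLiteral stored = new TypedLiteral("5", XSD + "int");      // the value in the data
        TypedLiteral queried = new TypedLiteral("5", XSD + "integer"); // the bare 5 in the query

        System.out.println(sameTerm(stored, queried));    // prints false: the BGP does not match
        System.out.println(valueEquals(stored, queried)); // prints true: the filter does match
    }
}
```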

Although this is, strictly speaking, correct with regard to the SPARQL 1.1 specs, it is unintuitive. We should consider adding the same type promotion logic we use for comparison operators to BGP matching when typed literals are involved.

Preferred solution

An internal query (algebra) rewrite could be applied to change any StatementPattern involving a typed literal of a recognized datatype into a StatementPattern with a filter condition.

Pending vetting against the full test suite, we can support this as the default behavior by building it straight into the TupleExprBuilder, at the point where the parser AST is translated into an algebra tree. In this case we can support "turning it off" by having a query optimizer check a config param and, if it is set to disable, optimize back into a straight match.
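As a rough illustration of the rewrite and its on/off switch, the sketch below works on SPARQL text rather than on the actual algebra tree, purely to keep it self-contained. The class name, the `promotionEnabled` flag, the `?_promoted` variable name and the list of promotable datatypes are all assumptions made for this example; a real implementation would operate on algebra nodes and would need to generate a variable name that cannot clash with user variables.

```java
import java.util.Set;

public class PatternRewriteSketch {

    static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    // Assumption: the "recognized" datatypes for which promotion is attempted.
    static final Set<String> PROMOTABLE = Set.of(
            XSD + "integer", XSD + "long", XSD + "int",
            XSD + "short", XSD + "byte", XSD + "decimal");

    /**
     * Renders the pattern "?s ?p <literal>" as a SPARQL fragment.
     * With promotion enabled and a recognized datatype, the literal is replaced
     * by a fresh variable plus a filter that compares by value.
     */
    static String rewrite(String lexical, String datatype, boolean promotionEnabled) {
        String literal = "\"" + lexical + "\"^^<" + datatype + ">";
        if (!promotionEnabled || !PROMOTABLE.contains(datatype)) {
            // Straight match: the object must be exactly this term.
            return "?s ?p " + literal + " .";
        }
        // Hypothetical fresh variable name; the filter's '=' applies type promotion.
        return "?s ?p ?_promoted . FILTER(?_promoted = " + literal + ")";
    }

    public static void main(String[] args) {
        System.out.println(rewrite("5", XSD + "integer", true));
        System.out.println(rewrite("5", XSD + "integer", false));
    }
}
```

Note that the rewritten form is broader than a datatype-only relaxation: the filter matches any numerically equal value, so it would also accept, for example, "5.0"^^xsd:decimal.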

Are you interested in contributing a solution yourself?

Yes

Alternatives you've considered

Instead of the preferred approach, we could leave the TupleExprBuilder as-is and build a query optimizer that checks a config param and, if it is set to enable, rewrites the algebra to use a filter/comparison operator (with type promotion). This has the advantage of leaving the default behavior as-is, but a downside is that it conceptually stretches what we mean by "optimization". It also does not fully address the fact that the current (default) behavior is a little inconsistent.

A third option is to support this by means of a stacked sail. This has the advantage of not needing a new config parameter, but the downside of not easily making it applicable to existing data stores.

Anything else?

This draft proposal is an outcome of this discussion thread: https://github.com/eclipse/rdf4j/discussions/3804

abrokenjester commented 2 years ago

Fwiw, which solution we end up with is heavily dependent on how this affects compliance (I don't see any problems, but there may be some test cases in our suite that contradict my reading of the specs), as well as on any measurements we can do of the performance impact.

abrokenjester commented 2 years ago

Another option is to build this as part of the ExtendedEvaluationStrategy.

seralf commented 2 years ago

Hi, what if we see this as a form of inference? (The configuration would then apply to that inferencer, moving away from the concept of optimization to another metaphor.)

abrokenjester commented 2 years ago

@seralf good point. I could imagine that we could support something similar to this by offering an inferencer specific for datatype entailment - or at least a restricted form of datatype entailment. I doubt this would be a forward-chaining inferencer (it would blow up the materialized set of inferred statements too much), but it's possible for an inferencer to work as a query rewriter - tweaking the algebra tree.

After further back-and-forth on the discussion thread, I don't think we want to pursue the original idea of implementing this as part of the core query engine or the extended strategy. The asked-for behavior seems not to be spec-compliant for simple/RDF entailment, and while the original request reported that, for example, Jena supports this, it turns out that it only does so in one particular (and now legacy) store implementation, and it is certainly not standard behavior. But a D-entailment inferencer could probably offer this kind of thing without breaking the specs. It also makes it more of an explicit configuration choice on the user's part.