GDD-Nantes / FedShop

Code for FedShop: The Federated Shop Benchmark
GNU General Public License v3.0

URIs with 'nan' in RSAs #71

Open hartig opened 6 days ago

hartig commented 6 days ago

The VALUES clause of some RSAs contains URIs with ?default-graph-uri=nan. As an example consider /GDD/RSFB/experiments/bsbm/benchmark/evaluation/arq/q04/instance_0/batch_0/attempt_0/service.sparql, which looks as follows.

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT DISTINCT ?product ?label ?propertyTextual WHERE {
    VALUES ( ?bgp1 ?bgp2 ) { ( <http://localhost:34205/sparql/?default-graph-uri=http://www.ratingsite2.fr/> <http://localhost:34205/sparql/?default-graph-uri=nan> ) ( <http://localhost:34205/sparql/?default-graph-uri=nan> <http://localhost:34205/sparql/?default-graph-uri=http://www.ratingsite2.fr/> ) }
    {
        SERVICE ?bgp1 { 
            ?product rdfs:label ?label .
            # const!* bsbm:ProductType630
            ?product rdf:type ?localProductType .
            ?localProductType owl:sameAs bsbm:ProductType630 .
            # const!* bsbm:ProductFeature19851
            ?product bsbm:productFeature ?localProductFeature1 .
            ?localProductFeature1 owl:sameAs bsbm:ProductFeature19851.
            # const** bsbm:ProductFeature19019 != bsbm:ProductFeature19851
            ?product bsbm:productFeature ?localProductFeature2 .
            ?localProductFeature2 owl:sameAs bsbm:ProductFeature19019.
            ?product bsbm:productPropertyTextual1 ?propertyTextual .
            ?product bsbm:productPropertyNumeric1 ?p1 .
            # const** "901.0"^^xsd:double < ?p1
            FILTER ( ?p1 > "901.0"^^xsd:double )
        } 
    } UNION {
        SERVICE ?bgp2 {
            ?product rdfs:label ?label .
            # const!* bsbm:ProductType630
            ?product rdf:type ?localProductType .
            ?localProductType owl:sameAs bsbm:ProductType630 .
            # const!* bsbm:ProductFeature19851
            ?product bsbm:productFeature ?localProductFeature1 .
            ?localProductFeature1 owl:sameAs bsbm:ProductFeature19851 .
            # const* bsbm:ProductFeature25702 != bsbm:ProductFeature19019, bsbm:ProductFeature19851
            ?product bsbm:productFeature ?localProductFeature3 .
            ?localProductFeature3 owl:sameAs bsbm:ProductFeature25702 .
            ?product bsbm:productPropertyTextual1 ?propertyTextual .
            ?product bsbm:productPropertyNumeric2 ?p2 .
            # const "519.0"^^xsd:double < ?p2
            FILTER ( ?p2 > "519.0"^^xsd:double ) 
        } 
    }
}
ORDER BY ?product ?label ?propertyTextual
##OFFSET 5
LIMIT 10

I have learned from @Chat-Wane that these are cases in which the corresponding variable is meant to be unbound and the corresponding SERVICE clause is meant to be dropped. I suggest that you capture these cases not with these nan-based URIs but with the SPARQL keyword UNDEF. For the example above, this would mean that the VALUES clause should look as follows.

VALUES ( ?bgp1 ?bgp2 ) { ( <http://localhost:34205/sparql/?default-graph-uri=http://www.ratingsite2.fr/> UNDEF ) ( UNDEF <http://localhost:34205/sparql/?default-graph-uri=http://www.ratingsite2.fr/> ) }
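
To illustrate the suggestion, here is a minimal Python sketch of how such a VALUES clause could be generated. It assumes that the provenance result is available as CSV with one column per variable and an empty cell for each unbound variable. The function name `values_clause` and the `ENDPOINT` constant are hypothetical and are not taken from the FedShop code base.

```python
import csv
import io

# Assumed endpoint prefix, copied from the example query above.
ENDPOINT = "http://localhost:34205/sparql/?default-graph-uri="


def values_clause(provenance_csv: str, endpoint: str = ENDPOINT) -> str:
    """Build a VALUES clause from a provenance result given as CSV text.

    An empty cell means the variable is unbound in that solution mapping,
    so it is written as UNDEF instead of being turned into a URI.
    """
    reader = csv.reader(io.StringIO(provenance_csv))
    header = next(reader)  # first CSV line holds the variable names
    variables = " ".join(f"?{name}" for name in header)
    rows = []
    for record in reader:
        terms = [f"<{endpoint}{graph}>" if graph else "UNDEF" for graph in record]
        rows.append("( " + " ".join(terms) + " )")
    return f"VALUES ( {variables} ) {{ " + " ".join(rows) + " }"


# Hypothetical provenance result: each row binds only one of the two variables.
sample = "bgp1,bgp2\nhttp://www.ratingsite2.fr/,\n,http://www.ratingsite2.fr/\n"
print(values_clause(sample))
```

For the sample input, the sketch prints the same VALUES clause as the one proposed above. The key design choice is that an unbound value is checked before any string formatting happens, so it can never end up inside a URI.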

Another related observation, specific to the query above: why does the VALUES clause contain two solution mappings? I think that, in this specific case, the two UNDEFs can be removed and the two solution mappings merged into one:

VALUES ( ?bgp1 ?bgp2 ) { ( <http://localhost:34205/sparql/?default-graph-uri=http://www.ratingsite2.fr/> <http://localhost:34205/sparql/?default-graph-uri=http://www.ratingsite2.fr/> ) }

Don't you agree?

mhoangvslev commented 5 days ago

> I suggest that you capture these cases not by these nan-based URIs but, instead, by the SPARQL keyword UNDEF. For the example above, this would mean that the VALUES clause should look as follows.

I was aware of UNDEF, but it does not work: Jena fails with the error `Service URI not bound: ?bgp2` when executing the following query:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT DISTINCT ?product ?label ?propertyTextual WHERE {
    VALUES ( ?bgp1 ?bgp2 ) { ( <http://host.docker.internal:34201/sparql/?default-graph-uri=http://www.ratingsite2.fr/> UNDEF ) ( UNDEF <http://host.docker.internal:34201/sparql/?default-graph-uri=http://www.ratingsite2.fr/> ) }
    {
        SERVICE ?bgp1 { 
            ?product rdfs:label ?label .
            # const!* bsbm:ProductType630
            ?product rdf:type ?localProductType .
            ?localProductType owl:sameAs bsbm:ProductType630 .
            # const!* bsbm:ProductFeature19851
            ?product bsbm:productFeature ?localProductFeature1 .
            ?localProductFeature1 owl:sameAs bsbm:ProductFeature19851.
            # const** bsbm:ProductFeature19019 != bsbm:ProductFeature19851
            ?product bsbm:productFeature ?localProductFeature2 .
            ?localProductFeature2 owl:sameAs bsbm:ProductFeature19019.
            ?product bsbm:productPropertyTextual1 ?propertyTextual .
            ?product bsbm:productPropertyNumeric1 ?p1 .
            # const** "901.0"^^xsd:double < ?p1
            FILTER ( ?p1 > "901.0"^^xsd:double )
        } 
    } UNION {
        SERVICE ?bgp2 {
            ?product rdfs:label ?label .
            # const!* bsbm:ProductType630
            ?product rdf:type ?localProductType .
            ?localProductType owl:sameAs bsbm:ProductType630 .
            # const!* bsbm:ProductFeature19851
            ?product bsbm:productFeature ?localProductFeature1 .
            ?localProductFeature1 owl:sameAs bsbm:ProductFeature19851 .
            # const* bsbm:ProductFeature25702 != bsbm:ProductFeature19019, bsbm:ProductFeature19851
            ?product bsbm:productFeature ?localProductFeature3 .
            ?localProductFeature3 owl:sameAs bsbm:ProductFeature25702 .
            ?product bsbm:productPropertyTextual1 ?propertyTextual .
            ?product bsbm:productPropertyNumeric2 ?p2 .
            # const "519.0"^^xsd:double < ?p2
            FILTER ( ?p2 > "519.0"^^xsd:double ) 
        } 
    }
}
ORDER BY ?product ?label ?propertyTextual
##OFFSET 5
LIMIT 10

> Another, related observation, specifically to the query above: Why does the VALUES clause contain two solution mappings? I think that, in this specific case, the two UNDEFs can be removed and the two solution mappings merged into one [...]

- Given the provenance query below:

SELECT DISTINCT ?bgp1 ?bgp2 WHERE {
    {
        GRAPH ?bgp1 {
            ?product rdfs:label ?label .
            # const!* bsbm:ProductType630
            ?product rdf:type ?localProductType .
            ?localProductType owl:sameAs bsbm:ProductType630 .
            # const!* bsbm:ProductFeature19851
            ?product bsbm:productFeature ?localProductFeature1 .
            ?localProductFeature1 owl:sameAs bsbm:ProductFeature19851.
            # const** bsbm:ProductFeature19019 != bsbm:ProductFeature19851
            ?product bsbm:productFeature ?localProductFeature2 .
            ?localProductFeature2 owl:sameAs bsbm:ProductFeature19019.
            ?product bsbm:productPropertyTextual1 ?propertyTextual .
            ?product bsbm:productPropertyNumeric1 ?p1 .
            # const** "901.0"^^xsd:double < ?p1
            FILTER ( ?p1 > "901.0"^^xsd:double )
        }
    } UNION {
        GRAPH ?bgp2 {
            ?product rdfs:label ?label .
            # const!* bsbm:ProductType630
            ?product rdf:type ?localProductType .
            ?localProductType owl:sameAs bsbm:ProductType630 .
            # const!* bsbm:ProductFeature19851
            ?product bsbm:productFeature ?localProductFeature1 .
            ?localProductFeature1 owl:sameAs bsbm:ProductFeature19851 .
            # const* bsbm:ProductFeature25702 != bsbm:ProductFeature19019, bsbm:ProductFeature19851
            ?product bsbm:productFeature ?localProductFeature3 .
            ?localProductFeature3 owl:sameAs bsbm:ProductFeature25702 .
            ?product bsbm:productPropertyTextual1 ?propertyTextual .
            ?product bsbm:productPropertyNumeric2 ?p2 .
            # const "519.0"^^xsd:double < ?p2
            FILTER ( ?p2 > "519.0"^^xsd:double )
        }
    }
}
ORDER BY ?bgp1 ?bgp2
##OFFSET 5
LIMIT 10


- Executing this provenance query gives the following CSV:
```csv
bgp1,bgp2
http://www.ratingsite2.fr/,
,http://www.ratingsite2.fr/
```

mhoangvslev commented 4 days ago

Update:

> I was aware of UNDEF but it would not work as Jena will return Service URI not bound: ?bgp2 while executing the following query [...]

Still doesn't work with Jena 5.0.0

I also made my case here: https://github.com/apache/jena/issues/2556

mhoangvslev commented 4 days ago

The solution suggested in the Jena issue is to use SERVICE SILENT paired with UNDEF. However, doing so would result in extra tuples every time UNDEF is paired with SERVICE. Which do you prefer, @Chat-Wane, @hartig?

afs commented 4 days ago

> However, doing so would result in extra tuples every time UNDEF is paired with SERVICE.

Filter out using bound(?var).

hartig commented 4 days ago

Thanks @mhoangvslev for taking this up, and @afs for commenting!

@mhoangvslev

> However, doing so would result in extra tuples every time UNDEF is paired with SERVICE.

Can you elaborate? I don't see this - at least not for the RSAs of FedShop. I mean, I see two cases:

The first case is a single SERVICE clause and, thus, a single-column table in the corresponding VALUES clause of the VALUES-based representation of RSAs. In this case, the only option for an extra tuple to occur is if there is an UNDEF in this single column of the VALUES clause, right? But then, the only option for such an UNDEF to occur is if one of the solution mappings produced by the corresponding provenance query would be the empty mapping, and that should never happen I think.

The second case is any query with multiple SERVICE clauses and, thus, a multi-column table in the corresponding VALUES clause. In this case, the only option for an extra tuple to occur is if one of the rows of the VALUES table contains UNDEF in every column. For this to be the case, the corresponding provenance query should also have produced at least one empty solution mapping, which again I think should never happen.

Am I overlooking something (or do you mean something completely different when you say "extra tuple")?

mhoangvslev commented 4 days ago

@hartig I believe you are referring to the tuples inside the VALUES clause, while I am referring to the results after RSA execution.

If you use UNDEF + SERVICE SILENT, Jena raises a warning internally instead of throwing an exception and then produces an empty result row (or tuple).

@afs suggests adding FILTER(BOUND(...)), but that might interfere with the original intention of the query, for example when the query already filters on bound for an optional variable, or when the added filter conflicts with a filter on not bound for the same variable.

afs commented 3 days ago

> However, doing so would result in extra tuples every time UNDEF is paired with SERVICE.

> but it might interfere with the original intention of the query

The two options for ignoring a SERVICE request are no rows or one row.

If you want no rows, put FILTER (bound()) in the right place (immediately after the SERVICE, both inside a single { }) and it will remove any row where there was UNDEF to SERVICE SILENT. No other changes to the query.
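
The "one row" versus "no rows" behaviour can be illustrated with a toy Python model. This is only a sketch and not how Jena evaluates queries: it assumes that SERVICE SILENT with an unbound endpoint passes the incoming row through unchanged, and all names (`run_rsa`, `endpoints`, `values_rows`) are hypothetical.

```python
from typing import Dict, List, Optional

Mapping = Dict[str, Optional[str]]


def run_rsa(values_rows: List[Mapping],
            endpoints: Dict[str, List[Mapping]],
            filter_bound: bool) -> List[Mapping]:
    """Toy model of VALUES feeding SERVICE SILENT ?ep, optionally with FILTER(bound(?ep))."""
    results: List[Mapping] = []
    for row in values_rows:
        ep = row["ep"]
        if ep is None:
            # Assumption: with an unbound endpoint the SERVICE request is ignored
            # and the incoming row passes through unchanged (the "one row" option).
            joined = [dict(row)]
        else:
            # Join the row with every mapping the endpoint returns.
            joined = [{**row, **m} for m in endpoints.get(ep, [])]
        if filter_bound:
            # Filter placed right after the SERVICE: the "no rows" option.
            joined = [m for m in joined if m["ep"] is not None]
        results.extend(joined)
    return results


# Hypothetical data: two endpoints with one solution each, plus one UNDEF row.
endpoints = {
    "ex:1": [{"s": "ex:a", "o": "A"}],
    "ex:2": [{"s": "ex:b", "o": "B"}],
}
values_rows = [{"ep": "ex:1"}, {"ep": None}, {"ep": "ex:2"}]

print(len(run_rsa(values_rows, endpoints, filter_bound=False)))
print(len(run_rsa(values_rows, endpoints, filter_bound=True)))
```

In this model the UNDEF row survives as one extra row without the filter and disappears with it, while the rows that come from real endpoints are unaffected.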

hartig commented 3 days ago

No, I am indeed referring to the result produced by executing an RSA with UNDEF in the VALUES clause.

Here is a concrete example for the first case in my previous comment: Consider the following RSA.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex: <http://example.org/endpoint/>

SELECT ?s ?o WHERE {
    VALUES ?ep { ex:1 ex:2 ex:3 }
    SERVICE SILENT ?ep { ?s rdfs:label ?o }
}

The result of this RSA should consist of at least three solution mappings (at least one from each of the three endpoints, perhaps more), but no such "extra tuple" that you mention. Now, to get such an "extra tuple", there would have to be an UNDEF in the VALUES clause of the RSA. For instance:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex: <http://example.org/endpoint/>

SELECT ?s ?o WHERE {
    VALUES ?ep { ex:1 ex:2 UNDEF ex:3 }
    SERVICE SILENT ?ep { ?s rdfs:label ?o }
}

The result of this RSA should consist of all the solution mappings that are in the result of the previous RSA plus one additional solution mapping that is the empty mapping. Do you agree up to this point?

My argument now is that such an RSA with UNDEF would never be produced for FedShop, because for it to be produced, the corresponding provenance query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex: <http://example.org/endpoint/>

SELECT DISTINCT ?ep WHERE {
    GRAPH ?ep { ?s rdfs:label ?o }
}

..would have to return the following result (illustrated as a one-column table):

+------+
|  ?ep |
+------+
| ex:1 |
| ex:2 |
|      |
| ex:3 |
+------+

..but this is not a correct result for the provenance query---there is no way the third row with the empty mapping would ever occur in such a result.

So, my conclusion is that, for the FedShop RSAs, there won't be such "extra tuples" when using UNDEF and changing SERVICE to SERVICE SILENT (and there is no need for FILTER(BOUND(..))).

mhoangvslev commented 3 days ago

@hartig Thank you for the clarification. I agree with everything you say. What bothers me now is why Virtuoso returns an empty mapping while executing the provenance query 🧐.

hartig commented 3 days ago

Can you post an example of such a provenance query for which Virtuoso returns an empty mapping?

mhoangvslev commented 1 day ago

- Given the provenance query below:

SELECT DISTINCT ?bgp1 ?bgp2 WHERE {
    {
        GRAPH ?bgp1 {
            ?product rdfs:label ?label .
            # const!* bsbm:ProductType630
            ?product rdf:type ?localProductType .
            ?localProductType owl:sameAs bsbm:ProductType630 .
            # const!* bsbm:ProductFeature19851
            ?product bsbm:productFeature ?localProductFeature1 .
            ?localProductFeature1 owl:sameAs bsbm:ProductFeature19851.
            # const** bsbm:ProductFeature19019 != bsbm:ProductFeature19851
            ?product bsbm:productFeature ?localProductFeature2 .
            ?localProductFeature2 owl:sameAs bsbm:ProductFeature19019.
            ?product bsbm:productPropertyTextual1 ?propertyTextual .
            ?product bsbm:productPropertyNumeric1 ?p1 .
            # const** "901.0"^^xsd:double < ?p1
            FILTER ( ?p1 > "901.0"^^xsd:double )
        }
    } UNION {
        GRAPH ?bgp2 {
            ?product rdfs:label ?label .
            # const!* bsbm:ProductType630
            ?product rdf:type ?localProductType .
            ?localProductType owl:sameAs bsbm:ProductType630 .
            # const!* bsbm:ProductFeature19851
            ?product bsbm:productFeature ?localProductFeature1 .
            ?localProductFeature1 owl:sameAs bsbm:ProductFeature19851 .
            # const* bsbm:ProductFeature25702 != bsbm:ProductFeature19019, bsbm:ProductFeature19851
            ?product bsbm:productFeature ?localProductFeature3 .
            ?localProductFeature3 owl:sameAs bsbm:ProductFeature25702 .
            ?product bsbm:productPropertyTextual1 ?propertyTextual .
            ?product bsbm:productPropertyNumeric2 ?p2 .
            # const "519.0"^^xsd:double < ?p2
            FILTER ( ?p2 > "519.0"^^xsd:double )
        }
    }
}
ORDER BY ?bgp1 ?bgp2
##OFFSET 5
LIMIT 10

- Executing this provenance query on Virtuoso gives the following results:

| bgp1                       | bgp2                       |
|----------------------------|----------------------------|
| http://www.ratingsite2.fr/ |                            |
|                            | http://www.ratingsite2.fr/ |

- From the [specs](https://www.w3.org/TR/rdf-sparql-query/#alternatives):
> This will return results with the variable `?bgp1` bound for solutions from the left branch of the UNION, and `?bgp2` bound for the solutions from the right branch. If neither part of the UNION pattern matched, then the graph pattern would not match.

My understanding: as long as the two sides of `UNION` use two different projection variables, you will always get rows with a solution for the left side only and rows with a solution for the right side only (see the table and the spec). If true, then there will always be `UNDEF` in the `VALUES` clause.
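
This observation can be made concrete with a small Python sketch. It is a toy model, not Virtuoso's evaluation: the function name `union_provenance` is hypothetical, and each branch of the `UNION` is represented only by the list of graphs in which it matches.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[str], Optional[str]]  # (?bgp1, ?bgp2)


def union_provenance(left_graphs: List[str], right_graphs: List[str]) -> List[Row]:
    """Toy model of SELECT DISTINCT ?bgp1 ?bgp2 over a two-branch UNION."""
    # Solutions from the left branch bind only ?bgp1.
    left: List[Row] = [(g, None) for g in left_graphs]
    # Solutions from the right branch bind only ?bgp2.
    right: List[Row] = [(None, g) for g in right_graphs]
    # DISTINCT: drop duplicate rows while keeping their order.
    return list(dict.fromkeys(left + right))


rows = union_provenance(["http://www.ratingsite2.fr/"], ["http://www.ratingsite2.fr/"])
for row in rows:
    print(row)
```

Every row produced this way has exactly one unbound variable, which matches the table above. No row can have both variables unbound, because each row originates from a branch that matched.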

So I think we can settle on @afs's solution, in one of two ways:
1. `FILTER(BOUND(...))` on projection variables after `SERVICE`.
2. Move this logic to FedShop, i.e., drop the empty rows while comparing results between engines.

The first approach is more correct from a SPARQL point of view but can only be done in FedShop v2, where there is an actual query parser/rewriter.

hartig commented 1 day ago

> @mhoangvslev: [...] What bothers me now is why Virtuoso returns an empty mapping while executing the provenance query 🧐.
>
> @hartig: Can you post an example of such a provenance query for which Virtuoso returns an empty mapping.
>
> @mhoangvslev: Given the provenance query below: [...] Executing this provenance query on Virtuoso gives the following results:
>
> | bgp1                       | bgp2                       |
> |----------------------------|----------------------------|
> | http://www.ratingsite2.fr/ |                            |
> |                            | http://www.ratingsite2.fr/ |

But this query result does not contain the empty mapping. So, the query that you provide now is actually not one of these provenance queries that you mention in your quote above, and that I asked you to provide an example of. Do you have a provenance query for which "Virtuoso returns an empty mapping" as you say above?

> My understanding: as long as the two sides of UNION use two different projection variables, you will always get rows with a solution for the left side only and rows with a solution for the right side only (see the table and the spec). If true, then there will always be UNDEF in the VALUES clause.

Yes, this observation is correct.

> So I think we can settle on @afs's solution, in one of two ways:
>
>   1. FILTER(BOUND(...)) on projection variables after SERVICE.
>   2. Move this logic to FedShop, i.e., drop the empty rows while comparing results between engines.
>
> The first approach is more correct from a SPARQL point of view but can only be done in FedShop v2, where there is an actual query parser/rewriter.

Out of these two options, the first one is the only correct one in my opinion, because the RSA queries should be self-contained. That is, it should be possible to execute them also without the FedShop tooling and still get the correct result for each of them.

mhoangvslev commented 1 day ago

Correction: there is no "empty solution mapping". What I meant was a partial solution mapping, or whatever one calls the rows with an unbound variable that can be seen in the table.

hartig commented 1 day ago

Okay, then we are in agreement I think. The TODO is that all the RSA queries that currently have the nan-based URI in their respective VALUES clause need to be changed such that, for each of these queries: