RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Ordering differs between strings and URIs? #1466

Open KonradHoeffner opened 2 years ago

KonradHoeffner commented 2 years ago

The following query produces the same ordering whether I sort by ?suffix or by ?uri on the Virtuoso SPARQL endpoint http://hitontology.eu/sparql:

SELECT (REPLACE(STR(?uri),"http://dbpedia.org/resource/","") AS ?suffix) (STR(SAMPLE(?label)) AS ?label)
{
 ?uri a hito:OperatingSystem ;
      rdfs:label ?label.    
 FILTER(LANGMATCHES(LANG(?label),"en")||LANGMATCHES(LANG(?label),""))
}
GROUP BY ?uri
Case 1: ORDER BY ASC(?uri)
Case 2: ORDER BY ASC(?suffix)

However, when I run the same query with RDFLib and order by ?uri, I get a different result:

rows = graph.query(query)
print(list(rows))
[(rdflib.term.Literal('AmigaOS_4'), rdflib.term.Literal('AmigaOS 4')), (rdflib.term.Literal('AmigaOS'), rdflib.term.Literal('AmigaOS')), (rdflib.term.Literal('Android_Ice_Cream_Sandwich'), rdflib.term.Literal('Android Ice Cream Sandwich')),...

In the result above, "AmigaOS_4" is sorted before "AmigaOS", although it should come afterwards.

Strangely, this problem goes away when I add ?uri to the SELECT clause:

SELECT ?uri (REPLACE(STR(?uri),"http://dbpedia.org/resource/","") AS ?suffix) (STR(SAMPLE(?label)) AS ?label)
{
 ?uri a hito:OperatingSystem ;
      rdfs:label ?label.
 FILTER(LANGMATCHES(LANG(?label),"en")||LANGMATCHES(LANG(?label),""))
}
GROUP BY ?uri
ORDER BY ASC(?uri)
[(rdflib.term.URIRef('http://dbpedia.org/resource/AmigaOS'), rdflib.term.Literal('AmigaOS'), rdflib.term.Literal('AmigaOS')), (rdflib.term.URIRef('http://dbpedia.org/resource/AmigaOS_4'), rdflib.term.Literal('AmigaOS_4'), rdflib.term.Literal('AmigaOS 4')), (rdflib.term.URIRef('http://dbpedia.org/resource/Android_(operating_system)'),...

Where does this problem originate? Does RDFLib treat "rows" as a set with undefined ordering and then apply some internal sorting mechanism when the result is converted to a list?
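Until the cause is known, one possible stopgap is to sort the rows on the client side. The sketch below is only an illustration. `sort_rows` is a hypothetical helper, not part of RDFLib, and the plain tuples stand in for the rows returned by `graph.query()`. It assumes each row can be indexed like a tuple and that `str()` of a term yields its lexical form.

```python
def sort_rows(rows, column=0):
    """Return the rows as a list ordered by the string form of one column.

    Hypothetical helper: it assumes rows are tuple-like and that str()
    on a cell gives the text to compare.
    """
    return sorted(rows, key=lambda row: str(row[column]))


# Plain tuples standing in for the (suffix, label) rows shown in this issue.
rows = [
    ("AmigaOS_4", "AmigaOS 4"),
    ("AmigaOS", "AmigaOS"),
    ("Android_Ice_Cream_Sandwich", "Android Ice Cream Sandwich"),
]

ordered = sort_rows(rows)
print([row[0] for row in ordered])
```

This only works around the symptom; it does not explain why the order differs when ?uri is left out of the SELECT clause.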

According to https://www.w3.org/TR/sparql11-query/#modOrderBy:

The "<" operator (see the Operator Mapping and 17.3.1 Operator Extensibility) defines the relative order of pairs of numerics, simple literals, xsd:strings, xsd:booleans and xsd:dateTimes. Pairs of IRIs are ordered by comparing them as simple literals.
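To illustrate why both cases should give the same order under that rule, here is a small sketch in plain Python with no RDFLib involved. It assumes that comparing IRIs "as simple literals" amounts to a code-point string comparison, which is what Python's `str` comparison does. The suffixes are the ones quoted in the results above.

```python
PREFIX = "http://dbpedia.org/resource/"

# Suffixes taken from the results quoted in this issue.
suffixes = [
    "AmigaOS_4",
    "AmigaOS",
    "Android_Ice_Cream_Sandwich",
    "Android_(operating_system)",
]
iris = [PREFIX + suffix for suffix in suffixes]

# Case 1: order by the full IRI string, then strip the prefix for display.
by_iri = [iri[len(PREFIX):] for iri in sorted(iris)]

# Case 2: order by the suffix directly.
by_suffix = sorted(suffixes)

print(by_iri)
print(by_suffix)
```

Because every IRI shares the same prefix, both lists come out in the same order, with "AmigaOS" ahead of "AmigaOS_4".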

aishaaijazahmad commented 2 years ago

In the result above, "AmigaOS_4" is sorted before "AmigaOS", although it should come afterwards

Perhaps this is because of the following comment in the _QB.py file:

order: URIRef  
# indicates a priority order for the components of sets with this structure, used to guide presentations - lower order numbers come before higher numbers, un-numbered components come last

See algebra.py for more.