RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Invalid serialization of xsd:decimal to scientific notation #1314

Closed wagonhelm closed 3 years ago

wagonhelm commented 3 years ago

I'm having an issue where a SPARQL CONSTRUCT query returns an rdflib graph in which small numbers are cast to literals in scientific notation.

from SPARQLWrapper import SPARQLWrapper, RDFXML
from rdflib.term import URIRef, Literal
import sys

endpoint_url = "https://query.wikidata.org/sparql"
user_agent = "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
sparql.setReturnFormat(RDFXML)
query = """CONSTRUCT { ?s ?p ?o } WHERE { VALUES ?s { wd:Q184832 } ?s ?p ?o }"""
sparql.setQuery(query)
results = sparql.query().convert()

triples = set(results.triples((None, URIRef("http://www.wikidata.org/prop/direct/P2201"), None)))

for triple in triples:
    (s, p, o) = triple
    print("triple = %s", triple)
    print("str(o) = %s", str(o))
    print("o.value = %s/%s", type(o.value), o.value)
    print("o.n3() = %s", o.n3())

and its output:

triple = %s (rdflib.term.URIRef('http://www.wikidata.org/entity/Q184832'), rdflib.term.URIRef('http://www.wikidata.org/prop/direct/P2201'), rdflib.term.Literal('1E-31', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#decimal')))
str(o) = %s 1E-31
o.value = %s/%s <class 'decimal.Decimal'> 1E-31
o.n3() = %s "1E-31"^^<http://www.w3.org/2001/XMLSchema#decimal>

I believe o.value should be class 'float'

If you serialize it like this:

results.serialize(destination="wikidump-000000001.ttl", format="turtle")

with open("wikidump-000000001.ttl","r") as fp:
    for line in fp:
        if "wdt:P2201" in line:
            print(line)

It outputs:

    wdt:P2201 1E-31.0 ;

aucampia commented 3 years ago

Ideally questions like this should be asked on Stack Overflow, unless they are about a bug. In this case you can change the serialization format by using rdflib.term.bind as follows:

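        import decimal
        import logging
        import sys

        import rdflib.term
        from rdflib.namespace import XSD
        from rdflib.term import URIRef
        from SPARQLWrapper import SPARQLWrapper, RDFXML

        endpoint_url = "https://query.wikidata.org/sparql"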
        user_agent = "LINCS-https://lincsproject.ca//%s.%s" % (
            sys.version_info[0],
            sys.version_info[1],
        )
        sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
        sparql.setReturnFormat(RDFXML)
        query = """CONSTRUCT { ?s ?p ?o } WHERE { VALUES ?s { wd:Q184832 } ?s ?p ?o }"""
        sparql.setQuery(query)

        rdflib.term.bind(
            XSD.decimal,
            decimal.Decimal,
            constructor=decimal.Decimal,
            lexicalizer=lambda val: f"{val:f}",
            datatype_specific=True,
        )

        results = sparql.query().convert()
        triples = set(
            results.triples(
                (None, URIRef("http://www.wikidata.org/prop/direct/P2201"), None)
            )
        )
        for triple in triples:
            (s, p, o) = triple
            logging.info("triple = %s", triple)
            logging.info("str(o) = %s", str(o))
            logging.info("o.value = %s/%s", type(o.value), o.value)
            logging.info("o.n3() = %s", o.n3())

Note that bind is called before convert(), which is where the response is decoded into a graph.

Full working example here: https://gitlab.com/aucampia/contrib/rdflib/-/blob/master/tests/test_issues.py#L13

Output:

datatype 'http://www.w3.org/2001/XMLSchema#decimal' was already bound. Rebinding.
triple = (rdflib.term.URIRef('http://www.wikidata.org/entity/Q184832'), rdflib.term.URIRef('http://www.wikidata.org/prop/direct/P2201'), rdflib.term.Literal('0.0000000000000000000000000000001', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#decimal')))
str(o) = 0.0000000000000000000000000000001
o.value = <class 'decimal.Decimal'>/1E-31
o.n3() = "0.0000000000000000000000000000001"^^<http://www.w3.org/2001/XMLSchema#decimal>

Please remember to close the issue

wagonhelm commented 3 years ago

This appears to fix my problem, thank you @aucampia. Originally I wasn't really sure if it was a bug or issue.

aucampia commented 3 years ago

@wagonhelm My apologies, I think this is actually a bug. What you wrote in the description made me think you were just looking for a way to achieve something ("I need them in decimal value. How could I go about doing this?") and I did not actually check if the behaviour is correct.

https://www.w3.org/TR/xmlschema11-2/#decimal

The lexical space of decimal is the set of lexical representations which match the grammar given above, or (equivalently) the regular expression

(\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)

1E-31 does indeed not match that regex so a fix is needed.
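
The mismatch can be checked directly against the regular expression quoted above. This is a minimal sketch using only Python's standard re module; the helper name is_valid_xsd_decimal is made up for illustration and is not part of RDFLib:

```python
import re

# Lexical space of xsd:decimal, copied from the regular expression quoted above.
XSD_DECIMAL_RE = re.compile(r"(\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)")


def is_valid_xsd_decimal(lexical: str) -> bool:
    """Hypothetical helper: True if the whole string matches the decimal grammar."""
    return XSD_DECIMAL_RE.fullmatch(lexical) is not None


print(is_valid_xsd_decimal("1E-31"))    # False: exponents are not allowed
print(is_valid_xsd_decimal("1E-31.0"))  # False: the form seen in the Turtle output
print(is_valid_xsd_decimal("-12.5"))    # True
```

fullmatch is used rather than match so that a valid prefix such as the leading "1" does not count as a match.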

Please re-open it.

wagonhelm commented 3 years ago

@aucampia, I re-opened, though I do not thoroughly understand the issue or the library and fear I'm not using the right words. Ultimately my issue was that when I tried to load the resulting turtle using Wikibase / WDQS I would get an error with any line with scientific notation. When reloading the .ttl using rdflib it results in the following:

triple = %s (rdflib.term.URIRef('http://www.wikidata.org/entity/Q184832'), rdflib.term.URIRef('http://www.wikidata.org/prop/direct/P2201'), rdflib.term.Literal('1e-31', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#double')))
str(o) = %s 1e-31
o.value = %s/%s <class 'float'> 1e-31
o.n3() = %s "1e-31"^^<http://www.w3.org/2001/XMLSchema#double>

The floatRep production is equivalent to this regular expression (after whitespace is removed from the regular expression):

(\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)([Ee](\+|-)?[0-9]+)?
|(\+|-)?INF|NaN

The ·value space· of double contains the non-zero numbers m × 2^e, where m is an integer whose absolute value is less than 2^53, and e is an integer between −1074 and 971, inclusive.
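
The quoted value-space definition can be illustrated for a single value. This is a rough sketch using only the standard library; decompose is a name invented here, and it only yields an exponent in the quoted range for normal (not subnormal) doubles:

```python
import math
from fractions import Fraction
from typing import Tuple


def decompose(x: float) -> Tuple[int, int]:
    """Hypothetical helper: return (m, e) with x == m * 2**e and |m| < 2**53.

    Assumes x is finite and non-zero.
    """
    frac, exp = math.frexp(x)   # x == frac * 2**exp with 0.5 <= |frac| < 1
    m = int(frac * 2**53)       # exact: scaling by a power of two loses no bits
    return m, exp - 53


m, e = decompose(1e-31)
print(abs(m) < 2**53, -1074 <= e <= 971)
# Exact rational check that m * 2**e is the stored double, not the decimal 10**-31.
print(Fraction(m) * Fraction(2) ** e == Fraction(1e-31))
```

The last line compares against the binary value actually stored for 1e-31, which is why a float cannot stand in for an exact xsd:decimal.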

aucampia commented 3 years ago

@wagonhelm

"1e-31" is completely valid for xsd:double - but invalid for xsd:decimal

If I run CONSTRUCT { ?s ?p ?o } WHERE { VALUES (?s ?p) { (wd:Q184832 <http://www.wikidata.org/prop/direct/P2201>) } ?s ?p ?o } against Wikidata it returns this:

curl  --silent 'https://query.wikidata.org/sparql' \
  --header "Accept: application/n-triples" \
  --data-urlencode 'query=CONSTRUCT { ?s ?p ?o } WHERE { VALUES (?s ?p) { (wd:Q184832 <http://www.wikidata.org/prop/direct/P2201>) } ?s ?p ?o }'
<http://www.wikidata.org/entity/Q184832>
  <http://www.wikidata.org/prop/direct/P2201> 
  "0.0000000000000000000000000000001"^^<http://www.w3.org/2001/XMLSchema#decimal> .

For me RDFLib formats that as 1E-31 unless I first do:

        rdflib.term.bind(
            XSD.decimal,
            decimal.Decimal,
            constructor=decimal.Decimal,
            lexicalizer=lambda val: f"{val:f}",
            datatype_specific=True,
        )

This is a workaround to a real issue, and I would describe the real issue being worked around here as "Invalid serialization of xsd:decimal to scientific notation".
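
What the lexicalizer in this workaround changes can be seen without RDFLib at all. The sketch below uses only the standard decimal module and makes no claim about how RDFLib produces its default lexical form internally; it only shows that str() and the "f" format spec differ for a small Decimal:

```python
from decimal import Decimal

value = Decimal("1E-31")

# str() uses scientific notation for small values like this one.
print(str(value))  # 1E-31

# The "f" format spec always produces fixed-point notation, with no exponent.
fixed = f"{value:f}"
print(fixed)

# The fixed-point text round-trips to the same Decimal value.
print(Decimal(fixed) == value)  # True
```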

If you prefer xsd:double to be serialized in decimal notation instead of scientific notation, that is a user preference, and the right solution there would be to use rdflib.term.bind as follows (note XSD.double instead of XSD.decimal; I did not test this, so it may be wrong):

        rdflib.term.bind(
            XSD.double,
            float,
            constructor=float,
            lexicalizer=lambda val: f"{val:f}",
            datatype_specific=True,
        )

I cannot guarantee this would be compliant with XSD, but I have no specific reason to think it would be non-compliant. It's just user beware.
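
One thing worth checking before relying on the untested snippet above: for a Python float, the "f" format spec defaults to six decimal places, so very small values collapse to zero. The sketch below shows this and one possible alternative. fixed_point is a hypothetical helper, not an RDFLib function; it has not been tried as a lexicalizer with rdflib.term.bind, and it does not handle INF or NaN:

```python
from decimal import Decimal


def fixed_point(val: float) -> str:
    """Hypothetical helper: fixed-point text built from the float's shortest repr."""
    # repr() gives the shortest digits that round-trip; Decimal renders them
    # without an exponent when formatted with "f".
    return format(Decimal(repr(val)), "f")


print(f"{1e-31:f}")        # 0.000000 (the value is lost)
print(fixed_point(1e-31))  # fixed-point text with no exponent
print(float(fixed_point(1e-31)) == 1e-31)  # True
```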

aucampia commented 3 years ago

Just noticed a mistake in my last comment and corrected it. I will make a fix for it once #1315 is merged.