RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Turtle / Trig lose precision serializing XSD.double #1852

Open jdogburck opened 2 years ago

jdogburck commented 2 years ago

Serializing objects typed as XSD.double to Turtle (and its derived TriG) format loses precision after the 6th decimal place in the serialized exponential format. This seems to be a rehash of issue #237, closed in 2013, although the horizon has moved a bit.

I think it comes from the "%e" format in Literal._literal_n3() as used by the TurtleSerializer, which forces use_plain=True. The issue isn't present for untyped or Decimal types.

The use case doesn't lend itself to changing the representation of ingested input graphs, except possibly in the scenario where the precision may exceed IEEE-754 64-bit.

thanks JB

>>> from rdflib import Graph, Literal, XSD, Namespace
>>>
>>> double_ = Literal('11.23456789', datatype=XSD.double)
>>>
>>> ns = Namespace('http://test/')
>>> g = Graph()
>>> g.add((ns.uri1, ns.p_double, double_))
<Graph identifier=N9062490911684d988fe499b4a195a0d4 (<class 'rdflib.graph.Graph'>)>
>>>
>>> print(float(double_))
11.23456789
>>> print(g.serialize(format='turtle'))
@prefix ns1: <http://test/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ns1:uri1 ns1:p_double 1.123457e+01 .
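
The rounding can be reproduced without rdflib. The following is a minimal sketch using only the standard library. It assumes the serializer applies the "%e" format suspected above, followed by the trailing-zero cleanup regex from the term.py line quoted in the replies below; the variable names are illustrative only.

```python
import re

value = 11.23456789

# "%e" with no explicit precision keeps 6 digits after the decimal point.
default_e = "%e" % value

# Cleanup step as in the term.py line quoted in the replies:
# drop trailing zeros (and a dangling ".") in front of the exponent.
cleaned = re.sub(r"\.?0*e", "e", default_e)

print(default_e)                # 1.123457e+01
print(float(cleaned) == value)  # False: the serialized text no longer round-trips
```

The printed value matches the serialized triple above, which is consistent with the "%e" format being the source of the truncation.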

kone807 commented 2 years ago

based on issue #237, I made a minor change to get the expected output. In term.py, if we simply change -

if self.datatype == _XSD_DOUBLE:
    return sub("\\.?0*e", "e", "%e" % float(self))

to

if self.datatype == _XSD_DOUBLE:
    return str(float(self.value))  # was: sub("\\.?0*e", "e", "%e" % float(self))

will this be sufficient?

ghost commented 2 years ago

based on issue #237, I made a minor change to get the expected output ... will this be sufficient?

Yes, if you primarily care about precision, but be aware that the edit causes a change in datatype from XSD.double to XSD.decimal: without an exponent, Turtle reads a plain number such as 11.23456789 as a decimal. As mentioned by @jdogburck, this known limitation was discussed by gromgull, and the use of XSD.decimal was recommended:

In general, you cannot expect the precision of doubles or floats to be kept - since floating point representation simply isn't precise ... A related gotcha with floating point is that number of significant digits are not kept ... if you care about either of these, use XSD.decimals, which are mapped to python objects using the decimal module
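
To make the datatype change concrete, here is a rough sketch of how the Turtle grammar distinguishes bare numeric tokens. The regular expressions are transcribed from the INTEGER, DECIMAL and DOUBLE productions of the Turtle specification; `turtle_numeric_datatype` is a hypothetical helper written for this illustration and is not part of rdflib.

```python
import re

# Token patterns transcribed from the Turtle grammar.
_EXPONENT = r"[eE][+-]?[0-9]+"
_DOUBLE = re.compile(
    rf"[+-]?(?:[0-9]+\.[0-9]*{_EXPONENT}|\.[0-9]+{_EXPONENT}|[0-9]+{_EXPONENT})"
)
_DECIMAL = re.compile(r"[+-]?[0-9]*\.[0-9]+")
_INTEGER = re.compile(r"[+-]?[0-9]+")


def turtle_numeric_datatype(token: str) -> str:
    """Return the datatype implied by a bare (unquoted) Turtle numeric token."""
    if _DOUBLE.fullmatch(token):
        return "xsd:double"
    if _DECIMAL.fullmatch(token):
        return "xsd:decimal"
    if _INTEGER.fullmatch(token):
        return "xsd:integer"
    raise ValueError(f"not a bare Turtle number: {token!r}")


print(turtle_numeric_datatype(str(11.23456789)))  # xsd:decimal
print(turtle_numeric_datatype("1.123456789e+01"))  # xsd:double
```

Only the presence of an exponent marks a bare token as a double, so `str(float(...))` output is read back as a decimal unless Python happens to choose exponent notation for that value.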

The limit of precision is mentioned in (aka “buried in”) the docstring for Literal._literal_n3():

# Only limited precision available for floats:
>>> Literal(0.123456789)._literal_n3(use_plain=True)
u'1.234568e-01'

and is inherited from Python string formatting of e: “With no precision given, uses a precision of 6 digits after the decimal point for float ...”
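
As a quick illustration of that default, the sketch below compares the implicit precision with an explicit one, using only built-in string formatting. The chosen precisions (8 and 16) are examples, not values taken from rdflib.

```python
value = 0.123456789

six_digits = "%e" % value               # default: 6 digits after the decimal point
eight_digits = "%.8e" % value           # explicit precision
nested = "{:.{}e}".format(value, 8)     # precision passed as a format argument

# 17 significant digits (".16e") are enough to round-trip any IEEE-754 double,
# at the cost of exposing binary representation noise for many values.
full = "%.16e" % value

print(six_digits)            # 1.234568e-01
print(eight_digits)          # 1.23456789e-01
print(float(full) == value)  # True
```

A fixed high precision always round-trips but tends to produce long output, which is why choosing the precision per value is attractive.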

So, yes, the limiting LoC is:

https://github.com/RDFLib/rdflib/blob/24d607015d0d7a759785afcce54bd66c94be8f94/rdflib/term.py#L1376

and I had a play around with setting the precision:

diff --git a/rdflib/term.py b/rdflib/term.py
index c0ea4a28..01d788cc 100644
--- a/rdflib/term.py
+++ b/rdflib/term.py
@@ -1357,6 +1357,16 @@ class Literal(Identifier):
             u'"1"^^xsd:integer'

         """
+
+        # https://stackoverflow.com/a/6098154
+        # used to determine number of precise digits in a string
+        def __get_precision(str_value):
+            vals =  str_value.split('.')
+            if (vals[0] == '0'):
+                return len(vals[1])
+            else:
+                return len(str_value) - 1
+
         if use_plain and self.datatype in _PLAIN_LITERAL_TYPES:
             if self.value is not None:
                 # If self is inf or NaN, we need a datatype
@@ -1373,7 +1383,11 @@ class Literal(Identifier):
                 # in py >=2.6 the string.format function makes this easier
                 # we try to produce "pretty" output
                 if self.datatype == _XSD_DOUBLE:
-                    return sub("\\.?0*e", "e", "%e" % float(self))
+                    if str(self).split(".")[1] > "0":
+                        rval = '{:.{}e}'.format(float(self), __get_precision(self) - 1)
+                    else:
+                        rval = sub("\\.?0*e", "e", "%e" % float(self))
+                    return rval
                 elif self.datatype == _XSD_DECIMAL:
                     s = "%s" % self
                     if "." not in s and "e" not in s and "E" not in s:

using this test harness:

def test_literal_n3():
    assert Literal(1)._literal_n3(use_plain=True) == '1'
    assert Literal(1.0)._literal_n3(use_plain=True) == '1e+00'
    assert Literal(1.0, datatype=XSD.decimal)._literal_n3(use_plain=True) == '1.0'
    assert Literal(1.0, datatype=XSD.float)._literal_n3(use_plain=True) == '"1.0"^^<http://www.w3.org/2001/XMLSchema#float>'
    assert Literal("foo", datatype=XSD.string)._literal_n3(use_plain=True) == '"foo"^^<http://www.w3.org/2001/XMLSchema#string>'
    assert Literal(True)._literal_n3(use_plain=True) == 'true'
    assert Literal(False)._literal_n3(use_plain=True) == 'false'
    assert Literal(1.91)._literal_n3(use_plain=True) == '1.91e+00'
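    # Only limited precision available for floats (pre-patch value; with the
    # patch above this becomes '1.23456789e-01', so it is not asserted here):
    Literal(0.123456789)._literal_n3(use_plain=True) == u'1.234568e-01'
    assert Literal('0.123456789', datatype=XSD.decimal)._literal_n3(use_plain=True) == '0.123456789'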

@pytest.mark.parametrize(
    "val",
    [
        # From http://www.datypic.com/sc/xsd/t-xsd_double.html

        # xsd               str(float(xsd))         literal_n3

        ("-3E2",            "-300.0",               '-3e+02'),
        ("4268.22752E11",   "426822752000000.0",    '4.268228e+14'),
        # ("+24.3e-3",        "0.0243",               '2.43e-02'),  # before
        ("+24.3e-3",        "0.0243",               '2.430e-02'),
        # ("0.123456789",     "0.123456789",          '1.234568e-01'), # before
        ("0.123456789",     "0.123456789",          '1.23456789e-01'),
        ("12",              "12.0",                 '1.2e+01'),
        ("+3.5",            "3.5",                  '3.5e+00'),  # any value valid for decimal is also valid for xsd:double
        ("-INF",            "-inf",                 '"-INF"^^<http://www.w3.org/2001/XMLSchema#double>'),  # negative infinity
        ("-0",              "-0.0",                 '-0e+00'),  # 0
        ("NaN",             "nan",                  '"NaN"^^<http://www.w3.org/2001/XMLSchema#double>'),  # Not a Number

        # Invalid values
        # -3E2.4 - the exponent must be an integer
        # 12E    - an exponent must be specified if "E" is present
        # NAN    - values are case-sensitive, must be capitalized correctly
        #        - an empty value is not valid, unless xsi:nil is used
    ],
)
def test_xsd_double(val):
    assert Literal(float(val[1]), datatype=XSD.double
        )._literal_n3(use_plain=True) == val[2]

which seems to be the right area of the city in which the ballpark is located, apart from that annoying trailing zero in '2.430e-02'. And for full compliance, it probably should be '{:.{}E}', not '{:.{}e}'.
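
One way to avoid the trailing zero is to stop deriving the precision from the string length and instead search for the smallest precision that still parses back to the same float. This is a different technique from the patch above, sketched here with the standard library only; `shortest_double_lexical` and its `upper` flag are hypothetical names, and NaN/INF are deliberately left to the caller.

```python
import math


def shortest_double_lexical(x: float, upper: bool = False) -> str:
    """Shortest exponent-form text that parses back to exactly x (finite x only)."""
    if math.isnan(x) or math.isinf(x):
        raise ValueError("NaN and INF need the quoted, typed form")
    conv = "E" if upper else "e"
    # 17 significant digits (precision 16) always round-trip a double,
    # so the loop returns at precision 16 at the latest.
    for precision in range(17):
        text = format(x, f".{precision}{conv}")
        if float(text) == x:
            return text
    return format(x, f".16{conv}")  # fallback, not expected to be reached


print(shortest_double_lexical(0.0243))              # 2.43e-02
print(shortest_double_lexical(0.123456789))         # 1.23456789e-01
print(shortest_double_lexical(0.0243, upper=True))  # 2.43E-02
```

The loop costs at most 17 format/parse cycles per literal. Whether that overhead is acceptable inside a serializer is a separate question.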

nandikajain commented 2 years ago

@gjhiggins, Can I take up this issue?

jdogburck commented 2 years ago

all, sorry for the long hiatus - vacation...

that said - I agree with @gjhiggins - particularly the discussion of lack of specificity WRT precision and the need to preserve types and not convert from double to literal. for strongly typed literals, should we have similar tests for XSD.decimal and XSD.integer too?

Regarding keeping maximum precision, I was puttering around with something similar to the following for precision but it doesn't handle non-numeric cases like NaN and INF. I'm sure @gjhiggins handles other cases this simple snip misses so probably best to ignore it.

tmp = str(float(self))
return tmp if 'e' in tmp else f'{tmp}e+00'
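
A standalone variant of the same idea, extended to the non-numeric cases, might look like the sketch below. `xsd_double_lexical` is a hypothetical free function rather than rdflib code; it returns the XSD lexical forms "NaN", "INF" and "-INF" for the special values, which a Turtle serializer would still have to emit as quoted, typed literals, as in the test table earlier in the thread.

```python
import math


def xsd_double_lexical(x: float) -> str:
    """Lexical form for an xsd:double that keeps full precision (illustrative only)."""
    if math.isnan(x):
        return "NaN"
    if math.isinf(x):
        return "INF" if x > 0 else "-INF"
    text = repr(x)  # shortest text that round-trips the float
    # Append an exponent when Python did not, so Turtle reads a double.
    return text if "e" in text else f"{text}e+00"


print(xsd_double_lexical(11.23456789))    # 11.23456789e+00
print(xsd_double_lexical(float("-inf")))  # -INF
```

The output is not in normalized mantissa form (11.23456789e+00 rather than 1.123456789e+01), but it keeps every digit and still matches the Turtle DOUBLE token.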

All that said, at the heart of my issue is round-tripping graphs and reading in expected values, which is probably captured here, but it would be good to have tests for graph ingest and round-tripping of pesky types (https://www.w3.org/TR/turtle/#abbrev), maybe with tests similar to:

@pytest.mark.parametrize(
    "val",
    [
        # turtle value, double literal value

        # Simple cases from https://www.w3.org/TR/turtle/#abbrev
        ("4.2E9", "4.2E9"),  # abbreviated
        ('"4.2e9"^^<http://www.w3.org/2001/XMLSchema#double>', "4.2e9"),

        # Expand with more values from http://www.datypic.com/sc/xsd/t-xsd_double.html
        ("+24.3e-3", "0.0243"),
        ("1.23456789e-01", "0.123456789"),
        ("12e-00", "12.0"),
        # ...
    ],
)
# Verify we get the Literal we expect from known Turtle literals
def test_read_double(val):
    turtle = f'<http://test/s> <http://test/p> {val[0]} .'

    g_in = Graph()
    g_in.parse(data=turtle, format='turtle')
    _, _, input = list(g_in)[0]

    # 1 will do if there are tests for these various forms elsewhere
    assert (input == Literal(float(val[1]), datatype=XSD.double))
    assert (input == Literal(float(val[1])))
    assert (input == Literal(val[1], datatype=XSD.double))

# similarly test_read_decimal(val)
# similarly test_read_integer(val)

@pytest.mark.parametrize(
    "val",
    [
        # double literal value

        # Simple cases from https://www.w3.org/TR/turtle/#abbrev
        "4.2E9",  # abbreviated

        # Expand with more values from http://www.datypic.com/sc/xsd/t-xsd_double.html
        "24.3e-3",
        "1.23456789e-01",
        "12e-00",
        # ...
    ],
)
# Verify we can round trip literal values through Turtle serialization / parsing
def test_round_trip_double(val):

    value = Literal(float(val), datatype=XSD.double)

    g_out = Graph()
    g_out.add((URIRef('http://test/s'), URIRef('http://test/p'), value))
    turtle = g_out.serialize(format='turtle', encoding='utf-8')

    g_in = Graph()
    g_in.parse(data=turtle, format='turtle')
    _, _, input = list(g_in)[0]   # cheesy way to grab the literal
    assert (input == value)

# similarly test_round_trip_decimal(val)
# similarly test_round_trip_integer(val)
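
One caveat for round-trip tests on doubles: plain `==` on Python floats treats -0.0 and 0.0 as equal and NaN as unequal to itself. If the tests should catch those cases, a bit-level comparison is one option. The sketch below uses only the standard library; `same_double` is a hypothetical helper, not something rdflib provides.

```python
import struct


def same_double(a: float, b: float) -> bool:
    """True when both floats have identical IEEE-754 binary64 bit patterns."""
    return struct.pack(">d", a) == struct.pack(">d", b)


nan = float("nan")
print(-0.0 == 0.0, same_double(-0.0, 0.0))  # True False
print(nan == nan, same_double(nan, nan))    # False True
print(same_double(float(repr(11.23456789)), 11.23456789))  # True
```

Such a helper would compare the Python float values of the written and re-read literals, for example `float(value)` against `float(input)` in the round-trip test above.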

thanks for the discussion...