ben-strasser / fast-cpp-csv-parser

fast-cpp-csv-parser
BSD 3-Clause "New" or "Revised" License
2.09k stars 435 forks source link

Loss of precision on float reading #122

Open Etendard7 opened 3 years ago

Etendard7 commented 3 years ago

Hi,

Thank you for this library !

I observed a surprising anomaly when reading some float numbers (as double).

If we take this csv, for example, and read the first column as a double :

Col1,Col2 9.855536508,0

The first value will be read as 9.8555365080000054 It should be read as 9.8555365080000001

The second one is the value you get with other parsers. It is also the value you get if you do double x = 9.855536508;

For my application, this has an impact that leads to a bigger bug.

Do not hesitate to ask me for further tests if you wish.

paulharris commented 3 years ago

I suspect you are looking at noise.

Quoted text : Addressing examples

A double normally stores 16 (some might argue 17) significant decimal digits. MSVC is processing 17 digits. Anything beyond that is noise. GCC is doing as you ask it, but there aren't enough bits in a double to warrant the extra 14 digits you are requesting. If you had 16-byte 'long double' values (SPARC, PPC, Intel x86_64 for Mac), then you might warrant 32 significant figures. However, the differences you are showing are QoI; I might even argue that MS is doing a better job than GCC/glibc here (and I don't often say that!).

From here :

https://stackoverflow.com/questions/2499329/strtod-and-sprintf-inconsistency-under-gcc-and-msvc/6988597#6988597

On Mon., 12 Jul. 2021, 3:26 am Etendard7, @.***> wrote:

Hi,

Thank you for this library !

I observed a surprising anomaly when reading some float numbers (as double).

If we take this csv, for example, and read the first column as a double :

Col1,Col2 9.855536508,0

The first value will be read as 9.8555365080000054 It should be read as 9.8555365080000001

The second one is the value you get with other parsers. It is also the value you get if you do double x = 9.855536508;

For my application, this has an impact that leads to a bigger bug.

Do not hesitate to ask me for further tests if you wish.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ben-strasser/fast-cpp-csv-parser/issues/122, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMSASY4H4UUAT5JRURFAJLTXHV5BANCNFSM5AFSPI3A .

ben-strasser commented 3 years ago

Hi, thanks for the bug report.

Currently, the float parsing is handled using

 x = 0;
while('0' <= *col && *col <= '9'){
        int y = *col - '0';
        x *= 10;
        x += y;
        ++col;
}

if(*col == '.'|| *col == ','){
        ++col;
        T pos = 1;
        while('0' <= *col && *col <= '9'){
                pos /= 10;
                int y = *col - '0';
                ++col;
                x += y*pos;
        }
}

Where x is a float, double or long double. This code contains quite a few floating point computations that can introduce small errors.

I think this can be improved by parsing the floating first as integer plus some exponent and then doing a cast and a single division. This means that "1234.789" would be parsed as integer "1234789" and exponent "-3". One would then do something along float(1234789)*pow(10,-3).

Would parsing 9.855536508 by computing double(9855536508ll)*pow(10,-9) get you the result you need?

For float, this should work when parsing the digits first as int64. To correctly parse double and long double I think that more than a 64 bit integer are necessary. It is unclear to me what the best course of action is here. Maybe parsing it using multiple int64 is a good idea. For example, one could parse 456468484554529.8454545546684682555365 as double(456468484554529845ll)*pow(10,-3) + double(4545546684682555365ll)*pow(10,-22)

Important: pow is slow. A faster computation of powers of 10 is necessary. pow is just for exposition.

Etendard7 commented 2 years ago

Hi,

Thank you both for these explanations.

I confirm that computing double(9855536508ll)*pow(10,-9) returns the value I expect. Compared to the value returned by the parser, it is closer to the theoretical value by a factor of 50.

I guess there is a trade-off between precision and speed of parsing.

ShaharHD commented 2 weeks ago

Just encountered the exact same issue.

I've used std::string to read and then use sscanf with %le for double

But is there a better way by providing a custom "decoder"?