BurntSushi / xsv

A fast CSV command line toolkit written in Rust.
The Unlicense
10.35k stars 321 forks source link

math precision problems in `xsv stats` #100

Open vielmetti opened 7 years ago

vielmetti commented 7 years ago

Input data set, "iris.csv" from csvkit, a copy which is at https://gist.github.com/447b7d1ef6ef79add6084272e91dcf83

What I got:

$ xsv stats iris.csv
field,type,sum,min,max,min_length,max_length,mean,stddev
sepal_length,Float,876.5000000000002,4.3,7.9,3,3,5.843333333333333,0.8253012917851416
sepal_width,Float,458.60000000000014,2,4.4,3,3,3.0573333333333315,0.4344109677354942
petal_length,Float,563.7000000000004,1,6.9,3,3,3.758,1.759404065775303
petal_width,Float,179.90000000000012,0.1,2.5,3,3,1.1993333333333336,0.7596926279021594
species,Unicode,,Iris-setosa,Iris-virginica,11,15,,

What I expected:

Field type "Sum" more exact.

hmenke commented 7 years ago

That is floating point math. Nothing to do there.

See also this Stack Overflow thread: Is floating point math broken?

BurntSushi commented 7 years ago

Sorry I didn't see this earlier. I agree that @hmenke is probably right here. Regardless, this issue doesn't have enough detail to be actionable. It's not clear what "I expected 'sum' to be more exact" actually means. What problem are you trying to solve?

vielmetti commented 7 years ago

I expected that xsv stats, which says it's a statistics command, would display precisely 876.5 in the sum field for sepal_length given this data set. If the underlying native Rust math libraries are not sufficiently conditioned to avoid reporting spurious precision, then I'd suggest looking for some that do a better job.

BurntSushi commented 7 years ago

Oh I see. In that case, it seems like @hmenke is correct. This has nothing to do with "Rust math libraries" and everything to do with floating point arithmetic.

One possible fix I can imagine to this is that xsv records the number of significant digits in the input, and makes sure the output has the same number of significant digits. Patches are welcome, although this does seem rather unpleasant to implement!

If the underlying native Rust math libraries are not sufficiently conditioned to avoid reporting spurious precision, then I'd suggest looking for some that do a better job.

This isn't actually helpful. Example: "If xsv is not sufficiently conditioned to avoid reporting spurious precision, then I'd suggest looking for some other tool that does a better job."

vielmetti commented 7 years ago

I recognize that floating point is not an easy problem. (Ask me some time about doing financial math in Tcl.)

I am aware of tools that do a better job specifically in the realm of producing summary statistics for CSV files; csvstat from csvkit in particular prints the correct answers on the same data set. So I know it's possible.

Looking at csvkit it appears that it uses the Python agate library, which stores numbers in a Decimal format rather than as float.

http://agate.readthedocs.io/en/1.6.0/cookbook/compute.html?highlight=decimal

and indeed looking at Rust libraries I see

https://github.com/alkis/decimal

as a possible compatible library for representing numbers in a format that doesn't generate the evident and completely avoidable errors that floating point arithmetic does.

BurntSushi commented 7 years ago

@vielmetti Thanks for those tips! It sounds like this might be something worth experimenting with, although I probably won't do it any time soon myself.

Yomguithereal commented 6 years ago

Sorry if I'm missing the point here, but can't we improve the float sum exactedness a little by using Kahan summation here (or Kahan-Babuska, even faster)? I think I could add a PR to add it if you want. It's not a silver bullet and clearly less exact than using decimal types but it could be a compromise.

BurntSushi commented 6 years ago

@Yomguithereal It's not clear to me that the algorithm for summation is the problem, but rather, the representation of decimal numbers as floats. You are more than welcome to experiment, but please keep in mind that I'll likely reject anything that increases the maintenance burden.