data-forge / data-forge-ts

The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
http://www.data-forge-js.com/
MIT License
1.34k stars 77 forks source link

[Discussion] Floating point precision #104

Open MichielDeMey opened 3 years ago

MichielDeMey commented 3 years ago

Hi

We're currently using data-forge for a fintech project and we're loving it! Current usage is to aggregate a list of financial transactions by month and create a sum, basically giving us a time-series representation of our data.

This works great, however JavaScript is notorious for (incorrectly) handling floating points and losing precision in the process. Unfortunately, when working with financial data you'd need that floating point precision sweetness.

Relevant snippet:

const result = data.groupBy((row) => moment(row.due_date).startOf(aggregation).format("YYYY-MM"))
  .select((group) => ({
    type: group.first().type,
    due_date: group.first().due_date,
    amount: group.deflate((row) => row.amount).sum()
  }))

Which will return something like

__index__   type     due_date    amount             
----------  -------  ----------  -------------------
2020-01-01  payable  2020-01-01  -300046.7899999998 
2020-02-01  payable  2020-02-01  -399881.57999999955
2020-03-01  payable  2020-03-01  -4724.579999999969 
2020-04-01  payable  2020-04-01  29564.520000000066 
2021-01-01  payable  2021-01-01  1750               

As you can see, the floating point precision is not very favourable. Looking at the code, the sum function does exactly what you'd expect as it just sums up two number using plain JavaScript.

https://github.com/data-forge/data-forge-ts/blob/master/src/lib/series.ts#L4011-L4019

Floating point precision has been an issue for a long time and have been solved by several libraries already. I noticed this library already using numeral, which can handle floating point precision.

With that in mind, we wrote a custom aggregation function to use numeral to fix our floating point precision.

const result = data.groupBy((row) => moment(row.due_date).startOf(aggregation).format("YYYY-MM"))
  .select((group) => ({
    type: group.first().type,
    due_date: group.first().due_date,
    amount: group
      .deflate((row) => row.amount)
      .aggregate((prev: number, value: number) =>
        numeral(prev).add(value).value() // Use Numeral to add two numbers, returning the resulting value
      )
  }))

And correctly returns

__index__   type     due_date    amount    
----------  -------  ----------  ----------
2020-01-01  payable  2020-01-01  -300046.79
2020-02-01  payable  2020-02-01  -399881.58
2020-03-01  payable  2020-03-01  -4724.58  
2020-04-01  payable  2020-04-01  29564.52  
2021-01-01  payable  2021-01-01  1750      

I created a separate CodeSandbox with the actual code so you can play with it: https://codesandbox.io/s/eloquent-elbakyan-16v5i?file=/src/App.tsx


With this in mind, I'd like to open up the discussion of handling floating point precision in data-forge.

The way I see it, we can either

a) Provide support for floating point precision using numeral This would improve the developer experience of this library as floating point precision "just works". (For the manipulation functions supported by numeral (add, subtract, multiply, divide))

e.g.

amount: group.deflate((row) => row.amount).sum(true) // pass in "true" to allow floating point precision

or a separate util function

amount: group.deflate((row) => row.amount).sumPrecise()

b) Clearly document that data-forge does not handle floating point precision out-of-the-box and provide an example on how to keep floating point precision using numeral or another library (numbro, bigdecimal.js etc.).

Curious to hear your thoughts about this!

ashleydavis commented 3 years ago

This is awesome thanks so much for bringing it up.

I have actually had to deal with this before (in multiple languages actually not just JavaScript).

For JavaScript, I've previously used Decimal.js, but Numeral is also good.

I've wanted to fix this directly in Data-Forge for a few years now. I'm just not sure the best way to go about it.

Maybe what it needs is some kind of plugin to DF that (when supplied) can handle the arithmetic. Then we could plugin Numeral and Decimal.js. Or maybe a more generic statics plugin so that we can override all the statistical capabilities to DF with customized versions.

What do you think? Any better ideas?

Would you be interested in working on this?

If it were backward compatible it could be included in an update to version 2 of DF.