apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.56k stars 3.54k forks source link

[JS] Decimal-type conversions result in errors/inaccuracies in Arrow JS #37920

Open jheer opened 1 year ago

jheer commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

Thanks for all the work on Arrow! Unfortunately, I've run into problems in the JS API around both encoding and decoding of Decimal-typed data.

Encoding:

arrow = require('apache-arrow')
arrow.vectorFromArray([1, 12, 34], new arrow.Decimal(3, 18))

Results in TypeError: d.subarray is not a function

(Incidentally it also seems that the Decimal type constructor uses a different argument order than pyarrow.)

Decoding:

Since encoding failed in JS, let's do it in pyarrow:

import pyarrow as pa
v = pa.array([1, 12, 34], type=pa.decimal128(18, 3))
batch = pa.record_batch([v], names=['d'])

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
   writer.write_batch(batch)
sink.getvalue().hex()

and then we can carry the resulting byte string over to JavaScript and try to decode it:

const hex = 'FFFFFFFF780000001000000000000A000C000600050008000A000000000104000C000000080008000000040008000000040000000100000014000000100014000800060007000C00000010001000000000000107100000001C0000000400000000000000010000006400000008000C0004000800080000001200000003000000FFFFFFFF8800000014000000000000000C0016000600050008000C000C0000000003040018000000300000000000000000000A0018000C00040008000A0000003C00000010000000030000000000000000000000020000000000000000000000000000000000000000000000000000003000000000000000000000000100000003000000000000000000000000000000E8030000000000000000000000000000E02E0000000000000000000000000000D0840000000000000000000000000000FFFFFFFF00000000';
const bytes = Uint8Array.from(hex.match(/.{1,2}/g).map(s => parseInt(s, 16)));
const vec = arrow.tableFromIPC(bytes).getChild('d');

vec.get(0)         // [1000, 0, 0, 0] <- byte array looks correct
`${vec.get(0)}`    // "1000" <- incorrect, should be 1; ignores scale
Number(vec.get(0)) // 2.2136092888451462e+23 <- incorrect, should be 1
+vec.get(0)        // TypeError: can't convert BigInt to number <- ???

While the basic byte encoding looks OK, all of the built-in conversions appear to be faulty.

Related: #28804, #35745 cc: @domoritz

Component(s)

JavaScript

dioptre commented 7 months ago

Whats the status on this? Thanks!

Working with actual decimals (not ints - looks v broken, ex. decimal(38,9))

trxcllnt commented 7 months ago

Converting from 64 and 128-bit decimals to 64-bit floats isn't implemented yet. I don't have the bandwidth to do it currently, but PR's are welcome. libcudf's C++ implementation would be a great starting point: https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/include/cudf/fixed_point/fixed_point.hpp

jheer commented 7 months ago

I wrote our own converter for Mosaic if it helps: https://github.com/uwdata/mosaic/blob/main/packages/core/src/util/convert-arrow.js#L132

dioptre commented 7 months ago

Thanks guys, we wrote our own yesterday too. Will push a PR shortly. Looks like theres a bug in your code too @jheer will update soon.

mbostock commented 7 months ago

We just ran into this using DuckDB-Wasm. Here’s a simple query that demonstrates the problem:

SELECT 11111111100::INT128 AS n

This returns a DecimalBigNum consisting of

[2521176508, 2, 0, 0]

And likewise string coercion returns the expected "11111111100", but number coercion gives 4.460189873590169e+44.

domoritz commented 7 months ago

https://github.com/apache/arrow/pull/40729 starts to fix decimal support. More to come.

domoritz commented 6 months ago

The next step is to automatically apply the scaling of the type when one calls get on an arrow vector. We probably want to create a new BigNum type and refactor what we have in that file (some of it is left over from before BigInt was widely available in js).