apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.56k stars 3.54k forks source link

[JS] Regression in roundtripping JS Date objects from apache-arrow@15 to apache-arrow@16 #43193

Open marvold-mw opened 4 months ago

marvold-mw commented 4 months ago

Describe the bug, including details regarding any error messages, version, and platform.

The Create a Table from JavaScript arrays example from the documentation is sufficient:

import { tableFromArrays } from 'apache-arrow';

const LENGTH = 2000;

const rainAmounts = Float32Array.from(
    { length: LENGTH },
    () => Number((Math.random() * 20).toFixed(1)));

const rainDates = Array.from(
    { length: LENGTH },
    (_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i));

const rainfall = tableFromArrays({
    precipitation: rainAmounts,
    date: rainDates
});

console.table([...rainfall]);

When running this code with apache-arrow@15.0.2 installed, the date column in the resulting table is a column of JavaScript Date objects. With apache-arrow@16.0.0 installed, it is a column of JavaScript numbers representing epoch timestamps. This is a surprising result, I think; while there's no reason why Arrow would be required to map Date objects to something which maps back to Date objects, it is an unusual choice which is not documented and which has clearly been changed recently.

Tested under Node 20.15.0 on Apple Silicon, but I have no reason to believe this is platform-specific.

PR #40892 appears to have gone through some iteration after I tested it in https://github.com/duckdb/duckdb-wasm/issues/1231, and these follow-up changes caused this regression. Perhaps it's desirable—but I'm not quite convinced this was the correct outcome? Feel free to close as by design (but I think it would be worth documenting all of the. type mappings going from JS to Arrow and back, at the very least).

Component(s)

JavaScript

trxcllnt commented 3 months ago

Dates are relatively expensive to create in node, so the decision was made in #40892 to always return the numeric representation, and allow users to construct Dates/handle timezones themselves, i.e. new Date(ts.get(0)).

This strikes a balance between API consistency and performance, as without it, users who didn't wish to incur the cost of constructing Date instances would need to manually enumerate the raw buffers in each underlying data instance (and handle null checks etc. manually):

ts.data.forEach(chunk => chunk.values.forEach(num => /* compute w/ timestamp */))

while there's no reason why Arrow would be required to map Date objects to something which maps back to Date objects

The Vector creation convenience functions still accept Arrays of Date instances because they infer the dtype if none is explicitly passed.