apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[JS] Fully null column of type `Bool` produces incompatible IPC stream with JS package #39776

Open csjh opened 9 months ago

csjh commented 9 months ago

Describe the bug, including details regarding any error messages, version, and platform.

Hello! Continuing from kylebarron/arrow-wasm#58.

Calling `tableToIPC` on a table with a fully null `Bool` column produces bytes that arrow-rs and pyarrow cannot read (presumably other implementations too, but those are the ones I tested).

Reproduction:

import {
    vectorFromArray,
    Table,
    Bool,
    tableToIPC
} from 'apache-arrow';

const table = new Table({ x: vectorFromArray([null, null], new Bool()) });
table.schema.fields[0].nullable = true;

console.log(tableToIPC(table).join(', '));

# Python: read the bytes printed by the JavaScript snippet above
import pyarrow as pa

reader = pa.ipc.open_stream(bytes([<bytes from the serialized table>]))
table = reader.read_all()

# x: [<Invalid array: Buffer #1 too small in array of type bool and length 2: expected at least 1 byte(s), got 0>]
print(table)

It appears to be a JavaScript-only bug, as there are two pyarrow examples of fully null Bool columns that work fine.
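For anyone reading the error message: Arrow stores booleans bit-packed, one bit per value, so the data buffer of a Bool array with 2 values needs at least 1 byte even when every value is null. Below is a minimal sketch of that size arithmetic. The function names are hypothetical and this is not pyarrow's actual validation code, only an illustration of why "expected at least 1 byte(s), got 0" is reported for length 2.

```python
def min_bool_data_bytes(length: int) -> int:
    """Smallest data buffer, in bytes, that can hold `length` bit-packed booleans.

    Hypothetical helper for illustration; Arrow packs 8 boolean values per byte.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    # Round up to a whole number of bytes.
    return (length + 7) // 8


def bool_data_buffer_is_large_enough(length: int, buffer_size: int) -> bool:
    """Return True if a data buffer of `buffer_size` bytes can back `length` booleans."""
    return buffer_size >= min_bool_data_bytes(length)


if __name__ == "__main__":
    # The case from this issue: 2 values, 0-byte data buffer.
    print(min_bool_data_bytes(2))
    print(bool_data_buffer_is_large_enough(2, 0))
```

Under this reading, the stream written by the JS package carries a zero-length data buffer for the all-null column, while the readers expect the bit-packed buffer to be present regardless of the null count.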

Version: 15.0.0

Component(s)

JavaScript

domoritz commented 7 months ago

Thanks for the issue. Can you try to send a pull request?

csjh commented 6 months ago

Sorry, I'm not too familiar with the arrow serialization. I can give it a shot though - are there any general overviews available of the tableToIPC logic (for the JS codebase)?

domoritz commented 6 months ago

The closest is @trxcllnt's talk from a few years ago: https://docs.google.com/presentation/d/17pFpCVbRpZJPKZbGMU4yOeGVkfrDlbK5n2ZdtDa20is/edit#slide=id.gc888a5c5c8_1_0. Everything else would not be JS-specific.

trxcllnt commented 6 months ago

From the error, seems like this optimization may be to blame.