apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] [JavaScript] Arrow IPC file output by apache-arrow tableToIPC method cannot be read by pyarrow #31100

Closed asfimport closed 9 months ago

asfimport commented 2 years ago

IPC files created by the node library apache-arrow do not appear to be readable by pyarrow. A minimal reproduction is available here: https://github.com/dancoates/pyarrow-jsarrow-test

 

Writing the Arrow file from JS:


import {tableToIPC, tableFromArrays} from 'apache-arrow';
import fs from 'fs';

const LENGTH = 2000;
const rainAmounts = Float32Array.from(
    { length: LENGTH },
    () => Number((Math.random() * 20).toFixed(1)));

const rainDates = Array.from(
    { length: LENGTH },
    (_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i));

const rainfall = tableFromArrays({
    precipitation: rainAmounts,
    date: rainDates
});

const outputTable = tableToIPC(rainfall);
fs.writeFileSync('jsarrow.arrow', outputTable); 

 

Reading in Python:


import pyarrow as pa
with open('jsarrow.arrow', 'rb') as f:
    with pa.ipc.open_file(f) as reader:
        df = reader.read_pandas()
        print(df.head())

 

produces the error:


pyarrow.lib.ArrowInvalid: Not an Arrow file 

 

 

Reporter: Dan Coates

Note: This issue was originally created as ARROW-15642. Please see the migration documentation for further details.

asfimport commented 2 years ago

Weston Pace / @westonpace: You are creating a table in the Arrow IPC streaming format and then attempting to read it using the Arrow IPC file format.

You can either change your JS to write the file format:

https://arrow.apache.org/docs/js/modules/Arrow_dom.html#tableToIPC


const outputTable = tableToIPC(rainfall, 'file');

or you can change your Python code to read the stream format:


with pa.ipc.RecordBatchStreamReader('jsarrow.arrow') as reader:
    tab = reader.read_all()
    print(tab.to_pandas())
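
One quick way to tell which format a given file is in: the Arrow IPC file format begins (and ends) with the 6-byte magic string ARROW1, while the streaming format starts directly with an encapsulated message. A minimal check (a sketch, not part of the original comment):

# Peek at the first bytes of the file to distinguish the two IPC formats.
with open('jsarrow.arrow', 'rb') as f:
    magic = f.read(6)
print('file format' if magic == b'ARROW1' else 'stream format (or not Arrow IPC at all)')
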
asfimport commented 2 years ago

Dan Coates: Thank you for the explanation, @westonpace; it is much appreciated.

I think my confusion might have stemmed from the fact that arquero creates arrow files in the streaming format. I'll see if it is possible to add an option to arquero to output IPC file format instead.

asfimport commented 2 years ago

Weston Pace / @westonpace: That sounds like a good plan. One thing to keep in mind is that the two formats have different purposes, so it is entirely possible that arquero (I'm afraid I don't know much about this library) is intentionally using the streaming format and the file format simply doesn't make sense for it.

Streaming format: the recipient can start processing results before receiving the entire delivery. Typically used when sending results between two processes.

File format: allows random access to batches. Typically used when storing data on disk or on some other storage device with random-access capabilities.

Most (maybe all?) language implementations can read and write both formats.
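
For example, in pyarrow the two access patterns look roughly like this (a sketch; the file names and per-batch handling are placeholders):

import pyarrow as pa

# Stream format: batches are consumed in order, as soon as they are read.
with open('data.arrows', 'rb') as f:
    reader = pa.ipc.open_stream(f)
    for batch in reader:
        ...  # handle each RecordBatch incrementally

# File format: the footer enables random access to any batch by index.
with open('data.arrow', 'rb') as f:
    reader = pa.ipc.open_file(f)
    last = reader.get_batch(reader.num_record_batches - 1)
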

asfimport commented 2 years ago

Dominik Moritz / @domoritz: Interesting discussion. I wonder whether we should create the IPC file format by default instead of the stream format, since file is the more common use case. What's the default in other libraries?

asfimport commented 2 years ago

Paul Taylor / @trxcllnt: @domoritz the IPC stream format is the more common use case, at least in real-time ETL processing. The file format is useful for reading more efficiently from disk, but it is not suited to inter-process communication.

If a consumer process wanted the advantage of constant-time random batch access (as the file format provides), it could buffer the stream until it has finished and then write the footer. However, it is not possible to process an incoming Arrow table in the IPC file format in batches as they arrive, because the IPC file reader blocks until it sees the footer at the end.
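
In pyarrow, that buffering approach would look roughly like the following sketch (it re-serializes the buffered data with a footer rather than appending one in place; file names are placeholders):

import pyarrow as pa

# Read the complete stream into memory, then write it back out in the
# file format so later readers get random access to the batches.
with open('jsarrow.arrow', 'rb') as f:
    table = pa.ipc.open_stream(f).read_all()

with pa.OSFile('jsarrow.file.arrow', 'wb') as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)
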

asfimport commented 2 years ago

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

domoritz commented 9 months ago

Closing since one can change the format and defaulting to stream makes sense.