apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

Will JS, similar to PyArrow, ever have the ability to read Parquet from disk into Arrow? #10244

Closed ali-habibzadeh closed 3 years ago

ali-habibzadeh commented 3 years ago

From what I understand, Parquet is for storage and Arrow is for in-memory querying. Are you planning to offer Parquet reading on the JS side, or is that project mainly for learning only?

Similarly, it seems that in the Python version one can specify partitioning options for writing multiple files, which is not present in the JS version but helps when large amounts of data are involved.

Also, there seems to be no way to read JSON data, provide an Arrow schema, and load it into a table.

It's an exciting technology; looking forward to seeing it mature more!

westonpace commented 3 years ago

From what I understand, Parquet is for storage and Arrow is for in-memory querying

I would leave it at "Arrow is for in-memory", but yes, you are correct.

Are you planning to offer Parquet reading on the JS side, or is that project mainly for learning only?

You could search the mailing list; that's probably the closest you will come to a project-wide long-term plan. However, from a cursory search, I do not see anyone actively working on this feature. I don't know of any reason it couldn't happen. I'm not sure what you mean by "learning only"? There are many use cases for JS projects that don't read files from disk, for example, any browser project.

Parquet would be a nice feature, especially for node-based backend servers, but it isn't a necessary feature for data analysis. For example, there are many visualization libraries written in JS. These libraries can just accept Arrow data from external applications via IPC and don't need to read it from disk themselves.

Similarly, it seems that in the Python version one can specify partitioning options for writing multiple files, which is not present in the JS version but helps when large amounts of data are involved.

This is correct. This is the "datasets" API; it is not part of the Arrow Columnar Format, and at the moment I think it is limited to the implementations based on C++ (Python, R, Ruby, C/GLib).

Also, there seems to be no way to read JSON data, provide an Arrow schema, and load it into a table.

There is no JS implementation that currently reads JSON (https://arrow.apache.org/docs/status.html#third-party-data-formats). There is nothing preventing it, but it has not been a high enough priority for anyone to implement.

ali-habibzadeh commented 3 years ago

Thanks, that confirmed my thoughts. For a Node.js serverless or backend application this is not an option as a query engine; it's just for browser-based things.

Backend apps need to ingest (Arrow is too large compared to Parquet to be a storage strategy), partition, load, query, and deliver, and most of that is missing for this format. From a Node.js standpoint, that makes it more of an introduction-to-Arrow project for learning and not for building serious apps.

Of course, it's OK to also have recipes that combine a multitude of tools to cover application-building patterns using Arrow in Node.js, but from my research so far, no such resource or ecosystem exists either. That includes an unsuccessful attempt at integrating with https://github.com/ironSource/parquetjs.

Thank you for your detailed answers.