apache / arrow-julia

Official Julia implementation of Apache Arrow
https://arrow.apache.org/julia/

Official support for the Apache Parquet format #410

Open kazuakiyama opened 1 year ago

kazuakiyama commented 1 year ago

I'm a radio astronomer interested in using this Julia-native implementation of the Apache Arrow in-memory format for black hole imaging with the Event Horizon Telescope. First of all, thanks for developing this package! We became interested in it because the Apache Arrow and Parquet formats are being considered as major candidates for the next-generation radio astronomy data format.

I'm wondering whether the package envisions implementing I/O functions for the Apache Parquet format in the future. I read a previous issue on this topic. I believe no method is yet available to load columnar data from a Parquet file directly into Arrow.jl's in-memory representation, or to write it back. The only pure-Julia way to handle this seems to be converting the on-disk data to the Arrow IPC format using both Parquet.jl and Arrow.jl, and then reloading it into memory with Arrow.jl.
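For illustration, the two-step workaround described here might look roughly like the sketch below. It assumes Parquet.jl's `read_parquet` returns a Tables.jl-compatible table that `Arrow.write` can consume; `parquet_to_arrow` and the file paths are hypothetical names introduced only for this example.

```julia
using Parquet   # assumed to provide read_parquet / write_parquet
using Arrow     # provides Arrow.write and Arrow.Table

# Hypothetical helper (not part of either package): convert a Parquet file
# to an Arrow IPC file on disk, then load that file with Arrow.jl.
function parquet_to_arrow(parquet_path::AbstractString, arrow_path::AbstractString)
    tbl = read_parquet(parquet_path)   # read the Parquet data as a Tables.jl source
    Arrow.write(arrow_path, tbl)       # re-serialize it in the Arrow IPC format
    return Arrow.Table(arrow_path)     # reload the IPC file as an Arrow.Table
end
```

The intermediate Arrow file is the extra cost of this route: the data is written twice before it can be used as an `Arrow.Table`.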

This is somewhat problematic for our use case, and it is a major issue preventing us from using this package and Apache's columnar formats in Julia. I think the key issues are

Given the many similarities and overlaps between the Apache Parquet and Arrow specifications, I feel it is more straightforward to request Parquet I/O features in Arrow.jl than to request the missing features from the existing Julia Parquet packages. Any thoughts on this are appreciated. Thanks!

Moelf commented 1 year ago

Parquet and Arrow (Feather) are two completely different formats. So no, the fact that we have a more polished Arrow.jl doesn't mean we can build a Parquet implementation here.

kazuakiyama commented 1 year ago

Thanks for the reply. It is a bit of bad news for us, but we will explore an alternative option.