apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Parquet] Support for writing binary column in stream writer in Parquet #14998

Open letmaik opened 1 year ago

letmaik commented 1 year ago

Describe the enhancement requested

Parquet's stream_writer.h has no operator<< overload that accepts std::vector<uint8_t> or a similar byte container and writes the data with converted type "none". It would be useful to have one.

Subclassing the existing StreamWriter class and calling its protected WriteVariableLength function is not currently possible as a workaround, because that function enforces UTF-8:

https://github.com/apache/arrow/blob/91ee6dad722ee154d63eea86ce5644e1e658b53b/cpp/src/parquet/stream_writer.cc#L143-L145
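
To make the request concrete, here is a minimal, standard-library-only sketch of the behaviour being asked for. It is a toy model, not Arrow code: ToyStreamWriter, ToyConvertedType, CheckColumn and Append are hypothetical names invented for illustration, and the real StreamWriter may be structured differently. The sketch only assumes what the issue states, namely that the string path insists on a UTF8 column, and adds a byte-vector overload that insists on a column with no converted type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for Parquet's converted types (hypothetical, not parquet::ConvertedType).
enum class ToyConvertedType { kNone, kUtf8 };

// Hypothetical, simplified model of a stream writer over BYTE_ARRAY columns.
class ToyStreamWriter {
 public:
  // One converted type per column; values are written column by column, row after row.
  explicit ToyStreamWriter(std::vector<ToyConvertedType> columns)
      : columns_(std::move(columns)) {
    assert(!columns_.empty());
  }

  // Models the behaviour described in the issue: strings require a UTF8 column.
  ToyStreamWriter& operator<<(const std::string& v) {
    CheckColumn(ToyConvertedType::kUtf8);
    Append(reinterpret_cast<const std::uint8_t*>(v.data()), v.size());
    return *this;
  }

  // Models the requested overload: raw bytes require a column with no converted type.
  ToyStreamWriter& operator<<(const std::vector<std::uint8_t>& v) {
    CheckColumn(ToyConvertedType::kNone);
    Append(v.data(), v.size());
    return *this;
  }

  // All values written so far, in write order.
  const std::vector<std::vector<std::uint8_t>>& values() const { return values_; }

 private:
  // Throws if the column about to be written does not have the expected converted type.
  void CheckColumn(ToyConvertedType expected) const {
    if (columns_[next_ % columns_.size()] != expected) {
      throw std::invalid_argument("column converted type mismatch");
    }
  }

  // Stores a copy of the bytes and advances to the next column.
  void Append(const std::uint8_t* data, std::size_t len) {
    values_.emplace_back(data, data + len);
    ++next_;
  }

  std::vector<ToyConvertedType> columns_;
  std::vector<std::vector<std::uint8_t>> values_;
  std::size_t next_ = 0;
};
```

In this model, a writer built over a UTF8 column followed by a plain binary column accepts a std::string and then a std::vector<std::uint8_t>, while streaming bytes into the UTF8 column throws. With only the string overload available, there is no way to reach the binary column at all.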

Component(s)

C++, Parquet

vdemichev commented 2 months ago

This is also undocumented: you have to read the sources to find out that StreamWriter can only write UTF8. As a result, it is impossible to write binary data that will then be read correctly by, e.g., Polars (Python).