apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][Parquet] Expose key_value_metadata in parquet::ColumnChunkMetaData #42757

Open asfimport opened 7 years ago

asfimport commented 7 years ago

This is available already at the file level:

https://github.com/apache/parquet-cpp/blob/master/src/parquet/file/metadata.h#L177

but not at the ColumnChunk level.

Reporter: Wes McKinney / @wesm

Note: This issue was originally created as PARQUET-1107. Please see the migration documentation for further details.

asfimport commented 7 years ago

Ryan Blue / @rdblue: What's the use case for this? I don't think we support it in the Java version either. Curious about whether that's something we should require in the format.

asfimport commented 7 years ago

Wes McKinney / @wesm: I see. I wasn't sure whether some Parquet implementations might be writing data to this field while we offered no way to access it (the Thrift structs are not publicly exposed in parquet-cpp).

asfimport commented 7 years ago

Rahul Kumar Challapalli: Thanks for reporting this Jira, @wesm. @rdblue My use case is simple enough. I want to store the min and max for a single sorted column at the row-group level, and probably at the page level as well. Am I missing an obvious way to do this?

asfimport commented 7 years ago

Wes McKinney / @wesm: Ah, you want to use the built-in statistics for that rather than key-value metadata.

asfimport commented 7 years ago

Rahul Kumar Challapalli: @wesm Thank you, I knew something simple like this should have been there. Now, how are these statistics populated? I would like to either set them programmatically (for the min/max case) or provide a comparator. Also, I am using the Arrow abstraction over the Parquet readers and writers. It would be helpful if you could point me to code or tests that write and read statistics.
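
For readers following the thread: Parquet writers generally compute column-chunk statistics themselves as values are written, so the caller does not normally set them by hand. The sketch below illustrates the idea only. It is not the parquet-cpp or Arrow API, and every name in it (ChunkStats, write_row_groups, row_groups_to_scan) is hypothetical. It assumes an integer column and shows how per-row-group min/max values let a reader skip row groups that cannot match a range filter, which is the sorted-column use case described above.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ChunkStats:
    """Hypothetical stand-in for one column chunk's min/max statistics."""
    min_value: int
    max_value: int
    num_values: int


def write_row_groups(
    values: Sequence[int], row_group_size: int
) -> Tuple[List[List[int]], List[ChunkStats]]:
    """Split values into row groups and record min/max for each one,
    the way a writer would while flushing each column chunk."""
    groups: List[List[int]] = []
    stats: List[ChunkStats] = []
    for start in range(0, len(values), row_group_size):
        chunk = list(values[start:start + row_group_size])
        groups.append(chunk)
        stats.append(ChunkStats(min(chunk), max(chunk), len(chunk)))
    return groups, stats


def row_groups_to_scan(stats: Sequence[ChunkStats], lo: int, hi: int) -> List[int]:
    """Return indices of row groups whose [min, max] range overlaps [lo, hi].
    Row groups outside the range can be skipped without reading their data."""
    return [
        i for i, s in enumerate(stats)
        if s.max_value >= lo and s.min_value <= hi
    ]


if __name__ == "__main__":
    column = list(range(0, 100, 5))  # sorted column: 0, 5, ..., 95
    groups, stats = write_row_groups(column, row_group_size=5)
    # Only row groups whose min/max overlap [30, 40] need to be read.
    print(row_groups_to_scan(stats, lo=30, hi=40))
```

Because the column is sorted, the min/max ranges of consecutive row groups do not overlap, so a range filter touches only a contiguous run of row groups. Page-level statistics would apply the same idea at a finer granularity.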