apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Support writing `IntervalMonthDayNanoArray` to parquet via Arrow Writer #5849

Closed: marvinlanhenke closed 2 weeks ago

marvinlanhenke commented 3 weeks ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

While working on support for converting parquet statistics into ArrayRefs in DataFusion (see apache/datafusion#10453), I noticed that the ColumnWriter currently does not support writing IntervalUnit::MonthDayNano.

This might be the location: https://github.com/apache/arrow-rs/blob/fa8d3502388d7cfac724f7b9fae92abc3a716b6f/parquet/src/arrow/arrow_writer/mod.rs#L854-L874

Describe the solution you'd like

Support for writing IntervalUnit::MonthDayNano in the ColumnWriter.

Describe alternatives you've considered

Additional context

Related to: apache/arrow-rs#5847

alamb commented 3 weeks ago

I updated the title of this ticket to reflect the end behavior I think it is addressing.

Specifically, I think the gap identified by @marvinlanhenke above is that trying to write an IntervalMonthDayNano array to parquet via https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriter.html will not work.

To proceed with this ticket, the first thing would probably be to make a small test case to verify that IntervalMonthDayNanoArray cannot be written to parquet.

tustvold commented 2 weeks ago

I don't believe the parquet specification allows for supporting nanosecond intervals - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#interval

I am therefore not sure this ticket is actionable...
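For readers following along, here is a minimal sketch of why the nanosecond component is the sticking point. The linked Parquet INTERVAL definition annotates a 12-byte fixed-length array holding three little-endian unsigned 32-bit integers: months, days, and milliseconds. The function name `encode_parquet_interval` is hypothetical and is not part of parquet-rs; it only illustrates which month-day-nano values would fit that layout.

```rust
use std::convert::TryFrom;

/// Hypothetical helper (not a parquet-rs API): pack an interval into the
/// 12-byte layout described by the Parquet INTERVAL logical type, i.e.
/// three little-endian u32 values: months, days, milliseconds.
///
/// Returns `None` when the value cannot be represented: a negative
/// component, a sub-millisecond remainder, or a millisecond count that
/// does not fit in a u32.
fn encode_parquet_interval(months: i32, days: i32, nanos: i64) -> Option<[u8; 12]> {
    const NANOS_PER_MILLI: i64 = 1_000_000;

    // Any nanosecond remainder would be lost, so reject it.
    if nanos % NANOS_PER_MILLI != 0 {
        return None;
    }

    // The Parquet fields are unsigned, so negative components do not fit.
    let months = u32::try_from(months).ok()?;
    let days = u32::try_from(days).ok()?;
    let millis = u32::try_from(nanos / NANOS_PER_MILLI).ok()?;

    let mut out = [0u8; 12];
    out[0..4].copy_from_slice(&months.to_le_bytes());
    out[4..8].copy_from_slice(&days.to_le_bytes());
    out[8..12].copy_from_slice(&millis.to_le_bytes());
    Some(out)
}
```

Under these assumptions, an interval of 1 month, 2 days, and 3 ms encodes cleanly, while an interval with a single stray nanosecond has no representation.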

alamb commented 2 weeks ago

I wonder what the guidance should be for people who have IntervalMonthDayNano arrays and want to write the data to Parquet 🤔

Is it "cast the data to an interval type that is supported (IntervalMonthDay)"? If so, I can add a note to the docs.
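As a rough illustration of what such a cast has to decide, the sketch below narrows a month-day-nano value to one of two coarser shapes, and refuses when information would be lost. It assumes the supported targets are Arrow's year-month and day-time interval units; the enum `SupportedInterval` and the function `narrow_month_day_nano` are hypothetical names for illustration, not arrow-rs APIs, and this is not a claim about what arrow's cast kernel does.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for the interval shapes assumed to be writable.
#[derive(Debug, PartialEq)]
enum SupportedInterval {
    /// Total number of months.
    YearMonth(i32),
    /// Whole days plus milliseconds.
    DayTime { days: i32, millis: i32 },
}

/// Hypothetical helper: narrow (months, days, nanos) to a coarser interval
/// only when the conversion is lossless; otherwise return `None`.
fn narrow_month_day_nano(months: i32, days: i32, nanos: i64) -> Option<SupportedInterval> {
    const NANOS_PER_MILLI: i64 = 1_000_000;

    // Only months are set: a year-month interval holds the value exactly.
    if days == 0 && nanos == 0 {
        return Some(SupportedInterval::YearMonth(months));
    }

    // No months and whole milliseconds: a day-time interval holds it exactly,
    // provided the millisecond count fits in an i32.
    if months == 0 && nanos % NANOS_PER_MILLI == 0 {
        let millis = i32::try_from(nanos / NANOS_PER_MILLI).ok()?;
        return Some(SupportedInterval::DayTime { days, millis });
    }

    // Mixed months and days, or sub-millisecond precision: nothing fits.
    None
}
```

Values that mix months with days, or that carry sub-millisecond precision, have no lossless target here, so any documentation note would need to say that the cast can fail or truncate.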

Another potential option would be "write this type as a FIXED_LEN_BYTE_ARRAY or something (with no parquet logical type)", which would permit round-tripping data written by parquet-rs back to an ArrayRef but would not be readable by any other implementation.

I dug around in arrow and found some indications that Java doesn't support it either: https://github.com/apache/arrow/blob/65974672a356f34889ed7b9bfb8b76230c27c7ee/java/dataset/src/test/java/org/apache/arrow/dataset/TestAllTypes.java#L94-L96
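A minimal sketch of that fixed-length idea, assuming a 16-byte little-endian layout of months (i32), days (i32), and nanoseconds (i64). The layout and both function names are assumptions made for illustration; nothing here is an agreed format or an existing parquet-rs API.

```rust
use std::convert::TryInto;

/// Hypothetical encoding: months (i32), days (i32), nanoseconds (i64),
/// each little-endian, packed into 16 bytes.
fn month_day_nano_to_bytes(months: i32, days: i32, nanos: i64) -> [u8; 16] {
    let mut out = [0u8; 16];
    out[0..4].copy_from_slice(&months.to_le_bytes());
    out[4..8].copy_from_slice(&days.to_le_bytes());
    out[8..16].copy_from_slice(&nanos.to_le_bytes());
    out
}

/// Inverse of `month_day_nano_to_bytes`, returning (months, days, nanos).
fn month_day_nano_from_bytes(bytes: [u8; 16]) -> (i32, i32, i64) {
    // The slice lengths are fixed by the ranges, so the conversions cannot fail.
    let months = i32::from_le_bytes(bytes[0..4].try_into().unwrap());
    let days = i32::from_le_bytes(bytes[4..8].try_into().unwrap());
    let nanos = i64::from_le_bytes(bytes[8..16].try_into().unwrap());
    (months, days, nanos)
}
```

Because no logical type would describe these bytes, a reader other than the one that wrote them would see only an opaque 16-byte value, which is the interoperability cost mentioned above.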

tustvold commented 2 weeks ago

> cast the data to an interval type that is supported

Documenting this is, I think, the least potentially controversial path forward.

alamb commented 2 weeks ago

Proposed documentation update: https://github.com/apache/arrow-rs/pull/5875