Closed alamb closed 3 weeks ago
take
...while looking into this I noticed that there are no statistics written for an `Interval`, which is also described here.
@alamb I guess we can't extract any statistics here? And writing tests that only check that no statistics are written does not seem very helpful?
I actually think these would be helpful: then, as soon as there are statistics, we can hook them up to the tests. If you had time to write the tests, that would be great. We can then perhaps file a ticket in parquet-rs for supporting writing statistics for interval types.
Sure, I can do that. Off the top of my head: `fn run()` from the `struct Test` panics if we can't extract any statistics, which is the case here. So I'd prepare as much as possible (creating record batches, adding a `Scenario`, writing those tests) but for now would assert that the tests should panic. Does this make any sense to you, @alamb?
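A minimal sketch of the "assert should panic" idea, using only the standard test attributes. The `Test` struct, its fields, and the panic message here are hypothetical stand-ins, not the actual helper from the DataFusion test suite:

```rust
/// Hypothetical stand-in for the `Test` helper discussed above.
struct Test {
    /// Extracted minimum statistics, one entry per row group. `None` models
    /// a column (such as an interval) for which no statistics were written.
    min_stats: Option<Vec<i64>>,
    /// The minimum values the test expects to find.
    expected_min: Vec<i64>,
}

impl Test {
    /// Panics when no statistics are available, otherwise checks them.
    fn run(self) {
        let min = self.min_stats.expect("no statistics found for column");
        assert_eq!(min, self.expected_min);
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    /// Documents today's behavior. Once statistics are written for interval
    /// columns, drop `should_panic` and fill in the real expectations.
    #[test]
    #[should_panic(expected = "no statistics found")]
    fn interval_statistics_not_written_yet() {
        Test { min_stats: None, expected_min: vec![1, 2] }.run();
    }
}
```

With this shape, the record batches, the scenario, and the expected values are all in place, and the only change needed later is removing the `should_panic` attribute.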
I did some digging in order to find out why / or where the writing of those statistics is not supported (yet). Since I'm not familiar with the parquet impl, here are my findings, which might be useful in a follow-up ticket in arrow-rs.
In `fn write_slice()` the min and max values are never updated due to a filter condition that checks if the type is INTERVAL.
I think this should be possible, or put differently, I don't yet see the reason why this is not supported. Something similar (comparing FixedLenByteArrays) is already done for DECIMAL here.
Perhaps, you have some more information on this @alamb - otherwise this might be enough information to file a ticket in arrow-rs?
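To make the finding concrete, here is a rough, self-contained sketch of a min/max update that skips interval values while still comparing fixed-length byte arrays for decimals. All names (`ConvertedType`, `MinMax`, `update_min_max`, `compare_signed_be`) are made up for illustration and are not the parquet crate's API. As far as I can tell, the Parquet format documentation describes the sort order of INTERVAL as undefined, which may be the reason for the filter:

```rust
use std::cmp::Ordering;

/// Hypothetical, simplified type tag; not the parquet crate's enum.
#[derive(Clone, Copy, PartialEq, Debug)]
enum ConvertedType {
    Decimal,
    Interval,
}

/// Hypothetical min/max tracker for fixed-length byte array values.
struct MinMax {
    min: Option<Vec<u8>>,
    max: Option<Vec<u8>>,
}

/// Compares two equal-length big-endian two's-complement values, the
/// encoding used for DECIMAL stored as a fixed-length byte array.
fn compare_signed_be(a: &[u8], b: &[u8]) -> Ordering {
    // Flipping the sign bit makes unsigned byte-wise order match signed order.
    let key = |v: &[u8]| {
        let mut k = v.to_vec();
        if let Some(first) = k.first_mut() {
            *first ^= 0x80;
        }
        k
    };
    key(a).cmp(&key(b))
}

/// Updates `stats` from `values`, skipping interval columns entirely.
fn update_min_max(stats: &mut MinMax, ty: ConvertedType, values: &[Vec<u8>]) {
    // Models the filter condition: no min/max is tracked for INTERVAL.
    if ty == ConvertedType::Interval {
        return;
    }
    for v in values {
        if stats.min.as_ref().map_or(true, |m| compare_signed_be(v, m) == Ordering::Less) {
            stats.min = Some(v.clone());
        }
        if stats.max.as_ref().map_or(true, |m| compare_signed_be(v, m) == Ordering::Greater) {
            stats.max = Some(v.clone());
        }
    }
}
```

Supporting intervals would mean replacing the early return with a comparison that is meaningful for the 12-byte interval layout, which is exactly the open question for a follow-up ticket.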
Is your feature request related to a problem or challenge?
Part of https://github.com/apache/datafusion/issues/10453, where we are filling out support for extracting statistics for all data types from parquet files
At the moment, even if statistics are extracted for a different type (like `Int32`), the `PruningPredicate` will attempt to cast these values to the correct type: https://github.com/apache/datafusion/blob/acd7106fa40fad58f50ae06227971c51073d8f48/datafusion/core/src/physical_optimizer/pruning.rs#L909-L911
However, in order to be efficient and ensure the cast kernel doesn't add anything incorrectly, we should be extracting the parquet statistics as the correct Array type directly. It turns out we do not do this yet for several types, and those types do not have good (or any) test coverage. We almost missed this in https://github.com/apache/datafusion/pull/10711 from @xinlifoobar
Thus, we need to add support and tests for other types.
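As a rough illustration of what "extracting as the correct type directly" could look like, consider a column whose logical type is an 8-bit integer: Parquet stores it with the physical type INT32, so its statistics arrive as `i32` values. The function below is a hypothetical, standard-library-only sketch (it does not use the arrow or parquet crates) that produces 8-bit values without a separate cast step:

```rust
/// Converts per-row-group statistics read as physical INT32 values into
/// 8-bit values. `None` means "statistic unknown" for that row group.
///
/// Hypothetical helper for illustration only.
fn int32_stats_to_int8(stats: &[Option<i32>]) -> Vec<Option<i8>> {
    stats
        .iter()
        // A value that does not fit in i8 is treated as unknown rather than
        // being wrapped or truncated, so it can never cause wrong pruning.
        .map(|s| s.and_then(|v| i8::try_from(v).ok()))
        .collect()
}
```

Treating out-of-range values as unknown is one possible design choice: a missing statistic only means a row group cannot be pruned, whereas a silently wrapped value could prune data that should have been read.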
Describe the solution you'd like
Add a test (run with `cargo test --test parquet_exec`) with the relevant type.
Here are some example PRs:
Describe alternatives you've considered
No response
Additional context
No response