apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.23k stars 3.47k forks source link

arrow dataset: how to use date.year and date.month as partitioning #14619

Open code1704 opened 1 year ago

code1704 commented 1 year ago

The data columns: datetime, value

I want to build a dataset with partitioning like ("date.year", "date.month"). How to do that? And when do filtering like date > "2021-09-02" and date < "2022-04-06", will it read the partition files between 2021-09 to 2022-04 only?

Thanks

djouallah commented 1 year ago

yes, we do have it in BigQuery and it is freaking amazing !!

code1704 commented 1 year ago

Thanks @djouallah . What is BigQuery? How to use it? We are using duckdb. Can we use this in duckdb?

westonpace commented 1 year ago

Are you using pyarrow? Have you looked at https://arrow.apache.org/docs/python/dataset.html#writing-partitioned-data ?

And when do filtering like date > "2021-09-02" and date < "2022-04-06", will it read the partition files between 2021-09 to 2022-04 only?

Yes, that should work.

djouallah commented 1 year ago

@westonpace I think we are talking about two different thing, ideally arrow should be able to partition by a field date, but instead of generating a file by day, it will generate a file by year, I don't think it is supported yet, a workaround is to create a new field year and use it as a partition column, the problem is you have to add it as a filter in the where clause.

BigQuery and I think iceberg support that functionality partition by Year (field)

westonpace commented 1 year ago

Thank you for the clarification. You are correct that this is not supported. Just to clarify. The directories are structured something like:

/year=2020/month=01/day=01/chunk-0.parquet
...
/year=2021/month=03/day=12/chunk-0.parquet

Also, it sounds like there will be cases where you do not have the day (or maybe that is the typical case), correct?

djouallah commented 1 year ago

@westonpace I don't know how it is implemented, BigQuery don't expose its file format.

code1704 commented 1 year ago

Thank you for the clarification. You are correct that this is not supported. Just to clarify. The directories are structured something like:

/year=2020/month=01/day=01/chunk-0.parquet
...
/year=2021/month=03/day=12/chunk-0.parquet

Also, it sounds like there will be cases where you do not have the day (or maybe that is the typical case), correct?

Yes. And we use duckdb as the query engine: duckdb + arrow dataset. For query like SELECT * FROM some_dataset WHERE date BETWEEN "2015-03-05" AND "2018-04012", partition pruning is expected to work.

jmakov commented 9 months ago

A bit surprised how this isn't a bigger issue. Doesn't anybody use arrow partitions or do they just use other solutions?

steinwaywhw commented 3 months ago

Running into the same situations.

The issue is that year, month, day, when used as partition columns, can't be compared independently of each other when writing filters for push down. They should be combined.

Does Arrow allow things like "make_a_date(year, month, day) > a_date" to be pushed down?