Open code1704 opened 1 year ago
yes, we do have it in BigQuery and it is freaking amazing !!
Thanks @djouallah . What is BigQuery? How to use it? We are using duckdb. Can we use this in duckdb?
Are you using pyarrow? Have you looked at https://arrow.apache.org/docs/python/dataset.html#writing-partitioned-data ?
And when do filtering like date > "2021-09-02" and date < "2022-04-06", will it read the partition files between 2021-09 to 2022-04 only?
Yes, that should work.
@westonpace I think we are talking about two different thing, ideally arrow should be able to partition by a field date, but instead of generating a file by day, it will generate a file by year, I don't think it is supported yet, a workaround is to create a new field year and use it as a partition column, the problem is you have to add it as a filter in the where clause.
BigQuery and I think iceberg support that functionality partition by Year (field)
Thank you for the clarification. You are correct that this is not supported. Just to clarify. The directories are structured something like:
/year=2020/month=01/day=01/chunk-0.parquet
...
/year=2021/month=03/day=12/chunk-0.parquet
Also, it sounds like there will be cases where you do not have the day (or maybe that is the typical case), correct?
@westonpace I don't know how it is implemented, BigQuery don't expose its file format.
Thank you for the clarification. You are correct that this is not supported. Just to clarify. The directories are structured something like:
/year=2020/month=01/day=01/chunk-0.parquet ... /year=2021/month=03/day=12/chunk-0.parquet
Also, it sounds like there will be cases where you do not have the day (or maybe that is the typical case), correct?
Yes. And we use duckdb as the query engine: duckdb + arrow dataset. For query like SELECT * FROM some_dataset WHERE date BETWEEN "2015-03-05" AND "2018-04012"
, partition pruning is expected to work.
A bit surprised how this isn't a bigger issue. Doesn't anybody use arrow partitions or do they just use other solutions?
Running into the same situations.
The issue is that year, month, day, when used as partition columns, can't be compared independently of each other when writing filters for push down. They should be combined.
Does Arrow allow things like "make_a_date(year, month, day) > a_date" to be pushed down?
The data columns: datetime, value
I want to build a dataset with partitioning like ("date.year", "date.month"). How to do that? And when do filtering like date > "2021-09-02" and date < "2022-04-06", will it read the partition files between 2021-09 to 2022-04 only?
Thanks