apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Union dataset from partitioned dataset and table #41472

Open adriangb opened 4 months ago

adriangb commented 4 months ago

Describe the usage question you have. Please include as many useful details as possible.

My understanding is that with a partitioned dataset, query engines can push predicates down to prune partitions. How does this work with a union dataset? I'm specifically interested in the union of an in-memory dataset and a partitioned dataset, e.g. the result of:

dataset([dataset(table), dataset('path/to/partitioned/files')])

Can I somehow create a virtual partitioned dataset from the table and have that work with predicate pushdown? Does predicate pushdown work with a union dataset like this in general?

Component(s)

Python

AlenkaF commented 3 months ago

I am not 100% sure, but this might need a new feature: partition_expression would most probably do what you need, but it is not exposed in the dataset method. Would you be willing to experiment with it?