Open j-bennet opened 1 year ago
Currently in dask-deltatable, we're using pyarrow.dataset.dataset, which we filter with a pyarrow.Expression:
https://github.com/dask-contrib/dask-deltatable/blob/dbeb8cc3f94ac6bc612e5dab6f8d3440f37455e6/dask_deltatable/core.py#L78
Would ParquetDataset be more appropriate here? It accepts filters either as an Expression or in tuple/DNF form, which would allow us to skip that filters_to_expression step.
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html