Open wjones127 opened 1 year ago
BTW, this would allow column projection pushdown, based on this protocol: https://github.com/dask/dask/blob/12c7c10d0c15391a6522fe2dc7df191f8088967e/dask/dataframe/io/utils.py#L224
Maybe that same protocol will support filter pushdown too?
FYI I'm working towards standardizing the interface for PyArrow datasets to make it easier for engines to consume, including Dask. My research for that is how I found that. If interested, feel free to read and/or comment on this document. https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit?usp=sharing
I saw this example and thought it might be a better way to load data.
Right now it looks like you rely on being able to just read from the Parquet files and load the partition values from HIVE-style directories. This isn't robust in two ways:
In the future, it would be best not to rely on reading from the file URIs and instead read from the dataset fragments, which will provide the correct data as the Delta Protocol continues to evolve.