dask-contrib / dask-deltatable

A Delta Lake reader for Dask
BSD 3-Clause "New" or "Revised" License

Consider using dataset fragments instead of Parquet URIs #13

Open · wjones127 opened this issue 1 year ago

wjones127 commented 1 year ago

I saw this example and thought it might be a better way to load data.

Right now it looks like you rely on reading the Parquet files directly and loading the partition values from Hive-style directories. This isn't robust, for two reasons:

  1. Hive-style directories aren't guaranteed by the Delta Lake format. The Delta protocol states that "This directory format is only used to follow existing conventions and is not required by the protocol. Actual partition values for a file must be read from the transaction log." ^1
  2. Deletion vectors and column mapping mean that reading the Parquet files as-is won't give you the correct data once we start supporting reader protocols 2 and 3.

Going forward, it would be best not to rely on reading from the file URIs and instead read from the dataset fragments, which will continue to yield correct data as the Delta protocol evolves. A sketch of that approach follows.
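For illustration, here's a minimal sketch of the fragment-based approach, assuming the `deltalake` package's `DeltaTable.to_pyarrow_dataset()`; the table path is hypothetical. Each fragment carries its partition values from the transaction log, so no directory names have to be parsed:

```python
from deltalake import DeltaTable

# Hypothetical local table path; the dataset is built from the Delta
# transaction log rather than from a directory listing.
dt = DeltaTable("path/to/delta_table")
dataset = dt.to_pyarrow_dataset()

for fragment in dataset.get_fragments():
    # Partition values are attached to each fragment as an expression,
    # so Hive-style directory names never need to be parsed.
    print(fragment.path, fragment.partition_expression)
    # Each fragment can then back one Dask partition:
    table = fragment.to_table()
```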

wjones127 commented 1 year ago

BTW, this would allow column projection pushdown, based on this protocol: https://github.com/dask/dask/blob/12c7c10d0c15391a6522fe2dc7df191f8088967e/dask/dataframe/io/utils.py#L224

Maybe that same protocol will support filter pushdown too?
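As a rough illustration of how that could fit together, here's a sketch of an IO function implementing the `DataFrameIOFunction` protocol from the link above. `DeltaFragmentFunction` is an invented name for this sketch, not part of dask-deltatable:

```python
from dask.dataframe.io.utils import DataFrameIOFunction


class DeltaFragmentFunction(DataFrameIOFunction):
    """Hypothetical IO function: reads one dataset fragment per Dask partition."""

    def __init__(self, columns=None):
        self._columns = columns

    @property
    def columns(self):
        # Columns currently selected for projection (None means all columns).
        return self._columns

    def project_columns(self, columns):
        # Called by Dask's graph optimizer to push column projection into the read.
        if columns == self.columns:
            return self
        return DeltaFragmentFunction(columns=columns)

    def __call__(self, fragment):
        # Projection (and eventually filtering) is handled by the fragment read,
        # rather than by reading the full file and dropping columns afterwards.
        return fragment.to_table(columns=self.columns).to_pandas()
```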

wjones127 commented 1 year ago

FYI, I'm working towards standardizing the interface for PyArrow datasets to make it easier for engines, including Dask, to consume; that research is how I found the example above. If you're interested, feel free to read and/or comment on this document: https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit?usp=sharing