dask-contrib / dask-deltatable

A Delta Lake reader for Dask
BSD 3-Clause "New" or "Revised" License

Reuse metadata from deltalake when reading parquet #22

Open j-bennet opened 1 year ago

j-bennet commented 1 year ago

In dask-deltatable, when calling `dd.read_parquet`, perhaps we can reuse the metadata already preserved in the Delta JSON log instead of collecting it from the parquet files all over again.

Here:

https://github.com/dask-contrib/dask-deltatable/blob/cd731a9a237c84c824aff1e8ea61d4ba6988f3b9/dask_deltatable/core.py#L196

It looks like `dd.read_parquet` will have to go through the parquet files to read the metadata, but the `DeltaTable` should already have all of that info.

jrbourbeau commented 1 year ago

Good catch @j-bennet. I think adding `dataset={"schema": dt.schema().to_pyarrow()}` as a keyword to this `read_parquet` call

https://github.com/dask-contrib/dask-deltatable/blob/cd731a9a237c84c824aff1e8ea61d4ba6988f3b9/dask_deltatable/core.py#L196

should do the trick. Though it'd be nice if someone could confirm this is the case.

j-bennet commented 1 year ago

> I think adding `dataset={"schema": dt.schema().to_pyarrow()}` as a keyword to this `read_parquet` call should do the trick. Though it'd be nice if someone could confirm this is the case.

I think the Delta log also contains column stats, so maybe we can avoid gathering those from the parquet files as well.
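For context, each `add` action in a Delta commit file stores per-file statistics as a JSON-encoded string, per the Delta transaction protocol. A stdlib-only sketch (the action below is hand-written with hypothetical values, not read from a real `_delta_log`) of pulling column stats out of such an action without opening any parquet file:

```python
import json

# A simplified "add" action like those found in a Delta _delta_log
# commit file; note that "stats" is itself a JSON-encoded string.
commit_line = json.dumps({
    "add": {
        "path": "part-0.parquet",
        "stats": json.dumps({
            "numRecords": 3,
            "minValues": {"a": 1},
            "maxValues": {"a": 3},
        }),
    }
})

action = json.loads(commit_line)
stats = json.loads(action["add"]["stats"])

# Per-column min/max and row counts, obtained from the log alone.
print(stats["numRecords"], stats["minValues"]["a"], stats["maxValues"]["a"])
```

If dask-deltatable read these stats instead of parquet footers, it could build row-group statistics (e.g. for filter pushdown) straight from the log.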