dask / dask-expr

BSD 3-Clause "New" or "Revised" License

reduce pickle size of parquet fragments #1050

Closed fjetter closed 2 months ago

fjetter commented 2 months ago

This is still early stage.

I wondered why our graph is always so large in bytes when using this parquet interface, so I dug in a little. It turns out the fragments themselves are quite large. What I found so far:

We may be able to ditch this wrapper entirely for newer pyarrow versions, but for older ones we may have to be clever about deduplicating things. I still have to run some tests, but I suspect this will reduce graph size by orders of magnitude.
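To illustrate why deduplication could matter this much, here is a rough stdlib-only sketch. It is not the dask-expr or pyarrow code: `make_fragments` and the `(path, shared_state)` tuples are hypothetical stand-ins for fragments that each carry the same bulky state, and the sizes are made up. It compares pickling every fragment on its own against pickling the shared state once plus the per-fragment paths.

```python
import pickle


def make_fragments(n, shared_state):
    # Hypothetical stand-in for parquet fragments: a small path plus
    # bulky state that is identical across all fragments.
    return [(f"part.{i}.parquet", shared_state) for i in range(n)]


def size_pickled_individually(fragments):
    # Each fragment is serialized by itself, so the shared bytes are
    # copied into every single payload.
    return sum(len(pickle.dumps(frag)) for frag in fragments)


def size_pickled_deduplicated(fragments):
    # Serialize each distinct shared state once, then only the paths.
    distinct_states = {state for _, state in fragments}
    shared_size = sum(len(pickle.dumps(state)) for state in distinct_states)
    paths_size = sum(len(pickle.dumps(path)) for path, _ in fragments)
    return shared_size + paths_size


shared_state = b"x" * 100_000  # assumed size, purely illustrative
fragments = make_fragments(100, shared_state)

individual = size_pickled_individually(fragments)
deduplicated = size_pickled_deduplicated(fragments)
print(f"individually: {individual} bytes, deduplicated: {deduplicated} bytes")
```

With 100 fragments sharing the same state, the individually pickled total grows roughly linearly with the number of fragments, while the deduplicated total stays close to the size of one copy. Whether real fragments behave like this depends on what they actually hold, which is what the tests still need to show.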

phofl commented 2 months ago

thx