I wondered why our graph is always very large in bytes when using this parquet interface and dug a little. I found that the fragments are indeed quite large. What I found so far
Something like the partition expression is a pyarrow.compute Expression. Even the vanilla True, i.e. "read all partitions" expression packs already 700KB and this is attached to every file.
The parquetfileformat object is in a similar vicinity although we're rarely if ever specifiying anything and we could just as well use the default
Most importantly, though, I discovered that most things we're unpacking manually in this Fragment are ad-hoc wrapped C objects, i.e. every time I invoke Fragement.filesystem I get a new python object even though the underlying C object is identical. This confuses pickle. Pickle typically deduplicates objects but with this it doesn't stand a chance.
We may be able to ditch this wrapper for newer pyarrow versions entirely but for old ones we may have to be clever at deduplicating things. I still have to run some tests but I suspect this will reduce graph size by orders of magnitude
This is still early stage.
I wondered why our graph is always very large in bytes when using this parquet interface and dug a little. I found that the fragments are indeed quite large. What I found so far
Fragement.filesystem
I get a new python object even though the underlying C object is identical. This confuses pickle. Pickle typically deduplicates objects but with this it doesn't stand a chance.We may be able to ditch this wrapper for newer pyarrow versions entirely but for old ones we may have to be clever at deduplicating things. I still have to run some tests but I suspect this will reduce graph size by orders of magnitude