dask / fastparquet

python implementation of the parquet columnar file format.
Apache License 2.0

See what happens if we don't track thrift i32 #925

Open martindurant opened 4 months ago

This change speeds up opening a local file with large metadata by about 15%. The measurement covers both reading the bytes and parsing the thrift, but only the parsing is actually affected.

This will cause data corruption whenever a thrift object is written directly from one that was read, rather than created afresh. I believe this only happens in ParquetFile.remove_row_groups and ParquetFile.write_row_groups, which are rarely used and must make sure to add the i32 fields to the object dictionaries.
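To illustrate why dropping the i32 bookkeeping is risky on re-write, here is a minimal toy sketch of the thrift compact protocol's integer fields. It is not fastparquet's code: the names encode, decode and track_i32 are hypothetical, and it only assumes the compact-protocol facts that i32 has type id 5, i64 has type id 6, and both are stored as zigzag varints. Once decoded, both become plain Python ints, so a writer that is not told which fields were i32 cannot reproduce the original bytes.

```python
# Toy model of thrift compact-protocol integer fields (hypothetical helpers,
# not fastparquet's implementation). Short-form field headers only.
CT_STOP, CT_I32, CT_I64 = 0, 5, 6


def zigzag(n):
    """Map a signed int (within 64 bits) to an unsigned one."""
    return (n << 1) ^ (n >> 63)


def unzigzag(n):
    return (n >> 1) ^ -(n & 1)


def write_varint(n, out):
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)


def read_varint(buf, pos):
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode(fields, i32=()):
    """Serialize {field_id: int}; ids listed in i32 get type 5, others type 6."""
    out = bytearray()
    last = 0
    for fid in sorted(fields):
        ctype = CT_I32 if fid in i32 else CT_I64
        out.append(((fid - last) << 4) | ctype)  # assumes 1 <= delta <= 15
        write_varint(zigzag(fields[fid]), out)
        last = fid
    out.append(CT_STOP)
    return bytes(out)


def decode(buf, track_i32=True):
    """Return ({field_id: int}, set of ids that were i32 on the wire)."""
    fields, i32, pos, last = {}, set(), 0, 0
    while buf[pos] != CT_STOP:
        header = buf[pos]
        pos += 1
        fid = last + (header >> 4)
        raw, pos = read_varint(buf, pos)
        fields[fid] = unzigzag(raw)
        if track_i32 and (header & 0x0F) == CT_I32:
            i32.add(fid)
        last = fid
    return fields, i32


original = encode({1: 7, 2: 1000}, i32={1})

# With tracking, read-then-write reproduces the original bytes.
fields, seen_i32 = decode(original)
roundtrip_tracked = encode(fields, seen_i32)

# Without tracking, field 1 is silently re-written as i64.
fields, seen_i32 = decode(original, track_i32=False)
roundtrip_untracked = encode(fields, seen_i32)

# The remedy described above: the writer supplies the i32 ids itself.
roundtrip_fixed = encode(fields, i32={1})
```

In this sketch the untracked round trip differs from the original only in the type nibble of the first field header, which is enough for a schema-checking reader to reject or misread the field.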

>>> import fastparquet
>>> pf = fastparquet.ParquetFile("test-data/root/DY1JetsToLL_M-50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8.root.parquet")
>>> %timeit pf = fastparquet.ParquetFile("test-data/root/DY1JetsToLL_M-50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8.root.parquet")

# this PR
27.8 ms ± 240 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# main
34.4 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)