This change speeds up opening a local file with large metadata by about 15%. The timing covers both byte reading and thrift parsing, but only the parsing is actually affected.
This will cause data corruption if a thrift object that was read from a file is written back directly, rather than being created afresh. I believe this only happens in ParquetFile.remove/write_row_groups, which are rare operations and must make sure to add the i32 fields to the object dictionaries before writing.
>>> pf = fastparquet.ParquetFile("test-data/root/DY1JetsToLL_M-50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8.root.parquet")
>>> %timeit pf = fastparquet.ParquetFile("test-data/root/DY1JetsToLL_M-50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8.root.parquet")
# this PR
27.8 ms ± 240 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# main
34.4 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
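For reference, the relative change implied by these two timings can be worked out as below. This is only arithmetic on the numbers quoted above for this one file, not an additional measurement; the variable names are illustrative.

```python
# Timings quoted above, in milliseconds (mean of 7 runs, 10 loops each).
main_ms = 34.4  # main branch
pr_ms = 27.8    # this PR

# Fraction of the original open time saved by this PR.
reduction = (main_ms - pr_ms) / main_ms

# How many times faster opening the file is with this PR.
speedup = main_ms / pr_ms

print(f"time reduction: {reduction:.1%}")  # about 19% for this file
print(f"speedup factor: {speedup:.2f}x")   # about 1.24x
```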