jonashaag opened this issue 1 year ago
STATUS: This is done for lightgbm (#15), and for sklearn we're not doing it (#19)
We could try Parquet for storing the arrays. It has great support for sparse arrays (lots of NaNs, maybe even lots of arbitrary identical values)
I think Parquet is also pretty smart about using the smallest possible integer type on disk.
Also, in Parquet repeated values are essentially free because of Run Length Encoding.
We should be able to embed Parquet data into the pickle file.
STATUS: We don't need this for lightgbm since it uses Parquet, and the sklearn code currently has no boolean arrays.
We can use NumPy's `packbits` functionality to represent boolean arrays as bitmaps (Parquet will do this by default).
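A sketch of the bitmap round trip with `np.packbits`/`np.unpackbits`; the array size is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mask = rng.random(1000) < 0.5          # 1000 booleans, 1000 bytes as bool array

packed = np.packbits(mask)             # 8 booleans per byte -> 125 bytes
restored = np.unpackbits(packed, count=mask.size).astype(bool)
```

`count=` is needed on the way back because `packbits` pads the last byte with zeros when the length is not a multiple of 8.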
> We could try Parquet for storing the arrays.
You mean storing the whole model (for example, all 300 trees) in one large Parquet table, right?
That comes in at around 10 MB without compression, and 6.7 MB with `compression='gzip'` enabled, for a LightGBM model file that is initially 20 MB. This is without any further optimization, not even string parsing etc.
In a real-world model I just benchmarked, ALL of the `children_left` arrays look like this:
`[1, 2, 3, ..., 42, -1, -1, ..., N]`
i.e. each is equivalent to `range(1, N+1)` with some `-1` entries for the leaves.
If we replace the `-1` with some more efficient representation, we can save ~10% of final size.
Examples of more efficient representations:
- Store the `-1` positions separately (e.g. as a bitmap of leaf positions).
- Replace each `-1` with the previous value, e.g. `[1, 2, 3, ..., 42, 42, 42, ..., N]`; this should help with compression because it doesn't destroy the pattern as much.

STATUS: Parquet seems to handle this just fine, not sure about lzma
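The "replace `-1` with the previous value" idea can be sketched as a NumPy forward-fill plus a bitmap of the original `-1` positions so the transform stays reversible. `encode`/`decode` are hypothetical helper names, and the sketch assumes the first entry is not a `-1` (true for any tree with more than one node):

```python
import numpy as np

def encode(children, sentinel=-1):
    mask = children == sentinel
    # Forward-fill: each sentinel entry takes the value of the last
    # preceding non-sentinel entry, preserving the near-monotone pattern.
    idx = np.where(~mask, np.arange(children.size), 0)
    np.maximum.accumulate(idx, out=idx)
    return children[idx], mask

def decode(filled, mask, sentinel=-1):
    out = filled.copy()
    out[mask] = sentinel                # restore the original -1 entries
    return out

children = np.array([1, 2, 3, 4, -1, -1, 5, -1])
filled, mask = encode(children)
# filled is now [1, 2, 3, 4, 4, 4, 5, 5] -- no -1 breaking the run
```

The bitmap itself is cheap to store (e.g. via `np.packbits`, or implicitly via Parquet's null bitmap).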
We found a lot of values like `1e-35` in the lgbm data. Are they actually NaN? If so, we could replace them with NaN and profit from Parquet's bitset-based NaN representation.
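If those near-zero values really do encode "missing", the replacement is a one-liner. The `1e-30` threshold below is a made-up cutoff for illustration, not something measured, and this is only safe if no legitimate value can fall below it:

```python
import numpy as np

values = np.array([0.25, 1e-35, -3.5, 1e-35, 0.125])
# Treat magnitudes below the (hypothetical) threshold as missing.
cleaned = np.where(np.abs(values) < 1e-30, np.nan, values)
```

Parquet then stores the NaN positions in its null bitmap instead of spending full float width on each one.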
Combine sklearn trees into a single array to profit from potentially better Parquet compression. Eg. if your random forest has 100 trees, concat each of the 100 tree arrays, like we do with lightgbm.
This might not give much reduction if the trees are large enough and the forests small enough, though. We can easily check manually.
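A sketch of the concat-and-split round trip, with small plain arrays standing in for the per-tree `children_left` arrays of a forest:

```python
import numpy as np

# Stand-ins for the children_left arrays of three trees in a forest.
trees = [
    np.array([1, 2, -1, -1, -1]),
    np.array([1, -1, -1]),
    np.array([1, 2, 3, -1, -1, -1, -1]),
]

lengths = np.array([t.size for t in trees])
combined = np.concatenate(trees)     # one array for the whole forest

# Reverse: split at the cumulative lengths to recover each tree.
restored = np.split(combined, np.cumsum(lengths)[:-1])
```

Only the small `lengths` array needs to be stored alongside the combined column to make the transform reversible.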
I’ll use this to brain dump a few ideas. Maybe some of them are useful.