Quantco / slim-trees

Pickle your ML models more efficiently for deployment 🚀

Jonas' optimization ideas #7

Open · jonashaag opened this issue 1 year ago

jonashaag commented 1 year ago

I’ll use this to brain dump a few ideas. Maybe some of them are useful.

jonashaag commented 1 year ago

STATUS: This is done for lightgbm (#15), and for sklearn we're not doing it (#19)

We could try Parquet for storing the arrays. It has great support for sparse arrays (lots of NaNs, maybe even lots of arbitrary identical values).

I think Parquet is also pretty smart about using the smallest possible integer type on disk.

Also, in Parquet repeated values are essentially free because of Run Length Encoding.

We should be able to embed Parquet data into the pickle file.
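A minimal sketch of the embedding, assuming pyarrow; dump_array / load_array are hypothetical helper names, not slim-trees API:

```python
import io
import pickle

import pyarrow as pa
import pyarrow.parquet as pq

def dump_array(arr, file):
    # Write the array as a one-column Parquet table into an in-memory
    # buffer, then pickle the raw Parquet bytes.
    buf = io.BytesIO()
    pq.write_table(pa.table({"values": pa.array(arr)}), buf)
    pickle.dump(buf.getvalue(), file)

def load_array(file):
    # Unpickle the Parquet bytes and read the column back out.
    buf = io.BytesIO(pickle.load(file))
    return pq.read_table(buf)["values"].to_numpy()
```

One way to wire this into pickling proper is a custom __reduce__ on the model wrapper, so everything still lands in a single pickle file.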

jonashaag commented 1 year ago

STATUS: We don't need this for lightgbm since it uses Parquet, and the sklearn code currently has no boolean arrays.

We can use NumPy's packbits functionality to represent boolean arrays as bitmaps (Parquet will do this by default)
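For reference, a minimal sketch with np.packbits / np.unpackbits:

```python
import numpy as np

bits = np.array([True, False, True, True, False, False, True, False])
packed = np.packbits(bits)  # one byte per 8 booleans instead of one byte each
restored = np.unpackbits(packed, count=len(bits)).astype(bool)
assert np.array_equal(bits, restored)
```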

YYYasin19 commented 1 year ago

> We could try Parquet for storing the arrays.

You mean storing the whole model (for example, all 300 trees) in one large Parquet table, right? That seems to come in at around 10 MB without compression, and 6.7 MB with compression='gzip' enabled, for a LightGBM model file that is initially 20 MB. This is without any further optimization, not even string parsing etc.
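For reference, a sketch of those two writes, where model_table is assumed to be a pyarrow Table holding the flattened trees:

```python
import pyarrow.parquet as pq

# Uncompressed vs. gzip-compressed writes of the same table.
pq.write_table(model_table, "model.parquet", compression="none")
pq.write_table(model_table, "model_gzip.parquet", compression="gzip")
```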

jonashaag commented 1 year ago

In a real-world model I just benchmarked, ALL children_left arrays look like this:

[1, 2, 3, ..., 42, -1, -1, ..., N]

i.e. it is equivalent to range(1, N+1) with some -1 entries for the leaves.

If we replace the -1 entries with a more efficient representation, we can save ~10% of the final size.

Examples of more efficient representations:
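One possibility, as a sketch with hypothetical helper names: keep only a bitmap of the leaf positions and rebuild the running range on load. This assumes children_left[i] == i + 1 for every internal node, which is one reading of the example above.

```python
import numpy as np

def pack_children_left(children_left):
    # Assumption: children_left[i] == i + 1 for internal nodes, -1 for
    # leaves. Then the leaf bitmap plus the length is all we must store.
    leaf_mask = children_left == -1
    return np.packbits(leaf_mask), len(children_left)

def unpack_children_left(packed_mask, n):
    leaf_mask = np.unpackbits(packed_mask, count=n).astype(bool)
    children_left = np.arange(1, n + 1, dtype=np.int64)
    children_left[leaf_mask] = -1
    return children_left
```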

jonashaag commented 1 year ago

STATUS: Parquet seems to handle this just fine, not sure about lzma

We found a lot of values like 1e-35 in the LightGBM data. Are they NaN? If so, we could replace them with NaN and profit from Parquet's bitset-based NaN representation.
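If they really are placeholders, the replacement is a small pass over the array. A sketch, where the cutoff is a made-up threshold; this changes model output if the values are meaningful, so it needs verification first:

```python
import numpy as np

def sentinels_to_nan(values, threshold=1e-30):
    # Hypothetical cleanup: treat tiny non-zero magnitudes (e.g. 1e-35)
    # as missing so Parquet can store them via its null bitmap.
    # Exact zeros are left alone.
    values = values.astype(np.float64, copy=True)
    values[(values != 0) & (np.abs(values) < threshold)] = np.nan
    return values
```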

jonashaag commented 1 year ago

Combine sklearn trees into a single array to profit from potentially better Parquet compression. E.g., if your random forest has 100 trees, concatenate the 100 per-tree arrays, like we do with lightgbm.

This might not give much of a reduction if the trees are large enough and the forests small enough, though. We can easily check manually.
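A sketch of the concatenation for a fitted sklearn forest, assuming pyarrow; the columns are illustrative, not the full set needed to rebuild the trees:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

def forest_to_parquet(forest, path):
    # Concatenate the per-tree node arrays of a fitted RandomForest into
    # single columns; tree_id keeps the boundaries so the forest can be
    # split apart again on load.
    trees = [est.tree_ for est in forest.estimators_]
    sizes = [t.node_count for t in trees]
    table = pa.table({
        "tree_id": np.repeat(np.arange(len(trees)), sizes),
        "children_left": np.concatenate([t.children_left for t in trees]),
        "children_right": np.concatenate([t.children_right for t in trees]),
        "feature": np.concatenate([t.feature for t in trees]),
        "threshold": np.concatenate([t.threshold for t in trees]),
    })
    pq.write_table(table, path)
```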

jonashaag commented 11 months ago

Use Pseudodecimal Encoding from btrblocks
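btrblocks itself is C++; as a rough Python sketch of the idea only (not its API): pseudodecimal encoding stores a float as an integer significand plus a small decimal exponent whenever that pair round-trips exactly, and such integer pairs compress much better than raw doubles.

```python
def to_pseudodecimal(value, max_exponent=5):
    # Try to find (digits, exponent) with digits / 10**exponent == value
    # exactly. Returns None for values that need the raw-float fallback.
    for exponent in range(max_exponent + 1):
        digits = round(value * 10**exponent)
        if digits / 10**exponent == value:
            return digits, exponent
    return None

assert to_pseudodecimal(0.25) == (25, 2)
```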