BayesWitnesses / m2cgen

Transform ML models into native code (Java, C, Python, Go, JavaScript, Visual Basic, C#, R, PowerShell, PHP, Dart, Haskell, Ruby, F#, Rust) with zero dependencies

Reduce RAM and ROM footprint #88

Open skjerns opened 5 years ago

skjerns commented 5 years ago

I'm using m2cgen to convert a classifier to C. It works great and the results are consistent, thanks for the library!

  1. I have the problem that the compiled binaries are too large to fit on my embedded device. I checked, and the binaries are around double the size of the binaries created with e.g. sklearn_porter. However, m2cgen is the only library that can convert my Python classifiers to C without introducing errors into the classification.
  2. Even if I reduce the size of the classifier, I run into the problem that the RAM of the device is exceeded (think of something in the kB range).

Do you have any idea how the footprint of the C code could be reduced?

izeigerman commented 5 years ago

Hi @skjerns ! Thanks for the feedback, I'm very glad to hear that you find m2cgen useful! Can you please provide a bit more detail, like:

  1. What kind of model do you use (algorithm-wise)? With what parameters?
  2. How large is your input vector? (How many features are there?)
  3. What binary size are we talking about here?
skjerns commented 5 years ago
  1. For one test I'm using a RandomForest
  2. I'm using 8 features and a variable number of inputs
  3. I think something around 600kB vs 300kB for sklearn_porter. The RAM requirements make the device crash, so I can't measure them.
izeigerman commented 5 years ago

Hey @skjerns . Thanks for the update. Random Forest can indeed be pretty huge sometimes. How many estimators did you end up having? What was the maximum depth of an individual estimator? I'd like to try to reproduce this on my end to better understand what can be improved here.

skjerns commented 5 years ago

Take, for instance, this code:

# -*- coding: utf-8 -*-
import os
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import joblib
import m2cgen
import sklearn_porter
import subprocess

train_x = np.random.rand(10000, 8)
train_y = np.random.randint(0, 4, 10000)

rfc = RandomForestClassifier(n_estimators=10, max_depth=10)
rfc.fit(train_x, train_y)
joblib.dump(rfc, 'rfc.pkl')

# transfer code
code1 = m2cgen.export_to_c(rfc)
code1 += '\nint main(int argc, const char * argv[]) {return 0;}'
with open('rfc_m2cgen.c', 'w') as f:
    f.write(code1)

porter = sklearn_porter.Porter(rfc, language='c')
code2  = porter.export(embed_data=True)
with open('rfc_porter.c', 'w') as f:
    f.write(code2)

# now compile the two generated files
# (passing the arguments as a list works on both Windows and POSIX)
subprocess.call(['gcc', 'rfc_m2cgen.c', '-o', 'rfc_m2cgen.exe'])
subprocess.call(['gcc', 'rfc_porter.c', '-o', 'rfc_porter.exe'])

print('m2cgen: {} kB'.format(os.path.getsize('rfc_m2cgen.exe')//1024))
print('porter: {} kB'.format(os.path.getsize('rfc_porter.exe')//1024))

#m2cgen: 370 kB
#porter: 152 kB

The compiled m2cgen binary is roughly twice the size. This also holds true when compiling for other architectures. Similarly, the RAM footprint is much higher, but I have no easy way of measuring it.

Do you know if there are any optimizations possible to reduce this?

skjerns commented 5 years ago

Might this be due to the excessive use of memcpy? Could that be what blows up the code size and execution time?

izeigerman commented 5 years ago

@skjerns I'd say that the difference in binary size is explained by the fact that m2cgen and sklearn-porter took quite different approaches to code generation.

m2cgen encodes the entire model into the code itself. It doesn't rely on any data structures or language constructs other than the if statement. All model coefficients are encoded in place, where they are needed, as plain literals. This approach has its pros and cons. By using only the most simplistic language constructs we can add support for new models and languages extremely fast. Once a model's AST is described, all languages automatically get support for this model without any extra effort. Similarly, once support for some language is implemented, that language automatically gains support for all available models. Adding a new model is as easy as converting it into a simplistic AST which represents a sequence of calculations, without caring about the data aspect. Of course this benefit comes at a cost: the generated code is not very readable and is usually pretty large in size. E.g. in places where we could just use a for loop, we expand all iterations instead.
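To make the contrast concrete, here is a minimal hand-written sketch, assuming a tiny two-level tree, of what this inlining style looks like in C. The function name score_tree, the feature indices, and all literal thresholds and leaf values are made up for illustration; this is not actual m2cgen output:

/* Illustrative sketch only -- not real m2cgen output.
 * A two-level decision tree fully unrolled into nested if statements;
 * every threshold and leaf score is inlined as a plain literal. */
double score_tree(const double *input) {
    if (input[3] <= 0.5274) {
        if (input[0] <= 0.1189) {
            return 0.0312;
        } else {
            return 0.4821;
        }
    } else {
        if (input[7] <= 0.9016) {
            return 0.7543;
        } else {
            return 0.9678;
        }
    }
}

With n_estimators=10 and max_depth=10, as in the script above, each tree can expand into on the order of a thousand such branches, which is roughly where the extra code size comes from.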

sklearn-porter does quite the opposite. It carefully describes the generation of each model using manually written templates for each supported language. Model coefficients are stored in language-specific collections, and all calculations are implemented manually as well. During the code generation phase it just injects the model parameters into those data collections for each language individually. This obviously leads to smaller and much more readable code, since it has essentially been written by a human. This approach, however, requires a tremendous effort when it comes to adding new models or languages. The cost of maintaining this functionality is pretty high as well. I believe this is partly the reason why the list of models supported by sklearn-porter is quite limited.
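For comparison, a sketch in the same hypothetical spirit (again, not sklearn-porter's actual output) of the template/data-driven style: the same tiny tree stored as parallel arrays and traversed by one short generic loop, so growing the model mostly grows the data tables rather than the executable code:

/* Illustrative sketch only -- not real sklearn-porter output.
 * The same tiny tree stored as data; a generic loop walks it,
 * so the code stays fixed while only the arrays grow with the model. */
static const int    feature[]   = { 3,      0,      7,     -1,     -1,     -1,     -1 };
static const double threshold[] = { 0.5274, 0.1189, 0.9016, 0.0,    0.0,    0.0,    0.0 };
static const int    left[]      = { 1,      3,      5,     -1,     -1,     -1,     -1 };
static const int    right[]     = { 2,      4,      6,     -1,     -1,     -1,     -1 };
static const double value[]     = { 0.0,    0.0,    0.0,    0.0312, 0.4821, 0.7543, 0.9678 };

double score_tree(const double *input) {
    int node = 0;
    while (feature[node] >= 0) {          /* a feature index of -1 marks a leaf */
        node = (input[feature[node]] <= threshold[node]) ? left[node] : right[node];
    }
    return value[node];
}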

So far I don't have any good ideas on how to reduce the size of the generated code while avoiding language-specific manual effort and keeping all the benefits I described above. However, I haven't given up yet and am still working on this 😃

skjerns commented 5 years ago

@izeigerman thanks for the extensive explanation!

I do see your point about going with a different approach, and I think your approach definitely has advantages. Let me know if you have any insights :)

codeyp commented 5 years ago

OK, will do

beojan commented 5 years ago

Perhaps you could add loops to this, since all three supported languages have loops?
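As a rough sketch of that idea, assuming hypothetical per-tree functions score_tree_0 and score_tree_1 (stubbed below; nothing like this is emitted by m2cgen today), the per-estimator aggregation could become a fixed loop over a function table instead of an unrolled sequence of calls:

/* Rough sketch of the loop idea only -- not how m2cgen currently generates code.
 * Hypothetical per-tree scorers; real ones would contain the full trees. */
static double score_tree_0(const double *input) { return input[0] <= 0.5 ? 0.1 : 0.9; }
static double score_tree_1(const double *input) { return input[3] <= 0.2 ? 0.3 : 0.7; }

typedef double (*tree_fn)(const double *input);
static const tree_fn trees[] = { score_tree_0, score_tree_1 };

/* Average the estimator outputs with a loop instead of unrolled additions. */
double score(const double *input) {
    double sum = 0.0;
    for (unsigned i = 0; i < sizeof(trees) / sizeof(trees[0]); ++i) {
        sum += trees[i](input);
    }
    return sum / (sizeof(trees) / sizeof(trees[0]));
}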