Closed jinluyang closed 7 months ago
Treelite is the fastest engine for forest inference on CPU as far as I know. (Does anyone have a better solution?) So it is a pity that @hcho3 no longer maintains it. @khchenTW Could you take a look if you have time?
Hi Jinlu, your finding is interesting. Without diffing the underlying implementation, I am also in the dark. I am currently on vacation -- I will take a look when I am back, or @hcho3 may answer the thread later.
@jinluyang Can you inspect the generated C code from Treelite and TL2cgen? You can obtain the C code by running model.compile(...) and tl2cgen.generate_c_code(model, ...).
@hcho3 I found that the difference is in main.c: Treelite declares pred_transform as "inline", while tl2cgen does not.
I don't think this is the reason, though, because tl2cgen can load the .so library exported by Treelite and still runs it as slowly as before. So I guess the difference comes from something in the runtime.
Have you tried exporting the .so file from tl2cgen and then using it from treelite_runtime 3.9? How is the performance?
Yes, I just tried. The performance is the same as before: tl2cgen is slower. @hcho3
So it appears that tl2cgen's runtime is slower than treelite_runtime?
I think so
The old Treelite runtime uses a custom implementation of thread pools, whereas tl2cgen’s runtime uses OpenMP. The custom thread pool is probably better for performance, but I decided to replace it with OpenMP for two reasons.
I will close the issue for now. We can open a new issue if we think a custom thread pool is merited.
This might be related to some parallelization scheme implemented in OpenMP. If I use the Treelite runtime, my prediction with an RF (60 trees) takes advantage of all cores, while tl2cgen's runtime uses only half of the cores.
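One way to narrow this down is to check how many cores the process is actually allowed to use and whether OMP_NUM_THREADS is set, since OpenMP-based runtimes generally honor that variable. The snippet below is only a diagnostic sketch using the Python standard library. describe_cpu_budget is a hypothetical helper name, and it does not by itself explain why only half of the cores are used.

```python
import os


def describe_cpu_budget():
    """Report how many cores this process may use and what OpenMP is told."""
    info = {
        # Logical cores visible to the OS (may be None on some platforms).
        "logical_cores": os.cpu_count(),
        # Standard OpenMP environment variable; None means it is unset and
        # the OpenMP runtime picks its own default thread count.
        "omp_num_threads": os.environ.get("OMP_NUM_THREADS"),
    }
    # sched_getaffinity is only available on some platforms (e.g. Linux).
    # It reflects CPU pinning or container limits on this process.
    if hasattr(os, "sched_getaffinity"):
        info["usable_cores"] = len(os.sched_getaffinity(0))
    else:
        info["usable_cores"] = info["logical_cores"]
    return info


if __name__ == "__main__":
    print(describe_cpu_budget())
```

If OMP_NUM_THREADS needs changing, it generally has to be set before the process loads the OpenMP runtime, so export it in the shell or set it before importing the prediction library.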
I did some extensive testing on the runtimes of TL2cgen and Treelite (using sklearn datasets), including moving Treelite's thread pool into TL2cgen and moving the OpenMP implementation into Treelite. It seems that for small instances like the one you showed, Treelite is indeed faster. Why this is the case is unknown. However, if you build a random forest and vary the number of trees and the sample size of the data set, you'll find that TL2cgen and Treelite differ very little in performance. Fourie_BA_EEMCS.pdf
Environment: 1 CPU core, 1 GB memory, CentOS 7, Python 3.8, treelite==3.9.0, treelite_runtime==3.9.0, tl2cgen==0.3.1
The code to reproduce is:
It takes 0.012 s with treelite_runtime and 0.021 s with tl2cgen, so tl2cgen is nearly twice as slow. I see a deprecation warning saying I should use tl2cgen.Predictor rather than treelite_runtime, but it is slower. Any suggestions?
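For anyone wanting to check numbers like these on their own machine, here is a minimal timing sketch. It is not the reproduction code from this report: time_call and compare are hypothetical helper names, and the two workloads in the demo are stand-ins for closures around each runtime's predict call. Taking the median over many repeats after a warm-up helps for calls this short, where a single measurement is noisy.

```python
import statistics
import time


def time_call(fn, *, repeats=200, warmup=10):
    """Return the median wall-clock seconds of fn() over `repeats` runs."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization before measuring
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(candidates, **kwargs):
    """Time each named callable; return {name: (median_seconds, ratio_to_fastest)}."""
    medians = {name: time_call(fn, **kwargs) for name, fn in candidates.items()}
    fastest = min(medians.values())
    return {name: (t, t / fastest) for name, t in medians.items()}


if __name__ == "__main__":
    # Stand-in workloads (assumption): replace each lambda with a closure
    # that calls the predict method of the runtime being measured.
    results = compare({
        "small": lambda: sum(range(1_000)),
        "large": lambda: sum(range(20_000)),
    })
    for name, (seconds, ratio) in results.items():
        print(f"{name}: {seconds:.6f} s ({ratio:.2f}x)")
```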