dmlc / tl2cgen

TL2cgen (TreeLite 2 C GENerator) is a model compiler for decision tree models
https://tl2cgen.readthedocs.io/en/latest/
Apache License 2.0

Why does it seem that tl2cgen is slower than treelite_runtime? #18

Closed jinluyang closed 7 months ago

jinluyang commented 8 months ago

Environment: 1 CPU core, 1 GB memory, CentOS 7, Python 3.8, treelite==3.9.0, treelite_runtime==3.9.0, tl2cgen==0.3.1

The code to reproduce is:

import treelite
import treelite_runtime
import tl2cgen
import numpy as np
import time

builder = treelite.ModelBuilder(num_feature=3)
tree = treelite.ModelBuilder.Tree()
tree[0].set_numerical_test_node(
        feature_id=0,
        opname="<",
        threshold=5.0,
        default_left=True,
        left_child_key=1,
        right_child_key=2
        )
tree[1].set_leaf_node(0.6)
tree[2].set_leaf_node(-0.4)
tree[0].set_root()
builder.append(tree)
model = builder.commit()  # Obtain treelite.Model object
input = np.random.rand(1,2)

def test_rt(cnt):
    so_name = "tmp.so"
    model.export_lib(toolchain='gcc',
            libpath=so_name, verbose=True, params={'parallel_comp': 1})
    p = treelite_runtime.Predictor(so_name)
    dmat = treelite_runtime.DMatrix(input)
    start = time.time()
    for _ in range(cnt):
        ret = p.predict(dmat)
    end = time.time()
    print(f'time consumed for {cnt} treelite_runtime predictions : {end-start}')
    print(ret)

def test_tl2cgen(cnt):
    so_name = "tmp1.so"
    tl2cgen.export_lib(model, toolchain='gcc',
            libpath=so_name, verbose=True, params={'parallel_comp': 1})
    p = tl2cgen.Predictor(so_name)
    dmat = tl2cgen.DMatrix(input)
    start = time.time()
    for _ in range(cnt):
        ret = p.predict(dmat)
    end = time.time()
    print(f'time consumed for {cnt} tl2cgen predictions : {end-start}')
    print(ret)
test_rt(1000)
test_tl2cgen(1000)

It takes 0.012 s for treelite_runtime and 0.021 s for tl2cgen, so tl2cgen takes nearly twice as long. I can see a deprecation warning saying I should use tl2cgen.Predictor rather than treelite_runtime, but it is slower. Any suggestions?
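A single timed loop of 1000 calls at roughly 10 to 20 microseconds per call can be noisy, so the comparison may be easier to trust with warm-up calls and repeated runs. The following is a minimal sketch of such a harness; `time_calls` and its parameters are hypothetical names introduced here, not part of Treelite or TL2cgen.

```python
import statistics
import time


def time_calls(fn, n_calls=1000, n_repeats=5, n_warmup=100):
    """Return per-call latency statistics (in seconds) for a zero-argument callable."""
    # Warm-up calls are excluded from timing (library loading, caches, thread start-up).
    for _ in range(n_warmup):
        fn()
    per_call = []
    for _ in range(n_repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        for _ in range(n_calls):
            fn()
        per_call.append((time.perf_counter() - start) / n_calls)
    return {
        "min": min(per_call),
        "median": statistics.median(per_call),
        "max": max(per_call),
    }
```

With the script above, it could be called as `time_calls(lambda: p.predict(dmat))` once for each runtime, reusing the `p` and `dmat` objects created in `test_rt` and `test_tl2cgen`.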

jinluyang commented 8 months ago

Treelite is the fastest engine for forests on CPU as far as I know. (Does anyone have a better solution?) So it is a pity that @hcho3 no longer maintains it. @khchenTW Could you take a look if you have time?

khchenTW commented 8 months ago

Hi Jinlu, your finding is interesting. Without diffing the underlying implementation, I am also in the dark. I am currently on vacation -- I will take a look when I am back, or @hcho3 may answer the thread later.

hcho3 commented 7 months ago

@jinluyang Can you inspect the generated C code from Treelite and TL2cgen? You can obtain the C code by running model.compile(...) and tl2cgen.generate_c_code(model, ...).
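One way to carry out this inspection is to write both sets of generated sources to separate directories and diff them file by file. The sketch below uses only the Python standard library; `diff_generated_sources` is a hypothetical helper, and the assumption that both tools emit `.c` and `.h` files with matching names may not hold for every model.

```python
import difflib
from pathlib import Path


def diff_generated_sources(dir_a, dir_b, suffixes=(".c", ".h")):
    """Return {file name: diff lines} for source files that differ between two directories."""
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    files_a = {p.name for p in dir_a.iterdir() if p.is_file() and p.suffix in suffixes}
    files_b = {p.name for p in dir_b.iterdir() if p.is_file() and p.suffix in suffixes}
    report = {}
    for name in sorted(files_a | files_b):
        if name not in files_b:
            report[name] = [f"only in {dir_a}"]
        elif name not in files_a:
            report[name] = [f"only in {dir_b}"]
        else:
            lines_a = (dir_a / name).read_text().splitlines()
            lines_b = (dir_b / name).read_text().splitlines()
            diff = list(
                difflib.unified_diff(
                    lines_a,
                    lines_b,
                    fromfile=str(dir_a / name),
                    tofile=str(dir_b / name),
                    lineterm="",
                )
            )
            if diff:  # identical files are left out of the report
                report[name] = diff
    return report
```

The two directories would be the output locations passed to `model.compile(...)` and `tl2cgen.generate_c_code(model, ...)`; the exact arguments of those calls are not shown in this thread, so check the documentation for your installed versions.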

jinluyang commented 7 months ago

> @jinluyang Can you inspect the generated C code from Treelite and TL2cgen? You can obtain the C code by running model.compile(...) and tl2cgen.generate_c_code(model, ...).

@hcho3 I found that the difference is in main.c: Treelite declares pred_transform as "inline" while tl2cgen does not. [image] I think this might not be the reason, though, because tl2cgen can load the .so library exported by Treelite but still runs it as slowly as before. So I guess the difference is due to something at runtime.

hcho3 commented 7 months ago

Have you tried exporting the .so file from tl2cgen and then using it from treelite_runtime 3.9? How is the performance?

jinluyang commented 7 months ago

> Have you tried exporting the .so file from tl2cgen and then using it from treelite_runtime 3.9? How is the performance?

Yes, I just tried. The performance of each runtime is the same as before: tl2cgen is slower. @hcho3

hcho3 commented 7 months ago

So it appears that tl2cgen’s runtime is slower than treelite_runtime?

jinluyang commented 7 months ago

> So it appears that tl2cgen’s runtime is slower than treelite_runtime?

I think so

hcho3 commented 7 months ago

The old Treelite runtime uses a custom implementation of thread pools, whereas tl2cgen’s runtime uses OpenMP. The custom thread pool is probably better for performance, but I decided to replace it with OpenMP for two reasons.

hcho3 commented 7 months ago

I will close the issue for now. We can open a new issue if we think a custom thread pool is merited.

leandroleal commented 3 months ago

This might be related to some parallelization scheme implemented in OpenMP. If I use the Treelite runtime, my prediction with RF (60 trees) takes advantage of all cores, while tl2cgen’s runtime uses only half of the cores.

Treelite runtime

image

tl2cgen’s runtime

image
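Before comparing core utilization, it may help to record how many CPUs the process can actually see and whether an OpenMP thread limit is set in the environment. The sketch below is one way to gather that; `visible_cpu_info` is a hypothetical helper, and it does not by itself explain the difference shown in the screenshots.

```python
import os


def visible_cpu_info():
    """Collect the CPU counts and OpenMP thread setting visible to this process."""
    info = {
        # Logical CPUs reported by the OS (includes hyper-threads).
        "logical_cpus": os.cpu_count(),
        # Standard OpenMP environment variable; None if unset.
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
    }
    # CPU affinity is only available on some platforms (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        info["affinity_cpus"] = len(os.sched_getaffinity(0))
    else:
        info["affinity_cpus"] = None
    return info


if __name__ == "__main__":
    print(visible_cpu_info())
```

If the installed tl2cgen version lets the `Predictor` take an explicit thread count (an assumption to verify against its documentation), setting it to the reported number of logical CPUs would be one way to test whether the default thread count is the cause.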

AF079 commented 2 weeks ago

> > So it appears that tl2cgen’s runtime is slower than treelite_runtime?
>
> I think so

I did some extensive testing on the runtimes of TL2cgen and Treelite (using sklearn datasets), along with moving Treelite's thread pool to TL2cgen and moving the OpenMP implementation to Treelite. It seems that for small instances like the one you showed Treelite is indeed faster. Why this is the case is unknown. However, if you make a random forest and play around with the number of trees and vary the sample size of the data set, you'll find that TL2cgen and Treelite have very little difference in performance. Fourie_BA_EEMCS.pdf
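A pattern where small inputs are slower but large inputs perform alike would be consistent with a fixed per-call cost, although the thread does not establish that as the cause. One way to probe it is to time predictions at several batch sizes and fit latency ≈ fixed + per_row × n_rows. The sketch below is a minimal version under that assumption; `fit_overhead`, `least_squares_line`, `predict`, and `make_batch` are hypothetical names, not library APIs.

```python
import time


def least_squares_line(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_overhead(predict, make_batch, batch_sizes, n_calls=50):
    """Estimate (fixed cost per call, cost per row) in seconds for a predict function."""
    latencies = []
    for n_rows in batch_sizes:
        batch = make_batch(n_rows)
        predict(batch)  # warm-up call, not timed
        start = time.perf_counter()
        for _ in range(n_calls):
            predict(batch)
        latencies.append((time.perf_counter() - start) / n_calls)
    return least_squares_line(list(batch_sizes), latencies)
```

Here `predict` would wrap `p.predict` for one runtime and `make_batch` would build a `DMatrix` with the requested number of rows. Comparing the fitted intercepts of the two runtimes shows how much of the gap is independent of input size.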