dmlc / treelite

Universal model exchange and serialization format for decision tree forests
https://treelite.readthedocs.io/en/latest/
Apache License 2.0
738 stars 100 forks

Single predictions 5x slower than xgboost #187

Closed vedranf closed 4 years ago

vedranf commented 4 years ago

Hello,

I just discovered the treelite project and wanted to do a quick prediction test against default xgboost. I followed the quick start document and that worked flawlessly. Now on to prediction:

Default xgboost:

In [293]: xgb_clf10.predict_proba(d)
Out[293]: array([[9.9999821e-01, 1.7704634e-06]], dtype=float32)

In [294]: %timeit xgb_clf10.predict_proba(d)
223 µs ± 503 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Now treelite:

In [295]: predictor.predict_instance(tdata[1])
Out[295]: array(1.7704634e-06, dtype=float32)

In [296]: %timeit predictor.predict_instance(tdata[1])
1.01 ms ± 2.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I cannot use batch prediction, only one instance at a time. Both runs use a single thread. The shared library was generated with:

model.export_lib(toolchain='gcc', libpath='./xgb_clf10.so', verbose=True)

using gcc version 6.3.0 on Debian. Let me know if you need more info and whether the result above is expected.
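For reference, the full flow from the quick start can be sketched as below. This is only a sketch: it assumes treelite and treelite_runtime are installed, and that for a sklearn-style wrapper like xgb_clf10 you pass the underlying booster (e.g. xgb_clf10.get_booster()).

```python
def build_predictor(booster, libpath="./xgb_clf10.so"):
    """Sketch of the quick-start flow: import an xgboost model into
    treelite, compile it to a shared library, and load a Predictor.
    Assumes treelite / treelite_runtime are importable."""
    import treelite
    import treelite_runtime

    # Import the trained xgboost booster into treelite's representation
    model = treelite.Model.from_xgboost(booster)
    # Compile the trees into a shared library with gcc
    model.export_lib(toolchain="gcc", libpath=libpath, verbose=True)
    # Load the compiled library for prediction
    return treelite_runtime.Predictor(libpath, verbose=True)
```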

Regards, Vedran

hcho3 commented 4 years ago

The single-instance prediction feature is experimental, and so far I haven't figured out how to optimize its performance. If you have any suggestions, feel free to post them here.

hcho3 commented 4 years ago

You might want to call the C function directly instead of going through the Python wrapper treelite_runtime. Take a look at https://treelite.readthedocs.io/en/latest/tutorials/deploy.html#option-2-deploy-prediciton-code-only
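For what it's worth, a minimal ctypes binding against the exported shared library might look like the sketch below. The Entry union layout and the `float predict(union Entry*, int pred_margin)` signature are assumptions taken from the deploy tutorial's generated header; verify them against the header treelite emits for your model.

```python
import ctypes


class Entry(ctypes.Union):
    # Layout assumed from the deploy tutorial's generated header:
    # union Entry { int missing; float fvalue; };
    _fields_ = [("missing", ctypes.c_int), ("fvalue", ctypes.c_float)]


def load_predict_fn(libpath="./xgb_clf10.so"):
    """Bind the exported C predict() symbol; signature assumed to be
    float predict(union Entry*, int pred_margin)."""
    lib = ctypes.CDLL(libpath)
    lib.predict.restype = ctypes.c_float
    lib.predict.argtypes = [ctypes.POINTER(Entry), ctypes.c_int]
    return lib.predict


def predict_single(predict_fn, row):
    """row: 1-D sequence of feature values with no missing entries."""
    entries = (Entry * len(row))()
    for i, v in enumerate(row):
        entries[i].fvalue = float(v)
    return predict_fn(entries, 0)
```

Calling the compiled predict() this way skips the Python-side argument handling in treelite_runtime, which is where the overhead appears to be.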

vedranf commented 4 years ago

Hello,

I gave C a quick try using that doc, manually populating the "inst" array with the values of a single row. Prediction takes ~35-40 microseconds, which is very good. As for the Python slowness, I looked at the line_profiler output for the predict_instance function; these snippets/loops take most of the time:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   335       243        293.0      1.2      9.7          for i in range(self.num_feature_):
   336       242        364.0      1.5     12.1              entry[i].missing = -1
...
   354         1          1.0      1.0      0.0              if missing is None or np.isnan(missing):
   355       243        306.0      1.3     10.1                  for i in range(inst.shape[0]):
   356       242       1277.0      5.3     42.4                      if not np.isnan(inst[i]):
   357       242        572.0      2.4     19.0                          entry[i].fvalue = inst[i]

The actual prediction takes <5% of the time. One suggestion would be to add an option to skip the check for missing/NaN values altogether when the input is already known not to contain any (in my case a preprocessing step ensures there are none), so there is no need to check again. Another suggestion is to avoid pure Python loops when copying values from inst to entry[i].fvalue (or, ideally, to avoid copying at all).
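The second suggestion can be sketched with NumPy: view the entry buffer as a float32 array and do one vectorized copy plus one vectorized NaN scan instead of a per-element Python loop. This is only an illustration; the Entry union here is a hypothetical stand-in for the runtime's internal buffer, and assume_no_missing models the proposed skip-the-check option.

```python
import ctypes
import numpy as np


class Entry(ctypes.Union):
    # Hypothetical stand-in for the runtime's entry buffer:
    # union Entry { int missing; float fvalue; };
    _fields_ = [("missing", ctypes.c_int), ("fvalue", ctypes.c_float)]


def fill_entries(entry_buf, inst, assume_no_missing=False):
    """Vectorized replacement for the per-element loop in predict_instance."""
    # View the union buffer as float32; missing/fvalue overlap in memory,
    # so writing fvalue for every slot also clears any 'missing' flag.
    view = np.frombuffer(entry_buf, dtype=np.float32)
    view[:] = inst  # one C-level copy instead of a Python loop
    if not assume_no_missing:
        mask = np.isnan(inst)  # one vectorized NaN scan
        if mask.any():
            # Re-flag missing slots by writing -1 through the int view
            ints = np.frombuffer(entry_buf, dtype=np.int32)
            ints[mask] = -1
```

With assume_no_missing=True the NaN scan is skipped entirely, which corresponds to the first suggestion above.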

Regards, Vedran

hcho3 commented 4 years ago

Closing this, since I decided to drop the single-instance prediction feature.