XGBoost MLeap bundle speed

drei34 commented 2 years ago

Hi, I have the question below from another repo that I think is no longer active, so I pasted it. Basically, I don't quite understand why with bigger data batch sizes MLeap XGBoost bundle seems to be running faster. I assume it is threading but unsure. Please let me know that and if I can turn off such optimizations, I am trying to compare with something that's unoptimized currently.

jsleight commented 2 years ago

This will depend on which xgboost runtime you are using. We have two xgboost runtimes:

[default] dmlc xgboost which is C++ behind the scenes. DMLC xgboost does have threading (search xgboosts to figure out how to control that). But in addition to the threading dmlc xgboost's prediction performance is better for batch operations because a) we can streamline the jni transfer and b) xgboost's c++ uses avx instructions to literally batch computations.
XGBoost-Predictor which is a java native re-implementation of xgboost's inference engine. This typically has much better single row prediction times (because no jni transfers), but is less performant for batch prediction since it doesn't do threading or avx vectorizing.

See https://github.com/combust/mleap/tree/master/mleap-xgboost-runtime has details on how to swap between them

P.S. I'm guessing your chart is showing the stats per row and not the aggregate for batch size? I.e., the mean time for batch_size=20 is 0.625*20 in aggregate. It would be pretty surprising to me if predict(50_rows) completed faster than predict(1_row)

drei34 commented 2 years ago

Thanks! I ran for 1000 iterations for a fixed batch size, so for example 1000 iterations of batch size 1 took 1.05 1000 ms. For batch size 20, it was 0.625 1000 and for size 50 it was 0.468 * 1000. So yes, I'm showing predict(50_rows) < predict(1_rows) which is what is curious. This is not expected? Do you have a slack channel btw?

drei34 commented 2 years ago

To be clear ... I am making a Transformer in java from the mleap bundle and then just taking the prediction time for different data frames I generate of a fixed size. And this is giving me this counterintuitive result ...

jsleight commented 2 years ago

I definitely would not expect predict(50_rows) < predict(1_rows).

predict(50_rows) / 50 < predict(1_rows) would obviously make sense.

Only ideas I have are some weirdness in the benchmarking setup like cache warming, startup, etc.. If you're not already, then using jmh for benchmarking is usually helpful for eliminating that kind of noise.

drei34 commented 2 years ago

Right I also am a bit weirded out by this but also in production the worst latencies I saw seem to have been by requests which have a 1 feature row; more feature rows seems to do better (a large batch has better latency than 1 row, predict(50_rows) < predict(1_row)). So this is confirming what I see but it does not make sense and I'm trying to get an understanding of it ... Is it possible that the reason this is happening is that in the case when batches are small the number of threads that go up is large and then they "wait" to come down and this has some sort of inefficiency? I did not use jmh yet but I'm also loading another model and for this model when the number of rows grows the latency grows, which makes sense (predict(50_rows) > predict(1_rows)). The only thing I can come up with currently is the threading inside of the bundle has some optimization specific to larger batches and it's detrimental to smaller batches ... Can try jmh and come back or maybe a quick zoom?

combust / mleap

XGBoost MLeap bundle speed #833