Could you show your bulk prediction code?
The bulk prediction code is in the original message above, right before the line: printf("Bulk predict: %dms\n", time() - st);
Why is pure prediction 2x slower? How can the existence of other instances in the matrix improve the performance of the prediction for one particular instance?
Aren't you calling XGBoosterPredict once in one case and 883567 times in the other? I don't think it's surprising that making a single prediction on one million rows of data is more efficient than making one million separate predictions on one single row each time. What were your expectations?
Does anyone know how XGBoost does prediction in bulk mode? I read http://scikit-learn.org/stable/modules/computational_performance.html ; for logistic regression or SVM it is easy to see how bulk prediction maps onto a matrix computation. It would be interesting to know how bulk prediction is done for tree models.
Aren't you calling XGBoosterPredict once in one case and 883567 times in the other? I don't think it's surprising that making a single prediction on one million rows of data is more efficient than making one million separate predictions on one single row each time. What were your expectations?
My expectation is that the two pieces of code below do the same thing and have the same performance.
Code 1:
int main() {
    int x = 0;
    for (int i = 0; i < 1000000; i++) {
        x = x + 1;
    }
    return 0;
}
Code 2:
int x = 0;

void inc() {
    x = x + 1;
}

int main() {
    for (int i = 0; i < 1000000; i++) {
        inc();
    }
    return 0;
}
Why do you think a function call should take a meaningful amount of time?
My guess is that XGBoost does some preparation inside the predict() function which is negligible in the case of a 1mln-row matrix prediction, but hurts performance in the case of a single prediction. Production use cases are usually single-instance. Imagine I need to recognize the faces of people entering a building, or decide whether to approve credit for some person. There is no 1mln-row data matrix in advance in real life.
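For what it's worth, here is a minimal sketch of that single-instance serving pattern with the XGBoost C API; the helper name and the NaN missing-value marker are just placeholders, and the exact XGBoosterPredict signature varies slightly between versions:

#include <math.h>
#include <xgboost/c_api.h>

/* Hypothetical serving-style helper: score one incoming instance with an
 * already-loaded booster. h_booster and num_features come from elsewhere. */
float predict_one(BoosterHandle h_booster, const float *features, bst_ulong num_features) {
    DMatrixHandle h_row;
    bst_ulong out_len = 0;
    const float *out_result = NULL;

    /* A one-row DMatrix has to be built for every single request... */
    XGDMatrixCreateFromMat(features, 1, num_features, NAN, &h_row);

    /* ...and every call pays the fixed predict() setup cost.
     * (Older C API signature; newer releases add a 'training' argument.) */
    XGBoosterPredict(h_booster, h_row, 0, 0, &out_len, &out_result);

    float score = (out_len > 0) ? out_result[0] : 0.0f;
    XGDMatrixFree(h_row);
    return score;
}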
I think that case 1 and case 2 are different; if they have equal performance, I would guess the C++ compiler has done the optimisation.
I would guess the C++ compiler has done the optimisation
Yes, it could. But that optimization would not change the performance significantly.
Why didn't the compiler do the same optimisation in the case of the XGBoost predict() method? Actually, this is the root of my original question. To make a prediction, XGBoost needs to go through several trees and sum the results (roughly speaking). Why is doing this for a 1mln-instance matrix faster than doing it 1mln times for a one-instance matrix? In either case, for each instance we have a float[] as input and the same trees to go through.
Please do not say that a function call takes time in C++; that is not relevant for our case.
After jumping through several hoops, you end up running a loop over the rows in the matrix. The extra time comes from the overhead of getting to that point one million times instead of just once. There is no need to guess what happens, you can see the code.
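Roughly speaking, and only as a conceptual sketch with made-up types rather than the actual xgboost source, each predict call amounts to something like this, which is why the fixed setup dominates when the row loop has a single iteration:

#include <stddef.h>

/* Conceptual sketch only -- hypothetical types, not the real xgboost internals. */
typedef struct { int feature; float threshold; float leaf_value; int left; int right; int is_leaf; } Node;
typedef struct { const Node *nodes; } Tree;
typedef struct { const Tree *trees; size_t num_trees; } Booster;
typedef struct { const float *data; size_t num_rows; size_t num_cols; } Matrix;

static float walk_tree(const Tree *t, const float *row) {
    int i = 0;
    while (!t->nodes[i].is_leaf)                       /* root-to-leaf traversal */
        i = (row[t->nodes[i].feature] < t->nodes[i].threshold) ? t->nodes[i].left
                                                               : t->nodes[i].right;
    return t->nodes[i].leaf_value;
}

static void predict_sketch(const Booster *b, const Matrix *m, float *out) {
    /* ... fixed per-call setup: validate handles, allocate output buffers,
     * prepare thread-local prediction caches, etc. ... */
    for (size_t r = 0; r < m->num_rows; ++r) {         /* the row loop amortises the setup */
        float sum = 0.0f;
        for (size_t t = 0; t < b->num_trees; ++t)      /* the same trees for every row */
            sum += walk_tree(&b->trees[t], m->data + r * m->num_cols);
        out[r] = sum;
    }
}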
After jumping through several hoops, you end up running a loop over the rows in the matrix. The extra time comes from the overhead of getting to that point one million times instead of just once. There is no need to guess what happens, you can see the code.
Yes, you are right. And my original point is that the XGBoost code should be changed to avoid this per-call predict() overhead. Do you think this is possible?
Actually there are two steps:
Are you using an OMP-enabled version? Internally, I assume xgboost uses OMP to run the prediction in parallel over the rows, while your for loop is written without OMP and uses only one thread. Now, if you have 4 cores... it would be 4 times slower.
Two tests to understand this better:
1. Build with USE_OPENMP=0 (https://github.com/dmlc/xgboost/blob/master/make/config.mk#L30).
2. Use the time unix command and check the "system time" for both parts.

I'm using XGBoosterSetParam(h_booster, "nthread", "1"). This is to narrow the issue down to the code and the algorithm. Multi-threading increases performance, really. But that is not a reason to write code that decreases performance. The issue we are discussing is about the algorithm, not about multi-threading.
I suspect (without looking at the source code) this is the difference between running the calculation on a single row (the 'for' loop) versus optimised calculations using processor-optimised libraries that are far superior when performed on matrices/vectors. Look up the "BLAS library" for an example of such an optimisation. I assume the C++ compiler will use BLAS or some kind of equivalent... my guess at least!
Good guess; it should be confirmed by the XGBoost developers. But that would only relate to the 2x predict() slowdown. I also see that creating the one-row matrix takes time and gives a 2x slowdown too. Is there any good reason for this?
I have data containing 883k instances, and I already have a trained XGBooster. I predict all the instances as one matrix, and then all the instances one by one with a separate prediction call for each. For a separate prediction call I have to create a separate matrix which contains only one instance. Here is my code:
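(In outline, the comparison looks roughly like the following; the row/feature counts, the missing-value marker and the millisecond timer are placeholders rather than the exact original code, and the XGBoosterPredict signature varies between versions.)

#include <stdio.h>
#include <time.h>
#include <xgboost/c_api.h>

#define NUM_ROWS 883567   /* instance count from the description */
#define NUM_COLS 100      /* placeholder feature count */

static long now_ms(void) {                  /* crude millisecond clock */
    return (long)(clock() * 1000 / CLOCKS_PER_SEC);
}

void benchmark(BoosterHandle h_booster, const float *data) {
    bst_ulong out_len;
    const float *out_result;
    long st;

    /* Case 1: a single bulk call over the whole matrix. */
    DMatrixHandle h_all;
    XGDMatrixCreateFromMat(data, NUM_ROWS, NUM_COLS, 0.0f, &h_all);
    st = now_ms();
    XGBoosterPredict(h_booster, h_all, 0, 0, &out_len, &out_result);
    printf("Bulk predict: %ldms\n", now_ms() - st);
    XGDMatrixFree(h_all);

    /* Case 2: one one-row matrix and one predict call per instance. */
    st = now_ms();
    for (bst_ulong i = 0; i < NUM_ROWS; ++i) {
        DMatrixHandle h_row;
        XGDMatrixCreateFromMat(data + i * NUM_COLS, 1, NUM_COLS, 0.0f, &h_row);
        XGBoosterPredict(h_booster, h_row, 0, 0, &out_len, &out_result);
        XGDMatrixFree(h_row);
    }
    printf("One-by-one predict: %ldms\n", now_ms() - st);
}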
Output:
What we see is that one-by-one prediction is 2x slower. Also, matrix creation gives a 2x overhead, so it's 4x in total. Multi-threading is not an issue here because nthread=1.
If matrix creation is so slow, isn't it better just to use a float[] representation of the instance? Why is pure prediction 2x slower? How can the existence of other instances in the matrix improve the performance of the prediction for one particular instance?
Is it possible to have fixes for the above problems in the foreseeable future?