dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0
26.32k stars 8.73k forks

Separate predictions are 4x slower than bulk predictions #1756

Closed goupdown closed 6 years ago

goupdown commented 8 years ago

I have data containing 883k instances and a trained booster. I predict all the instances at once as a single matrix, and then one by one with a separate prediction call per instance. For each separate call I have to create a separate matrix containing just one instance. Here is my code:

#include <stdio.h>
#include <time.h>
#include <xgboost/c_api.h>

// The original used an unspecified time() helper returning milliseconds;
// clock() serves the same purpose here.
static long time_ms(void) {
    return (long)(clock() * 1000L / CLOCKS_PER_SEC);
}

int main() {
    BoosterHandle h_booster;
    XGBoosterCreate(0, 0, &h_booster);
    XGBoosterLoadModel(h_booster, "C:/ro/server33/ro_cluster/PrXgb.save/f.sit1.2.obj");
    XGBoosterSetParam(h_booster, "nthread", "1");

    const int instLen = 22;
    int dataLen = 883567;
    DMatrixHandle h_test;
    XGDMatrixCreateFromFile("C:/ro/server33/ro_cluster/cache_data/st3.sit1.h2", 0, &h_test);

    long st = time_ms();
    bst_ulong out_len;
    const float *out;
    XGBoosterPredict(h_booster, h_test, 0, 0, &out_len, &out);
    printf("Bulk predict: %ldms\n", time_ms() - st);

    long tp = 0;
    st = time_ms();
    for (int j = 0; j < dataLen; j++) {
        // Just take the first row of the data
        float inst[] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 2, 2, 0, 0, 0, 0, 0 };

        DMatrixHandle hm;
        XGDMatrixCreateFromMat(inst, 1, instLen, -1, &hm);
        long stp = time_ms();
        XGBoosterPredict(h_booster, hm, 0, 0, &out_len, &out);
        tp += time_ms() - stp;
        XGDMatrixFree(hm);  // the original leaked one DMatrix per iteration
    }
    printf("1 by 1 predict total %ldms, pure %ldms\n", time_ms() - st, tp);

    XGDMatrixFree(h_test);
    XGBoosterFree(h_booster);
    return 0;
}

Output:

C:\>XGBoostTest.exe
[23:19:25] 883566x23 matrix with 19438452 entries loaded from C:/ro/server33/ro_cluster/cache_data/st3.sit1.h2
Bulk predict: 550ms
1 by 1 predict total 2373ms, pure 945ms

What we see is that one-by-one prediction is 2x slower, and matrix creation adds another 2x of overhead, so 4x in total. Multi-threading is not a factor here because nthread=1.

If matrix creation is so slow, isn't it better to just use a float[] representation of the instance? Why is pure prediction 2x slower? How can the existence of other instances in the matrix improve the performance of predicting one particular instance?

Is it possible to have fixes for these problems in the foreseeable future?

qingpingswiftsword commented 8 years ago

Could you show your bulk prediction code?

goupdown commented 8 years ago

The bulk prediction code is in the original message above, right before the line that prints "Bulk predict: ...ms".

ungil commented 8 years ago

Why is pure prediction 2x slower? How can the existence of other instances in the matrix improve the performance of predicting one particular instance?

Aren't you calling XGBoosterPredict once in one case and 883567 times in the other? I don't think it's surprising that making a single prediction on one million rows of data is more efficient than making one million separate predictions on one single row each time. What were your expectations?

qingpingswiftsword commented 8 years ago

Does anyone know how xgboost does prediction in bulk mode? I read http://scikit-learn.org/stable/modules/computational_performance.html ; for logistic regression or SVM it is easy to see how bulk prediction becomes a matrix computation. It would be interesting to know how bulk prediction is done in tree mode.

goupdown commented 8 years ago

Aren't you calling XGBoosterPredict once in one case and 883567 times in the other? I don't think it's surprising that making a single prediction on one million rows of data is more efficient than making one million separate predictions on one single row each time. What were your expectations?

My expectation is that the two pieces of code below do the same thing with the same performance.

Code 1:

int x = 0;
for (int i = 0; i < 1000000; i++) {
    x = x + 1;
}

Code 2:

int x = 0;
void inc() {
    x = x + 1;
}
for (int i = 0; i < 1000000; i++) {
    inc();
}

Why do you think a function call should take meaningful time?

My guess is that XGBoost does some preparation inside the predict() function that is negligible when predicting a 1M-row matrix but hurts performance for a single prediction. Production use cases usually involve single instances. Imagine I need to recognize the faces of people entering a building, or decide whether to approve a credit application for some person. In real life there is no 1M-row data matrix prepared in advance.

qingpingswiftsword commented 8 years ago

I think that case 1 and case 2 are different if they have equal performance; I guessed that the C++ compiler has done the optimization.

goupdown commented 8 years ago

I guessed that the C++ compiler has done the optimization

Yes, it could. But this optimization would not change the performance significantly.

Why didn't the compiler do the same optimization in the case of XGBoost's predict() method? Actually, this is the root of my original question. To make a prediction, XGBoost needs to walk through several trees and sum the results (roughly speaking). Why is doing this once for a 1M-instance matrix faster than doing it 1M times for a single-instance matrix? In either case, each instance is a float[] input and the same trees are traversed.

Please do not say that a function call takes time in C++. That is not relevant to this case.

ungil commented 8 years ago

After jumping through several hoops, you end up running a loop over the rows in the matrix. The extra time comes from the overhead of getting to that point one million times instead of just once. There is no need to guess what happens, you can see the code.

goupdown commented 8 years ago

After jumping through several hoops, you end up running a loop over the rows in the matrix. The extra time comes from the overhead of getting to that point one million times instead of just once. There is no need to guess what happens, you can see the code.

Yes, you are right. And my original point is that the XGBoost code should be changed to avoid this per-call predict() overhead. Do you think this is possible?

Actually, there are two steps:

  1. Avoid matrix creation overhead (use float[]?)
  2. Avoid predict() call overhead

AbdealiLoKo commented 8 years ago

Are you using an OMP-enabled build? Internally, I assume xgboost uses OpenMP to run the prediction in parallel over the rows, while your for loop is written without OMP and uses only 1 thread. If you have 4 cores, that alone would make it 4 times slower.

2 tests to understand better:

goupdown commented 8 years ago

I'm using XGBoosterSetParam(h_booster, "nthread", "1") precisely to narrow the issue down to the code and the algorithm. Multi-threading does increase performance, but that is no reason to write code that decreases it. The issue we are discussing is about the algorithm, not about multi-threading.

JohnStott commented 8 years ago

I suspect (without looking at the source code) this is the difference between running the calculation on a single row (a 'for' loop) versus optimized calculations using processor-optimized libraries that are far superior when performed on matrices/vectors. Look up the BLAS library for an example of such an optimization. I assume the C++ compiler will use BLAS or some equivalent... my guess at least!

goupdown commented 8 years ago

Good guess, though it should be confirmed by the XGBoost developers. But that only relates to the 2x predict() slowdown. I also see that creating a one-row matrix takes time and gives another 2x slowdown. Is there any normal reason for that?