dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Running predict in parallel #2274

Closed: FelixNeutatz closed this issue 6 years ago

FelixNeutatz commented 7 years ago

Hi everybody,

I am a big fan of parallel training in XGBoost. However, in my latest project, prediction became the bottleneck. First, I tried to parallelize it with naive Python multithreading, but I found that the model is not thread-safe, as reported here: https://groups.google.com/forum/#!searchin/xgboost-user/predict$20thread%7Csort:relevance/xgboost-user/h0WCVGJUv_A/cFT3o1v9MSwJ

Is there some trick to do prediction in parallel anyway? (I don't want to set up Flink or Spark; I would prefer to stay in the Python setup.)

Thank you for your help.

Best regards, Felix

P.S. This is a repost of https://groups.google.com/forum/#!topic/xgboost-user/ZqwOnioP0y8. Since I didn't get an answer there, I'm trying again here :)

FelixNeutatz commented 7 years ago

I found the problem: I was converting a really big matrix into the XGBoost format (DMatrix) every time I predicted. This is why it took so long:

testdmat = xgb.DMatrix(feature_matrix)
y_pred = final_gb.predict(testdmat)
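A minimal sketch of the fix, assuming (as in the snippet above) that feature_matrix and the trained booster final_gb stay fixed between calls: build the DMatrix once and reuse it, so each predict() call no longer pays the conversion cost.

import xgboost as xgb

# pay the expensive conversion cost once, outside the prediction loop
testdmat = xgb.DMatrix(feature_matrix)

for _ in range(100):                     # illustrative repeated predictions
    y_pred = final_gb.predict(testdmat)  # predict() itself is comparatively cheap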
FelixNeutatz commented 6 years ago

Just for people who have a similar issue: it turns out that converting sparse matrices into xgb.DMatrix is the bottleneck (see also https://stackoverflow.com/questions/46934782/scipy-sparse-matrix-to-xgboost-dmatrix-conversion-extremely-slow). I also could not parallelize the conversion:

import multiprocessing as mp
import xgboost as xgb

def convert(data):
    return xgb.DMatrix(data, nthread=-1)

def chunks(l, n):
    """Yield successive n-sized row chunks from l."""
    for i in range(0, l.shape[0], n):
        yield l[i:i + n]

def convert_in_parallel(data, jobs, model):
    # split the rows evenly across the worker processes
    chunk_size = (data.shape[0] + jobs - 1) // jobs
    data_chunks = chunks(data, chunk_size)

    pool = mp.Pool(processes=jobs)
    # this is where it fails: the workers cannot send DMatrix
    # objects back to the parent process (see the error below)
    results = pool.map(convert, data_chunks)
    pool.close()
    pool.join()

    probability_predictions = []
    for r in results:
        probability_predictions.append(model.predict(r))
    print(probability_predictions)

Pickle has an issue here:

multiprocessing.pool.MaybeEncodingError: Error sending result: '[<xgboost.core.DMatrix object at 0x7fa5280bb240>, <xgboost.core.DMatrix object at 0x7fa52811a4e0>, ...]'. Reason: 'ValueError('ctypes objects containing pointers cannot be pickled',)'
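One workaround sketch (the function names are illustrative, and it assumes the Booster object itself pickles cleanly via its serialization hooks): move the predict() call into the workers, so that only the resulting numpy arrays, which pickle fine, cross the process boundary.

import multiprocessing as mp
import xgboost as xgb

def predict_chunk(args):
    # build the DMatrix and predict inside the worker; only the
    # numpy prediction array is sent back to the parent process
    model, chunk = args
    return model.predict(xgb.DMatrix(chunk))

def predict_in_parallel(data, jobs, model):
    chunk_size = (data.shape[0] + jobs - 1) // jobs
    tasks = [(model, data[i:i + chunk_size])
             for i in range(0, data.shape[0], chunk_size)]
    with mp.Pool(processes=jobs) as pool:
        return pool.map(predict_chunk, tasks)

The per-chunk arrays can then be stitched together with np.concatenate. Note that the model is re-pickled for every task; sending it once per worker (e.g. via a Pool initializer) would avoid that overhead.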

Furthermore, I converted my matrix using data.todense(). This speeds up the conversion a lot, but in my case it hurts the accuracy of the classifier (my guess is that XGBoost can exploit the sparse structure by differentiating between zeros and missing values).
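A minimal illustration of that guess (the values are made up): with CSR input, unstored entries are treated as missing, while densifying turns them into explicit zeros that the trees see as ordinary feature values.

import numpy as np
import scipy.sparse as sp
import xgboost as xgb

X = sp.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]], dtype=np.float32))

# sparse input: the unstored entries are treated as missing values
dmat_sparse = xgb.DMatrix(X)

# dense input: the same entries become explicit 0.0 values, which the
# trees treat as ordinary feature values rather than as missing
dmat_dense = xgb.DMatrix(X.toarray())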

So, for now, I have no idea how to achieve fast, high-quality prediction for large sparse matrices. I would be happy if somebody could let me know how to make this happen.

Thank you for your help :)

hcho3 commented 6 years ago

@FelixNeutatz Can you check CPU usage and see whether all cores are used? How many cores does your machine have? I noticed that the DMatrix constructor uses all threads when loading a dense matrix (see XGDMatrixCreateFromMat_omp()) but only one thread when loading a sparse matrix (see XGDMatrixCreateFromCSREx()).
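A rough way to observe the two loading paths (the shape is arbitrary, and the CSR copy of a fully dense matrix carries extra conversion overhead of its own, so treat the numbers as indicative only):

import time
import numpy as np
import scipy.sparse as sp
import xgboost as xgb

X_dense = np.random.rand(100000, 200).astype(np.float32)
X_sparse = sp.csr_matrix(X_dense)   # same data in CSR layout

t0 = time.time()
xgb.DMatrix(X_dense, nthread=-1)    # multi-threaded loading path
t1 = time.time()
xgb.DMatrix(X_sparse)               # single-threaded loading path
t2 = time.time()

print("dense: %.2fs  sparse: %.2fs" % (t1 - t0, t2 - t1))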

hcho3 commented 6 years ago

@FelixNeutatz In the near future, I'd like to submit a pull request to perform multi-threaded loading of sparse matrices. Can you tell me the dimensions and density (% of nonzeros) of your input sparse matrix?
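(For anyone who wants to report these numbers: for a scipy sparse matrix X, the density can be computed directly, e.g.)

# fraction of explicitly stored (nonzero) entries
density = X.nnz / float(X.shape[0] * X.shape[1])
print(X.shape, "density: %.2f%%" % (100 * density))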

FelixNeutatz commented 6 years ago

Thanks @hcho3! That would be great. My input sparse matrix is not so sparse:

trivialfis commented 6 years ago

@hcho3 I went ahead and tried to implement an OpenMP version of XGDMatrixCreateFromCSREx(). In short, it doesn't work: because the input data can contain NaN, indptr needs to be recomputed, which is very hard to parallelize.

trivialfis commented 6 years ago

@hcho3 One way to do it is to run a parallel scan that counts the number of zeros in each row. I made a little benchmark to see how much performance gain I get when the data does not contain NaN, for the data shape mentioned above:

The conversion time drops from approximately 4 sec to 2 sec with 8 threads. If the input data is not of type np.float32, then the majority of the time is spent in ctypes transforming the data, in which case the time spent in XGDMatrixCreateFromCSREx() is negligible.
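To make the count-and-scan idea concrete, here is a Python/numpy sketch of the indptr recomputation (the real implementation would be C++ with OpenMP; the function name and the NaN-filtering criterion are illustrative):

import numpy as np

def rebuild_indptr(values, indptr):
    # count the entries in each row that survive NaN filtering; the
    # rows are independent, so this is the loop an OpenMP version
    # would parallelize
    counts = np.empty(len(indptr) - 1, dtype=indptr.dtype)
    for i in range(len(counts)):
        row = values[indptr[i]:indptr[i + 1]]
        counts[i] = np.count_nonzero(~np.isnan(row))

    # the prefix sum (the "scan") turns per-row counts into the new indptr
    new_indptr = np.zeros(len(indptr), dtype=indptr.dtype)
    np.cumsum(counts, out=new_indptr[1:])
    return new_indptr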

hcho3 commented 6 years ago

@trivialfis Thanks for your experiment. It seems like the performance gain won't be worth the effort. I suppose we should close this.