Peter-Devine / test_repo_0

A repo purely for testing Github API functions

speed for a single record #3

Open Peter-Devine opened 4 years ago

Peter-Devine commented 4 years ago

Did you know about https://github.com/dmlc/xgboost/issues/1849#issuecomment-266716752

Apparently xgboost4j is faster than this library for batch predictions in the current version. Do you have a test that compares predicting a single new record rather than 200k records? As described in the linked xgboost issue, xgboost4j's API supports only batch mode. What about your library?

I tested on a dataset of 200,000 records on Spark. xgboost4j-spark took 1775736 milliseconds, including implicit data transformations. xgboost-predictor-java took 4620104 milliseconds including data transformations and 2907550 milliseconds without them. I think xgboost4j's batch prediction is faster, so I will keep using xgboost4j.
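To put the figures above side by side, here is a small sketch that turns the three reported totals into a per-record average and a slowdown ratio. The class and method names are made up for illustration; only the millisecond totals and the record count come from the comment. Note that a per-record average over a batch is an amortized number, so it says little about the latency of a true single-record call.

```java
// Illustrative helper (hypothetical names); the constants are the figures quoted above.
class BenchmarkRatios {
    static final long RECORDS = 200_000L;
    static final long XGBOOST4J_SPARK_MS = 1_775_736L;          // includes implicit transformations
    static final long PREDICTOR_WITH_TRANSFORM_MS = 4_620_104L; // includes data transformations
    static final long PREDICTOR_NO_TRANSFORM_MS = 2_907_550L;   // prediction only

    // Average wall-clock time per record, amortized over the whole batch.
    public static double perRecordMs(long totalMs, long records) {
        return (double) totalMs / records;
    }

    // How many times longer the slower run took compared with the faster one.
    public static double ratio(long slowerMs, long fasterMs) {
        return (double) slowerMs / fasterMs;
    }

    public static void main(String[] args) {
        System.out.printf("xgboost4j-spark:            %.2f ms/record%n",
                perRecordMs(XGBOOST4J_SPARK_MS, RECORDS));
        System.out.printf("predictor (with transform): %.2f ms/record, %.2fx xgboost4j-spark%n",
                perRecordMs(PREDICTOR_WITH_TRANSFORM_MS, RECORDS),
                ratio(PREDICTOR_WITH_TRANSFORM_MS, XGBOOST4J_SPARK_MS));
        System.out.printf("predictor (no transform):   %.2f ms/record, %.2fx xgboost4j-spark%n",
                perRecordMs(PREDICTOR_NO_TRANSFORM_MS, RECORDS),
                ratio(PREDICTOR_NO_TRANSFORM_MS, XGBOOST4J_SPARK_MS));
    }
}
```

On these numbers, xgboost-predictor-java took roughly 2.6 times as long as xgboost4j-spark with transformations and roughly 1.6 times as long without them.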

Peter-Devine commented 4 years ago

Any conclusion here?

Peter-Devine commented 4 years ago

The benchmark results posted in README.md look quite misleading. They claim that the current JVM version is a few orders of magnitude faster than xgboost4j, and if you run the benchmark you will get similar results. However, if you dig deeper, you will find that xgboost4j spends most of its time creating the DMatrix object, which is not in sparse format (by default) and is huge: 100x100000. I believe that using a sparse matrix format would improve performance. I ran the benchmark with a DMatrix of size 80x100, which is closer to my use case, and xgboost4j performed better (30-40% faster).
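For anyone who wants to try the sparse route, here is a rough sketch of converting a dense row-major feature array into CSR form (row headers, column indices, values), using only the standard library. The class `CsrSketch` is hypothetical and not part of either library. As far as I know, xgboost4j's `DMatrix` has a constructor that accepts arrays in this layout together with `DMatrix.SparseType.CSR`, but check the Javadoc for your version. One caveat: entries dropped from a sparse matrix are generally treated as missing by XGBoost, which may not match a model trained on explicit zeros.

```java
// Hypothetical helper: dense row-major float[] -> CSR arrays (stdlib only).
final class CsrSketch {
    final long[] rowHeaders; // length nrow + 1; row r spans [rowHeaders[r], rowHeaders[r + 1])
    final int[] colIndex;    // column of each stored value
    final float[] data;      // the stored (non-zero) values

    private CsrSketch(long[] rowHeaders, int[] colIndex, float[] data) {
        this.rowHeaders = rowHeaders;
        this.colIndex = colIndex;
        this.data = data;
    }

    // Assumption: a zero in the dense input means "no value" and can be dropped.
    static CsrSketch fromDense(float[] dense, int nrow, int ncol) {
        if (dense.length != (long) nrow * ncol) {
            throw new IllegalArgumentException("dense.length must equal nrow * ncol");
        }
        // First pass: count non-zeros so the output arrays can be sized exactly.
        int nnz = 0;
        for (float v : dense) {
            if (v != 0.0f) {
                nnz++;
            }
        }
        long[] headers = new long[nrow + 1];
        int[] cols = new int[nnz];
        float[] vals = new float[nnz];
        // Second pass: copy non-zeros row by row, recording where each row starts.
        int k = 0;
        for (int r = 0; r < nrow; r++) {
            headers[r] = k;
            for (int c = 0; c < ncol; c++) {
                float v = dense[r * ncol + c];
                if (v != 0.0f) {
                    cols[k] = c;
                    vals[k] = v;
                    k++;
                }
            }
        }
        headers[nrow] = k;
        return new CsrSketch(headers, cols, vals);
    }
}
```

A dense 100x100000 float matrix holds 10 million values no matter how many are zero, while the CSR form stores only the non-zeros plus their column indices and one header per row. Whether that actually speeds up DMatrix creation in the README benchmark would still need to be measured.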

Peter-Devine commented 4 years ago

I have benchmarked several of the available libraries, among them XGBoost4j and XGBoost-Predictor. You can take a look here if you are interested.