Azure / fast_retraining

Show how to perform fast retraining with LightGBM in different business cases
MIT License

XGBoost GPU benchmarks #62

Closed RAMitchell closed 7 years ago

RAMitchell commented 7 years ago

Hi, I am the author of the XGBoost GPU algorithms.

Your benchmarks of my GPU hist algorithm are simply running on the CPU, because the 'tree_method':'hist' parameter overrides the selection of the GPU updater. This was fixed some time ago, but it seems you are using an older commit. The correct usage is now to set 'tree_method':'gpu_hist'. I would appreciate it if you could update your benchmarks; I think you might find my algorithm far more competitive.
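
A minimal sketch of the parameter change described above, assuming the Python `xgb.train` interface used in the benchmark notebooks. The helper name `make_hist_params` and the base values are illustrative and not taken from the repository; only the `tree_method` values come from the comment above.

```python
def make_hist_params(use_gpu):
    """Return XGBoost histogram-algorithm parameters for CPU or GPU.

    Hypothetical helper: only the 'tree_method' values come from the
    discussion; the remaining entries are placeholders.
    """
    params = {
        'objective': 'binary:logistic',
        'eta': 0.1,
        'max_depth': 8,
    }
    # 'hist' runs the CPU histogram algorithm; 'gpu_hist' selects the GPU one.
    params['tree_method'] = 'gpu_hist' if use_gpu else 'hist'
    return params


cpu_params = make_hist_params(use_gpu=False)
gpu_params = make_hist_params(use_gpu=True)
print(cpu_params['tree_method'], gpu_params['tree_method'])
```

The resulting dict would be passed as the first argument to `xgb.train`. Note that XGBoost releases from 2.0 onward deprecate 'gpu_hist' in favour of 'tree_method':'hist' combined with 'device':'cuda', so the spelling to use depends on the version installed.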

I also noticed that the XGBoost CPU hist algorithm has not had the number of bins set correctly, so you would be comparing 256 bins for XGBoost against 63 bins for LightGBM. This was due to a mistake in our documentation regarding the naming of the parameter that I have noted in dmlc/xgboost#2567.
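
To make the bin counts comparable, both libraries can be given the same value explicitly. The sketch below assumes the parameter is spelled `max_bin` in both XGBoost and LightGBM and uses 63, the LightGBM figure quoted above. The helper `with_shared_bins` is hypothetical.

```python
SHARED_MAX_BIN = 63  # bin count quoted above for LightGBM


def with_shared_bins(params, max_bin=SHARED_MAX_BIN):
    """Return a copy of params with an explicit, shared histogram bin count."""
    updated = dict(params)
    # Assumption: a misspelled key such as 'max_bins' is not recognised,
    # which would leave the library default in place. Drop it and set
    # the intended key instead.
    updated.pop('max_bins', None)
    updated['max_bin'] = max_bin
    return updated


xgb_params = with_shared_bins({'tree_method': 'hist', 'max_bins': 63})
lgbm_params = with_shared_bins({'num_leaves': 2**8})
print(xgb_params['max_bin'], lgbm_params['max_bin'])
```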

Thanks Rory

guolinke commented 7 years ago

@RAMitchell According to the search results, it seems only one dataset (04_PlanetKaggle.ipynb) has the max_bin issue: https://github.com/Azure/fast_retraining/search?utf8=%E2%9C%93&q=max_bins&type= .

BTW, I think max_bin doesn't have much impact on training time when running on CPU. But we can fix it.

@miguelgfierro I think the XGBoost hist GPU result can be updated; it truly is almost the same as CPU hist.

miguelgfierro commented 7 years ago

@RAMitchell thank you for the feedback, I will make sure that we run the experiments again and will update both repo and blog post. To be honest, I was surprised by the results of xgb hist, so I appreciate your feedback.

Could you please tell me the exact commit of XGBoost that we could use?

Also, I will create a new branch to fix this issue. I would really appreciate it if you could check that I'm using the correct parameters for xgb hist.

guolinke commented 7 years ago

@RAMitchell BTW, should we use tree_method:gpu_exact for the xgb_gpu?

Also, it seems gpu_hist doesn't support loss_guide tree growth (which is slower but more accurate)? If so, it is hard to have an apples-to-apples comparison.

RAMitchell commented 7 years ago

This commit should be appropriate: 48f3003302c323475af957b164ac24a564babb6c. It is from June 29 and contains the interface changes.

I agree that the max_bin parameter should not significantly affect benchmarks. Good to be sure though :).

Yes, tree_method:gpu_exact would be appropriate. Note that if you use the above commit you will get a newer version of the exact algorithm, modified by some Nvidia people. The speed might be marginally higher and the memory usage also marginally higher.

My 'gpu_hist' algorithm does not support 'loss_guide'. It will be hard to have an apples to apples comparison but you have the same problem benchmarking the original XGBoost algorithm against LightGBM. I think the major pitfall of benchmarking accuracy between these two algorithms is limiting by number of leaves. This would allow LightGBM to reach far greater depths and is not really fair to XGBoost.

It is not very obvious how to choose these parameters unfortunately. All I can say is that the max depth of 6 in XGBoost is very commonly insufficient to get the best results.
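
The asymmetry between a depth limit and a leaf limit can be made concrete with a little arithmetic. A binary tree limited to depth d has at most 2^d leaves, while a tree limited to N leaves can, in the worst case, grow as a single chain of depth N-1. The function names below are illustrative.

```python
def max_leaves_at_depth(depth):
    """Upper bound on the leaf count of a binary tree with the given max depth."""
    return 2 ** depth


def max_depth_with_leaves(num_leaves):
    """Worst-case depth of a binary tree with num_leaves leaves.

    Reached when every split extends the same branch (a degenerate chain).
    """
    return num_leaves - 1


print(max_leaves_at_depth(6))
print(max_leaves_at_depth(8))
print(max_depth_with_leaves(2 ** 8))
```

So a leaf-wise learner capped at 2^8 leaves may build individual branches far deeper than a depth-wise learner capped at depth 6 or 8, even though a depth-8 tree has the same upper bound on its leaf count.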

RAMitchell commented 7 years ago

I should also mention that it is not possible to reproduce your experiments because the GPU VMs do not appear to be available to the public until December.

Edit: I am mistaken; this was December 2016? I was still not able to find the VM in Azure.

guolinke commented 7 years ago

@RAMitchell Not all regions have the NV VM; you can find the right region here: https://azure.microsoft.com/en-us/regions/services/

We know the original XGBoost has this problem, so we use xgb_hist+loss_guide for the accuracy comparison. The timing results in that setting are also comparable, so there is still an apples-to-apples comparison on CPU.

But on GPU, all XGBoost algorithms only support depth-wise growth. We know it is less accurate, but growing shallow trees is also faster. As a result, xgb_hist on GPU will have an advantage in speed but a loss in accuracy.
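
The two settings described above might look roughly like this as parameter dicts. This is a sketch: `grow_policy`, `max_leaves` and `max_depth` are XGBoost parameter names, but the specific values are assumptions chosen to mirror numbers that appear elsewhere in this thread, not the notebooks' actual configuration.

```python
# CPU: histogram algorithm with leaf-wise (loss-guided) growth,
# limited by leaf count rather than depth.
xgb_cpu_lossguide = {
    'tree_method': 'hist',
    'grow_policy': 'lossguide',
    'max_depth': 0,       # 0 means no depth limit under lossguide
    'max_leaves': 2**8,   # assumed value
}

# GPU: depth-wise growth only, so tree size is limited by depth instead.
xgb_gpu_depthwise = {
    'tree_method': 'gpu_hist',
    'max_depth': 8,       # assumed value
}

for name, cfg in [('cpu', xgb_cpu_lossguide), ('gpu', xgb_gpu_depthwise)]:
    print(name, cfg)
```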

miguelgfierro commented 7 years ago

@RAMitchell with the commit you mentioned I get an error:

[ 96%] Building NVCC (Device) object CMakeFiles/cuda_compile.dir/plugin/updater_gpu/src/cuda_compile_generated_updater_gpu.cu.o
Error copying file (if different) from "/home/hoaphumanoid/installer/xgboost_rory/build/CMakeFiles/cuda_compile.dir/plugin/updater_gpu/src/cuda_compile_generated_updater_gpu.cu.o.depend.tmp" to "/home/hoaphumanoid/installer/xgboost_rory/build/CMakeFiles/cuda_compile.dir/plugin/updater_gpu/src/cuda_compile_generated_updater_gpu.cu.o.depend".
CMake Error at cuda_compile_generated_updater_gpu.cu.o.cmake:232 (message):
  Error generating
  /home/hoaphumanoid/installer/xgboost_rory/build/CMakeFiles/cuda_compile.dir/plugin/updater_gpu/src/./cuda_compile_generated_updater_gpu.cu.o

CMakeFiles/runxgboost.dir/build.make:63: recipe for target 'CMakeFiles/cuda_compile.dir/plugin/updater_gpu/src/cuda_compile_generated_updater_gpu.cu.o' failed
make[2]: *** [CMakeFiles/cuda_compile.dir/plugin/updater_gpu/src/cuda_compile_generated_updater_gpu.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
/home/hoaphumanoid/installer/xgboost_rory/plugin/updater_gpu/src/exact/../device_helpers.cuh: In member function ‘void dh::Timer::printElapsed(std::__cxx11::string)’:
/home/hoaphumanoid/installer/xgboost_rory/plugin/updater_gpu/src/exact/../device_helpers.cuh:226:54: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘int64_t {aka long int}’ [-Wformat=]
/home/hoaphumanoid/installer/xgboost_rory/plugin/updater_gpu/src/device_helpers.cuh: In member function ‘void dh::Timer::printElapsed(std::__cxx11::string)’:
/home/hoaphumanoid/installer/xgboost_rory/plugin/updater_gpu/src/device_helpers.cuh:226:54: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘int64_t {aka long int}’ [-Wformat=]
/home/hoaphumanoid/installer/xgboost_rory/plugin/updater_gpu/src/device_helpers.cuh: In member function ‘void dh::Timer::printElapsed(std::__cxx11::string)’:
/home/hoaphumanoid/installer/xgboost_rory/plugin/updater_gpu/src/device_helpers.cuh:226:54: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘int64_t {aka long int}’ [-Wformat=]
Scanning dependencies of target xgboost
[ 98%] Linking CXX shared library ../lib/libxgboost.so
[ 98%] Built target xgboost
CMakeFiles/Makefile2:184: recipe for target 'CMakeFiles/runxgboost.dir/all' failed
make[1]: *** [CMakeFiles/runxgboost.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2

What do you think the problem is?

Alternatively, I was able to compile commit 4eb255262fd8a7172815656c9eb3c148aa0d1e68 from Jul 18 on the first attempt. It is more recent than the commit you mentioned. Would you be happy if I use this commit?

RAMitchell commented 7 years ago

We made a lot of improvements to cmake over that time, so it's hard to say what the issue is. The safest thing to do is use the most recent. I was trying to give you a commit that was a bit closer to what you had before, for fairness, but I don't think it matters too much.

miguelgfierro commented 7 years ago

@RAMitchell I installed the more recent commit and ran the experiment airline_GPU with these parameters:

xgb_hist_params = {'max_depth':0, 
                  'max_leaves':2**8, 
                  'objective':'binary:logistic', 
                  'min_child_weight':30, 
                  'eta':0.1, 
                  'scale_pos_weight':2, 
                  'gamma':0.1, 
                  'reg_lamda':1, 
                  'subsample':1,
                  'tree_method':'gpu_hist', 
                  'updater':'grow_gpu_hist'
                 }

I got this error:

XGBoostError                              Traceback (most recent call last)
<ipython-input-22-797306538b1f> in <module>()
----> 1 xgb_hist_clf_pipeline, t_train = train_xgboost(xgb_hist_params, X_train, y_train, num_rounds)

<ipython-input-17-500123296f4f> in train_xgboost(parameters, X, y, num_rounds)
      2     ddata = xgb.DMatrix(data=X, label=y)
      3     with Timer() as t:
----> 4         clf = xgb.train(parameters, ddata, num_boost_round=num_rounds)
      5     return clf, t.interval

/anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/training.py in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, xgb_model, callbacks, learning_rates)
    202                            evals=evals,
    203                            obj=obj, feval=feval,
--> 204                            xgb_model=xgb_model, callbacks=callbacks)
    205 
    206 

/anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/training.py in _train_internal(params, dtrain, num_boost_round, evals, obj, feval, xgb_model, callbacks)
     72         # Skip the first update if it is a recovery step.
     73         if version % 2 == 0:
---> 74             bst.update(dtrain, i, obj)
     75             bst.save_rabit_checkpoint()
     76             version += 1

/anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/core.py in update(self, dtrain, iteration, fobj)
    825         if fobj is None:
    826             _check_call(_LIB.XGBoosterUpdateOneIter(self.handle, ctypes.c_int(iteration),
--> 827                                                     dtrain.handle))
    828         else:
    829             pred = self.predict(dtrain)

/anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/core.py in _check_call(ret)
    128     """
    129     if ret != 0:
--> 130         raise XGBoostError(_LIB.XGBGetLastError())
    131 
    132 

XGBoostError: b'[13:59:09] /home/hoaphumanoid/installer/xgboost_rory/plugin/updater_gpu/src/updater_gpu.cu:66: GPU plugin exception: [13:59:09] /home/hoaphumanoid/installer/xgboost_rory/plugin/updater_gpu/src/gpu_hist_builder.cu:47: Check failed: !data.empty() DeviceHist must be externally allocated\n\nStack trace returned 10 entries:\n[bt] (0) /anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/libxgboost.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f9286b89adc]\n[bt] (1) /anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/libxgboost.so(_ZN7xgboost4tree10DeviceHist4InitEi+0x8c) [0x7f9286d92c6c]\n[bt] (2) /anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/libxgboost.so(_ZN7xgboost4tree14GPUHistBuilder8InitDataERKSt6vectorINS_9bst_gpairESaIS3_EERNS_7DMatrixERKNS_7RegTreeE+0x1e8d) [0x7f9286d9a6ad]\n[bt] (3) /anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/libxgboost.so(_ZN7xgboost4tree14GPUHistBuilder6UpdateERKSt6vectorINS_9bst_gpairESaIS3_EEPNS_7DMatrixEPNS_7RegTreeE+0x2e) [0x7f9286d9c07e]\n[bt] (4) /anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/libxgboost.so(_ZN7xgboost4tree12GPUHistMaker6UpdateERKSt6vectorINS_9bst_gpairESaIS3_EEPNS_7DMatrixERKS2_IPNS_7RegTreeESaISB_EE+0x206) [0x7f9286d7c1c6]\n[bt] (5) /anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/libxgboost.so(_ZN7xgboost3gbm6GBTree13BoostNewTreesERKSt6vectorINS_9bst_gpairESaIS3_EEPNS_7DMatrixEiPS2_ISt10unique_ptrINS_7RegTreeESt14default_deleteISB_EESaISE_EE+0x8c3) [0x7f9286c04633]\n[bt] (6) /anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/libxgboost.so(_ZN7xgboost3gbm6GBTree7DoBoostEPNS_7DMatrixEPSt6vectorINS_9bst_gpairESaIS5_EEPNS_11ObjFunctionE+0xa60) [0x7f9286c058f0]\n[bt] (7) 
/anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/libxgboost.so(_ZN7xgboost11LearnerImpl13UpdateOneIterEiPNS_7DMatrixE+0x22b) [0x7f9286c1f46b]\n[bt] (8) /anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/libxgboost.so(XGBoosterUpdateOneIter+0x27) [0x7f9286bdcd77]\n[bt] (9) /anaconda/envs/xgb_gpu/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7f92c2785540]\n\n\n\nStack trace returned 10 entries:\n[bt] (0) /anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/libxgboost.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f9286b89adc]\n[bt] (1) /anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/libxgboost.so(_ZN7xgboost4tree12GPUHistMaker6UpdateERKSt6vectorINS_9bst_gpairESaIS3_EEPNS_7DMatrixERKS2_IPNS_7RegTreeESaISB_EE+0x3ba) [0x7f9286d7c37a]\n[bt] (2) /anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/libxgboost.so(_ZN7xgboost3gbm6GBTree13BoostNewTreesERKSt6vectorINS_9bst_gpairESaIS3_EEPNS_7DMatrixEiPS2_ISt10unique_ptrINS_7RegTreeESt14default_deleteISB_EESaISE_EE+0x8c3) [0x7f9286c04633]\n[bt] (3) /anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/libxgboost.so(_ZN7xgboost3gbm6GBTree7DoBoostEPNS_7DMatrixEPSt6vectorINS_9bst_gpairESaIS5_EEPNS_11ObjFunctionE+0xa60) [0x7f9286c058f0]\n[bt] (4) /anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/libxgboost.so(_ZN7xgboost11LearnerImpl13UpdateOneIterEiPNS_7DMatrixE+0x22b) [0x7f9286c1f46b]\n[bt] (5) /anaconda/envs/xgb_gpu/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/libxgboost.so(XGBoosterUpdateOneIter+0x27) [0x7f9286bdcd77]\n[bt] (6) /anaconda/envs/xgb_gpu/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7f92c2785540]\n[bt] (7) /anaconda/envs/xgb_gpu/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x1f5) 
[0x7f92c2784ce5]\n[bt] (8) /anaconda/envs/xgb_gpu/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(_ctypes_callproc+0x3dc) [0x7f92c277c7fc]\n[bt] (9) /anaconda/envs/xgb_gpu/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(+0x9d73) [0x7f92c2774d73]\n'

RAMitchell commented 7 years ago

max_depth should not be 0. max_leaves is unused; only set the depth. Don't set the updater, only tree_method.
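
Applied to the parameter dict from the previous comment, that advice could be sketched as follows. The helper `apply_gpu_hist_advice` and its `fallback_depth` default are hypothetical; the depth to use is a tuning choice that the advice itself does not fix.

```python
def apply_gpu_hist_advice(params, fallback_depth=8):
    """Return a copy of params adjusted for gpu_hist per the advice above.

    fallback_depth is an assumed value, used only when max_depth is
    missing or not positive.
    """
    fixed = dict(params)
    fixed.pop('updater', None)     # let tree_method choose the updater
    fixed.pop('max_leaves', None)  # reported as unused by gpu_hist
    if fixed.get('max_depth', 0) <= 0:
        fixed['max_depth'] = fallback_depth
    fixed['tree_method'] = 'gpu_hist'
    return fixed


original = {
    'max_depth': 0,
    'max_leaves': 2**8,
    'eta': 0.1,
    'tree_method': 'gpu_hist',
    'updater': 'grow_gpu_hist',
}
print(apply_gpu_hist_advice(original))
```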

miguelgfierro commented 7 years ago

I'm using these parameters:

xgb_hist_params = {'max_depth':8, 
                  'objective':'binary:logistic', 
                  'min_child_weight':30, 
                  'eta':0.1, 
                  'scale_pos_weight':2, 
                  'gamma':0.1, 
                  'reg_lamda':1, 
                  'subsample':1,
                  'tree_method':'gpu_hist', 
                 }

and I got the following result for the big dataset:

Computed XGBoost hist with 1e+08 samples in 1118.158s with AUC=0.603
Computed LightGBM with 1e+08 samples in 1042.223s with AUC=0.857

The time is much better than in the original version. This run corresponds to the one in the post with airline subsample size 100M, 500 rounds, xgb_hist, where the time was 2098s and AUC=0.856.

Are the parameters OK? Why is there such a difference in the AUC?

NOTE: I'm away for a week, so I might not be able to follow up with this conversation until the week after

RAMitchell commented 7 years ago

AUC seems wrong. See here for a reference: https://github.com/szilard/GBM-perf
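
One way to rule out a problem in the evaluation code is to recompute AUC independently on a small sample of labels and predicted scores. The sketch below is a plain-Python pairwise definition of AUC, not the metric code used in the notebooks. It is quadratic in the sample size, so it is only suitable for a spot check, and the example inputs are made up.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

If a spot check on a subsample agrees with the reported figure, the low AUC more likely comes from training than from the metric computation.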

miguelgfierro commented 7 years ago

I did a fresh run in another VM with the commit you told me. I ran 01_airline_GPU.ipynb using these parameters for xgboost_hist and LightGBM:

xgb_hist_params = {'max_depth':8, 
                  'objective':'binary:logistic', 
                  'min_child_weight':30, 
                  'eta':0.1, 
                  'scale_pos_weight':2, 
                  'gamma':0.1, 
                  'reg_lamda':1, 
                  'subsample':1,
                  'tree_method':'gpu_hist', 
                 }

lgbm_params = {'num_leaves': 2**8,
               'learning_rate': 0.1,
               'scale_pos_weight': 2,
               'min_split_gain': 0.1,
               'min_child_weight': 30,
               'reg_lambda': 1,
               'subsample': 1,
               'objective':'binary',
               'device': 'gpu',
               'task': 'train'
              }

I got something similar to the other results I reported: a speed-up in the algorithm but very low performance:

{
    "lgbm": {
        "performance": {
            "AUC": 0.8422509285268153,
            "Accuracy": 0.7257980036677117,
            "F1": 0.7462596869605246,
            "Precision": 0.6653401839514281,
            "Recall": 0.849587589195607
        },
        "test_time": 46.53701155199997,
        "train_time": 634.3887890320002
    },
    "xgb_hist": {
        "performance": {
            "AUC": 0.5708528394163612,
            "Accuracy": 0.5599637939038674,
            "F1": 0.6275373675179221,
            "Precision": 0.5244543226440088,
            "Recall": 0.7810562854618314
        },
        "test_time": 16.03199911200045,
        "train_time": 613.3178137940004
    }
}
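
The trade-off in these numbers can be summarised with two derived quantities, the training-time ratio and the AUC gap. The snippet below copies the relevant values from the results above; the variable names are illustrative.

```python
# AUC and training time (seconds) copied from the results above.
results = {
    'lgbm': {'AUC': 0.8422509285268153, 'train_time': 634.3887890320002},
    'xgb_hist': {'AUC': 0.5708528394163612, 'train_time': 613.3178137940004},
}

# Ratio above 1 means xgb_hist trained faster than lgbm.
time_ratio = results['lgbm']['train_time'] / results['xgb_hist']['train_time']
# Positive gap means lgbm reached the higher AUC.
auc_gap = results['lgbm']['AUC'] - results['xgb_hist']['AUC']

print(f"lgbm/xgb_hist train time ratio: {time_ratio:.3f}")
print(f"AUC gap (lgbm - xgb_hist): {auc_gap:.3f}")
```

In this run the training times differ by only a few percent, while the AUC differs by roughly 0.27, so the accuracy gap is the dominant effect.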

In the part of sampling different data sizes I got these results:

Computed XGBoost hist with 1e+07 samples in 134.728s with AUC=0.839
Computed LightGBM with 1e+07 samples in 139.618s with AUC=0.855

Computed XGBoost hist with 1e+08 samples in 1263.184s with AUC=0.628
Computed LightGBM with 1e+08 samples in 948.931s with AUC=0.857

In case you want to replicate the results, all the code is available. If in the future the XGBoost team fixes the issue and improves the performance, you could repeat the experiments and do a new comparison. We will be happy to see it. For now I'm going to close this issue.