Closed: gaw89 closed this issue 7 years ago
This definitely looks like insufficient floating point precision. We do summation in single precision on the gpu. I will look into a fix in future.
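To illustrate the limits of single precision (a minimal numpy sketch; the actual accumulation pattern on the GPU differs, but the failure mode is the same):

```python
import numpy as np

# float32 has a 24-bit significand, so integers above 2**24 lose exactness.
big = np.float32(2**24)  # 16777216.0
one = np.float32(1.0)

# In single precision the addition rounds straight back to 2**24...
print(big + one == big)  # True

# ...while double precision represents the sum exactly.
print(np.float64(big) + 1.0 > np.float64(big))  # True
```

Summing millions of gradient contributions hits this wall: once the running total is large, small addends stop registering.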
Thanks for your response. I figured it had something to do with the precision. The problem is worst when using float64, and slightly less bad with float32 and float16. The best, as I said, is with int.
Does this mean there's nothing I can do on my end (I have zero experience in C and CUDA)?
Do you think a reasonable solution in the meantime would be to develop the model on GPU for speed then run final training on CPU (hoping that the optimal hyperparameters, etc. are the same on float/CPU as on int/GPU)?
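Concretely, I'm imagining something like this (just a sketch; the hyperparameter values are examples, not my actual model's, and 'grow_gpu' is the updater from this plugin):

```python
# Shared hyperparameters tuned during development.
base_params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'max_depth': 6,   # example values only
    'eta': 0.1,
}

# Fast iteration with the GPU updater...
gpu_params = dict(base_params, updater='grow_gpu')

# ...then a final run that falls back to the default CPU tree updater.
cpu_params = dict(base_params)
```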
Are you using 'grow_gpu_hist' or 'grow_gpu'?
This kind of thing normally only happens on very extreme values. It can be good practice to do some kind of scaling or normalisation on your features regardless of which machine learning algorithm you use if this is the case.
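For example, standardising each feature column (a minimal numpy sketch; `X` is a hypothetical feature matrix with columns on very different scales):

```python
import numpy as np

# Hypothetical feature matrix: one huge-valued column, one tiny-valued one.
X = np.array([[1e6, 0.001],
              [2e6, 0.002],
              [3e6, 0.003]])

# Standardise each column to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```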
Currently using 'grow_gpu'. My understanding was that with tree-based methods, feature scaling doesn't really make a difference (I've tried it before and don't recall seeing any effect in XGBoost), but I'll give it a shot again. I'll also try 'grow_gpu_hist' to see whether that has any impact.
I'll let you know what I come up with.
Sorry, you are correct, feature scaling will not make a difference. I meant to say label scaling. Try log scaling if the values are large.
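For a regression problem with large label values, that would be something like this (a numpy sketch; `y` is a hypothetical label vector):

```python
import numpy as np

# Hypothetical regression labels spanning several orders of magnitude.
y = np.array([1000.0, 50000.0, 2000000.0])

# Train on log-scaled labels; log1p stays defined at zero.
y_log = np.log1p(y)

# Invert the transform on predictions to recover the original scale.
y_back = np.expm1(y_log)
```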
No problem. I am not sure I follow about label scaling. Do you mean in a regression problem, take the log of the regression output? I am just doing binary classification in this case.
So it turns out 'grow_gpu_hist' is not much different performance-wise, but I didn't realize how big the speed boost would be - 5 seconds to train 100 rounds vs. 33 seconds with 'grow_gpu'!
GROW_GPU:
[0] validation-logloss:0.683218 training-logloss:0.68192
[10] validation-logloss:0.600431 training-logloss:0.591257
[20] validation-logloss:0.540824 training-logloss:0.524287
[30] validation-logloss:0.497868 training-logloss:0.474415
[40] validation-logloss:0.464888 training-logloss:0.434754
[50] validation-logloss:0.440185 training-logloss:0.404325
[60] validation-logloss:0.420208 training-logloss:0.379215
[70] validation-logloss:0.404739 training-logloss:0.359366
[80] validation-logloss:0.392556 training-logloss:0.343219
[90] validation-logloss:0.382536 training-logloss:0.329614
[99] validation-logloss:0.375699 training-logloss:0.319473
Raw Validation Accuracy: 0.84666317522070922
GROW_GPU_HIST:
[0] validation-logloss:0.683255 training-logloss:0.68194
[10] validation-logloss:0.594339 training-logloss:0.587754
[20] validation-logloss:0.53516 training-logloss:0.521548
[30] validation-logloss:0.49303 training-logloss:0.472418
[40] validation-logloss:0.461247 training-logloss:0.433685
[50] validation-logloss:0.436651 training-logloss:0.403764
[60] validation-logloss:0.417501 training-logloss:0.379227
[70] validation-logloss:0.402805 training-logloss:0.359669
[80] validation-logloss:0.391159 training-logloss:0.343627
[90] validation-logloss:0.381589 training-logloss:0.330329
[99] validation-logloss:0.374952 training-logloss:0.320207
Raw Validation Accuracy: 0.84699985036660186
Also, it appears that 'grow_gpu_hist' does not honor the 'seed' parameter properly. Is that a bug, or is it to be expected?
Run 1: Raw Validation Accuracy: 0.846102049977555
Run 2: Raw Validation Accuracy: 0.84666317522070922
Run 3: Raw Validation Accuracy: 0.84692503366751459
param['seed'] = 123 for each run.
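To be clear about what I expected from a fixed seed (a minimal sketch; `train_once` is a hypothetical stand-in for the full xgb.train run):

```python
import random

def train_once(seed):
    # Hypothetical stand-in for an xgb.train call: a properly seeded
    # trainer should return identical output for identical seeds.
    rng = random.Random(seed)
    return [round(rng.random(), 6) for _ in range(3)]

# This determinism property is what the three runs above violate.
assert train_once(123) == train_once(123)
```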
@RAMitchell, interestingly, this problem seems to have magically resolved itself overnight for me. I came into work this morning and ran it again, and these are the results:
GPU/float64 - Raw Validation Accuracy: 0.858061861839
CPU/float64 - Raw Validation Accuracy: 0.858847290272
GPU/int - Raw Validation Accuracy: 0.857463440177
Nothing in my code has changed since I ran this yesterday and posted above. Not sure what made the difference, but it appears to be resolved.
Still, the GPU accuracy is slightly worse than the CPU one, but given the roughly 5x speedup, that is acceptable for me.
Environment info
Operating System: Windows 7 (build 7601, Service Pack 1), 64-bit
CUDA 8.0, cuDNN 5.1
Hardware:
CPU: Xeon E5-1620 v4
GPU: Asus GTX 1080 Ti FE
Compiler: Downloaded pre-compiled version from Guido Tapia on 5/12/17
Package used (python/R/jvm/C++): Python
xgboost version used: 0.6
Steps to reproduce
This gist shows an example with dummy data:
https://gist.github.com/gaw89/7b0d68e5b79c9d26f00c8472edf99b28
Here's the output:
GPU ----------------------------
[0] validation-logloss:0.678408 training-logloss:0.677604
[10] validation-logloss:0.558674 training-logloss:0.55084
[20] validation-logloss:0.469776 training-logloss:0.456777
[30] validation-logloss:0.403235 training-logloss:0.385644
[40] validation-logloss:0.349319 training-logloss:0.328338
[50] validation-logloss:0.306017 training-logloss:0.282227
[60] validation-logloss:0.271768 training-logloss:0.245588
[70] validation-logloss:0.243736 training-logloss:0.215626
[80] validation-logloss:0.219791 training-logloss:0.19035
[90] validation-logloss:0.200407 training-logloss:0.169535
[99] validation-logloss:0.184985 training-logloss:0.153226
Raw Validation Accuracy: 0.96708
CPU ----------------------------
[0] validation-logloss:0.678401 training-logloss:0.677562
[10] validation-logloss:0.558419 training-logloss:0.550815
[20] validation-logloss:0.470883 training-logloss:0.45759
[30] validation-logloss:0.40312 training-logloss:0.385283
[40] validation-logloss:0.349083 training-logloss:0.32783
[50] validation-logloss:0.306609 training-logloss:0.282439
[60] validation-logloss:0.271177 training-logloss:0.244778
[70] validation-logloss:0.242388 training-logloss:0.214145
[80] validation-logloss:0.219018 training-logloss:0.189054
[90] validation-logloss:0.199262 training-logloss:0.168125
[99] validation-logloss:0.184332 training-logloss:0.152177
Raw Validation Accuracy: 0.96748
As you can see, CPU provides slightly better accuracy. However, when I try this on my real dataset (which I cannot include here), the difference is vast. In fact, the GPU logloss is all over the place.
GPU ----------------------------
[0] validation-logloss:20.5582 training-logloss:19.9119
[10] validation-logloss:8.30447 training-logloss:10.8549
[20] validation-logloss:6.82514 training-logloss:7.73837
[30] validation-logloss:19.178 training-logloss:15.8284
[40] validation-logloss:12.5911 training-logloss:10.5985
[50] validation-logloss:16.3701 training-logloss:12.7894
[60] validation-logloss:13.9307 training-logloss:10.9101
[70] validation-logloss:16.2863 training-logloss:12.2042
[80] validation-logloss:13.3689 training-logloss:10.2872
[90] validation-logloss:11.5154 training-logloss:9.0604
[99] validation-logloss:15.6321 training-logloss:12.9941
Raw Validation Accuracy: 0.462666467155
CPU ----------------------------
[0] validation-logloss:0.683447 training-logloss:0.681888
[10] validation-logloss:0.602152 training-logloss:0.592586
[20] validation-logloss:0.544172 training-logloss:0.525163
[30] validation-logloss:0.499751 training-logloss:0.475508
[40] validation-logloss:0.466776 training-logloss:0.435446
[50] validation-logloss:0.441168 training-logloss:0.40346
[60] validation-logloss:0.420259 training-logloss:0.378412
[70] validation-logloss:0.404089 training-logloss:0.358238
[80] validation-logloss:0.391373 training-logloss:0.342194
[90] validation-logloss:0.381069 training-logloss:0.328313
[99] validation-logloss:0.374225 training-logloss:0.318104
Raw Validation Accuracy: 0.847972467455
There are also substantial differences in feature importances between the various CPU/GPU, int/float versions of the model.
What have you tried?
GPU ----------------------------
[99] validation-logloss:0.375699 training-logloss:0.319473
Raw Validation Accuracy: 0.846663175221
CPU ----------------------------
[99] validation-logloss:0.375026 training-logloss:0.319558
Raw Validation Accuracy: 0.84587759988
Is this type of thing to be expected with differences in how GPU and CPU perform numerical operations? Is there any way to get comparable performance on GPU?