Closed kendu605 closed 6 years ago
@kendu605 can you tell us the operating system, the compiler, and the CPU usage (as reported by top) while training?
@phunterlau, I use Win7 64-bit and compiled xgboost with Visual Studio 2013; my PC has an 8-core CPU.
@kendu605 what is your CPU usage while training the model?
@phunterlau, the CPU usage was around 20%, 35%, 48%, and 60% as the 'nthread' parameter was changed from 1 to 4.
@kendu605, I find that compiling xgboost using the MinGW64 compiler creates a DLL that is about 50% faster than when I compile directly from MS Visual Studio. Could this be what you are seeing? Could your initial 0.4 build have been produced with a different compiler?
I tried different settings in Visual Studio, such as:
1) ensuring I was compiling in "Release" mode;
2) Project -> Properties -> Configuration Properties -> C/C++ -> Optimization: set "Favor Size Or Speed" to "Speed";
3) C/C++ -> Code Generation: played around with the "Floating Point Model" settings.
Unfortunately, this made little difference. The last thing I tried was:
4) C/C++ -> Code Generation -> Enable Enhanced Instruction Set. This is where I think my problem may lie: my CPUs support up to SSE4.1, but the maximum MS allows is SSE2. Unfortunately there is no AVX on my CPUs, so I can't use those settings. I am wondering whether MinGW64 does support SSE4.1 and that is why it runs faster when compiled with it? The only alternative reason I can think of is that MinGW is perhaps far superior to the MS compiler (for xgboost). If that is true, I wonder how other compilers would benchmark. Unfortunately I can't really find much info on Google about compiler benchmarks; the consensus opinion (from forums) seems to be, anecdotally, that they should all be of similar performance!
Perhaps you could try points 1-4 and see if they make any difference on your machine? And also a MinGW64 compile. I'd be very interested in your results.
@JohnStott, thanks a lot for your detailed explanation and for trying so many tests. I used VS 2013 win64 to compile both xgboost 0.4 and 0.6. I have tried the suggestions you provided, but got the same results as I posted before. From your description, I think it may be that the VS 2013 compiler can't support all features of xgboost 0.6, so it is slower than 0.4. I will try to re-compile it via VS 2015 or MinGW64, and then see if I can get better performance from version 0.6. Thanks again for your advice.
I forgot to mention that I was using VS 2015 when I tried the above (and Python 2.7; I've yet to look at Python 3+). I would suggest trying MinGW64 first, as you'll probably get the same results in VS 2015!?
@JohnStott, I just used VS 2015 to build xgboost 0.6 with your suggested settings, but unfortunately I got the same result when running the example code with version 0.6 on Python 3.4: it is still slower than version 0.4 on Python 3.4. I have tried many times to compile it using MinGW64, but they all fail. I use TDM-MinGW, with the commands provided on the xgboost official website: alias make='mingw32-make'; cp make/mingw64.mk config.mk; make -j4
When you say it fails, I assume you mean there are errors during compilation? Have you tried compiling the other libraries first? i.e., this is what I use:
cd dmlc-core
make -j4
cd ../rabit
make lib/librabit_empty.a -j4
cd ..
cp make/mingw64.mk config.mk
make -j4
I also use this version: https://sourceforge.net/projects/mingw-w64/
(when installing, choose "Architecture": x86_64, assuming you're using a 64-bit machine)
I have downloaded the new mingw-w64, but got the errors below when compiling with: cd dmlc-core; make -j4
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o io.o src/io.cc
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o data.o src/data.cc
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o config.o src/config.cc
src/io.cc:1:0: warning: -fPIC ignored for target (all code is position independent) [enabled by default]
src/data.cc:1:0: warning: -fPIC ignored for target (all code is position independent) [enabled by default]
src/config.cc:1:0: warning: -fPIC ignored for target (all code is position independent) [enabled by default]
In file included from src/data/././text_parser.h:11:0,
                 from src/data/./libsvm_parser.h:13,
                 from src/data/disk_row_iter.h:19,
                 from src/data.cc:12:
include/dmlc/omp.h:9:17: fatal error: omp.h: No such file or directory
compilation terminated.
make: *** [Makefile:83: data.o] Error 1
make: *** Waiting for unfinished jobs....
In file included from src/io/cached_input_split.h:16:0,
                 from src/io.cc:13:
include/dmlc/threadediter.h:210:3: error: 'thread' in namespace 'std' does not name a type
include/dmlc/threadediter.h:216:3: error: 'mutex' in namespace 'std' does not name a type
include/dmlc/threadediter.h:222:3: error: 'condition_variable' in namespace 'std' does not name a type
include/dmlc/threadediter.h:224:3: error: 'condition_variable' in namespace 'std' does not name a type
include/dmlc/threadediter.h: In constructor 'dmlc::ThreadedIter
include/dmlc/omp.h:9:17: fatal error: omp.h: No such file or directory
compilation terminated.
^ this seems to be the culprit!? Maybe try again with a totally fresh clone of xgboost?
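Since both failures point at the toolchain itself (omp.h missing, and 'thread' in namespace 'std' not being a type), a quick sanity check is to compile a tiny stand-alone file with the same -std=c++0x -fopenmp flags the Makefile uses. This is only a sketch and assumes the MinGW-w64 g++ is on your PATH; omp_check.cc is a hypothetical file name, not part of xgboost:

// omp_check.cc -- hypothetical stand-alone test, not part of xgboost.
// Build with roughly the same flags as the failing dmlc-core objects:
//   g++ -std=c++0x -fopenmp omp_check.cc -o omp_check && ./omp_check
// If this reproduces the omp.h / std::thread errors, the MinGW-w64 install
// itself lacks OpenMP and/or C++11 thread support, and re-cloning xgboost
// will not fix it.
#include <omp.h>
#include <thread>
#include <cstdio>

int main() {
  std::thread t([] { std::printf("std::thread works\n"); });
  t.join();
  std::printf("OpenMP max threads: %d\n", omp_get_max_threads());
  return 0;
}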
I cloned the new source, but it still doesn't help. I don't know what is wrong.
It seems that you are still using an (old?) makefile because if you look towards the bottom of:
https://github.com/dmlc/dmlc-core/blob/f35f14f30835af238257b979cc1fac3e41ff3291/Makefile
you will see:
line_split.o: src/io/line_split.cc
recordio_split.o: src/io/recordio_split.cc
input_split_base.o: src/io/input_split_base.cc
hdfs_filesys.o: src/io/hdfs_filesys.cc
s3_filesys.o: src/io/s3_filesys.cc
azure_filesys.o: src/io/azure_filesys.cc
local_filesys.o: src/io/local_filesys.cc
io.o: src/io.cc
data.o: src/data.cc
recordio.o: src/recordio.cc
config.o: src/config.cc
when I compile I get:
$ mingw32-make -j4
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o line_split.o src/io/line_split.cc
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o recordio_split.o src/io/recordio_split.cc
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o input_split_base.o src/io/input_split_base.cc
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o io.o src/io.cc
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o local_filesys.o src/io/local_filesys.cc
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o data.o src/data.cc
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o recordio.o src/recordio.cc
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o config.o src/config.cc
ar cr libdmlc.a line_split.o recordio_split.o input_split_base.o io.o local_filesys.o data.o recordio.o config.o
Notice the order is the same!
Yours starts with src/io.cc, which doesn't adhere to the makefile.
That would be my guess anyway.
Actually, I only checked the first couple of lines; it seems the order of mine is not exactly the same. Sorry, I'm not sure? Maybe someone else can help?
Maybe check that your makefile is identical to the one at the link above, just to be sure?
...just had another quick look: at line 33 of the makefile (see link above) you will see:
OBJ=line_split.o recordio_split.o input_split_base.o io.o local_filesys.o data.o recordio.o config.o
which matches that order, so I think I understand the makefile now. I assume your line 33 is different, hence your bad output!
Hi JohnStott, I copied the makefile from the source and tried to make again, but nothing better happened. Besides MinGW64, do I need to install any other software to compile it successfully? As stated in my original post, I can compile xgboost successfully on both version 0.6 and 0.4 with VS2013 or VS2015; the confusing thing is that version 0.6 runs much slower than 0.4.
I originally thought that there is perhaps something in version 0.6 that VS doesn't like, or that at least hurts efficiency somewhere, given my observations of the speed difference between MinGW64 and VS compilations even after manually optimising the VS settings.
I imagine this is pretty frustrating! If I were you, and you haven't already tried the following, I'd give it a go:
1) rename the root directory of each of your existing xgboost directories;
2) re-download xgboost and place it into a directory with a new, never-before-used name;
3) try compiling as discussed previously.
If you get different errors, then I'd try re-downloading xgboost and placing it into a directory with the same name as the very first xgboost you ever used and compiled.
Basically, all of the above is to check that new clones don't end up pointing to old files (in case settings get stored in the registry, etc.).
I see a similar difference between versions building XGBoost models in R on Windows.
Details:
Classification model from a training set with 145 categories, 11K rows, 260 continuous features
Model contains 100 trees
Time to build with xgboost 0.4-3 on R 3.1.2: 10 minutes
Time to build with xgboost 0.6-4 on R 3.3.2: 51 minutes
I did not build the xgboost package myself but just installed the download from CRAN.
Has anyone yet come up with a good explanation for this or a workaround other than simply using the older version?
I can confirm a similar speed degradation (~10x slowdown) when comparing the recent xgboost to 0.4-3 (on both Windows and Linux) using a simulated dataset with the same parameters as @dana33 has reported. Will need to take a deeper look...
One difference might be the switch of the default missing value from 0 to NA, which makes the 0s get enumerated when the matrix is passed in as a dense matrix. Try the following to confirm:
dmat = xgb.DMatrix(data, missing=0)
@tqchen, thank you for the suggestion. I tried your suggestion, but it had no effect on the speed.
In case it matters, here is how I construct my data matrix:
Before:
x <- xgb.DMatrix(sparse.model.matrix(~.-1, data=xy[-1]))
After:
x <- xgb.DMatrix(sparse.model.matrix(~.-1, data=xy[-1]), missing=0)
(The data come in as a dense data.frame xy containing both the features and the response.)
The time to build the model in R 3.3.2 with xgboost 0.6-4 is the same in both cases, and still much slower compared to R 3.1.2 with xgboost 0.4-3.
P.S. The reason for the call to sparse.model.matrix is to do one-hot encoding for any factors in the input data. In this particular case, all features are continuous, so I was able to try the following:
x <- xgb.DMatrix(data=as.matrix(xy[-1]), missing=0)
Unfortunately, this still did not improve the speed.
@tqchen I think I see the issue: a large amount of time is spent in prediction cache updates at https://github.com/dmlc/xgboost/blob/9fb46e2c5efbb7ea7bb0cbb0f815dbdc9b720177/src/gbm/gbtree.cc#L475. This compounds to nclass^2 complexity, since the update is done for each and every separately committed tree, and PredValue is called for all output groups within PredLoopSpecalize.
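For intuition on the nclass^2 point, here is a minimal counting sketch; it is plain C++, not xgboost code, and 145 is chosen only to match the 145-category model reported above:

// Counts how many per-group prediction passes one multi-class boosting round
// costs if the cache is refreshed eagerly after every committed tree
// (num_class commits per round) versus once lazily at the end of the round.
#include <cstdio>

int main() {
  const int num_class = 145;  // matches the 145-category model reported above
  long eager_passes = 0;
  for (int committed_tree = 0; committed_tree < num_class; ++committed_tree) {
    eager_passes += num_class;  // each commit touches every output group
  }
  const long lazy_passes = num_class;  // one refresh covers the whole round
  std::printf("per round: eager = %ld passes, lazy = %ld passes\n",
              eager_passes, lazy_passes);  // 21025 vs 145 here
  return 0;
}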
@khotilov, thank you for the detective work. Do you know how this is done differently in version 0.4-3 to make it so much faster?
@dana33 That caching mechanism was introduced in https://github.com/dmlc/xgboost/commit/ecec5f7959cbe37a14f0ef83c9736f9c2a9490dc#diff-36e32f8e52bbd405d8cc60e601c9ae41, which was way after 0.4-3.
@khotilov Can you confirm this by doing a bit of timing around the UpdateCache, as well as the tree growing, between 0.4-3 and 0.6? The predictive cache was also present in 0.4-3, except that it was updated lazily instead of eagerly.
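For anyone wanting to reproduce this kind of measurement, below is a minimal sketch of ad-hoc wall-clock timing with std::chrono; ScopedTimer and the section labels are illustrative placeholders, not xgboost APIs, and the idea is simply to paste such a stopwatch around the tree-growing and cache-update sections of gbtree.cc before recompiling:

// Illustrative only: a small RAII stopwatch for comparing the wall-clock cost
// of different sections inside one boosting iteration.
#include <chrono>
#include <cstdio>

struct ScopedTimer {
  const char* label;
  std::chrono::steady_clock::time_point start;
  explicit ScopedTimer(const char* l)
      : label(l), start(std::chrono::steady_clock::now()) {}
  ~ScopedTimer() {
    double ms = std::chrono::duration<double, std::milli>(
                    std::chrono::steady_clock::now() - start).count();
    std::printf("[timing] %s: %.1f ms\n", label, ms);
  }
};

int main() {  // stand-in for one boosting iteration
  { ScopedTimer t("grow trees");   /* tree construction would go here */ }
  { ScopedTimer t("update cache"); /* prediction cache refresh would go here */ }
  return 0;
}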
@tqchen I did actually find it by narrowing down where the time was spent in the current code. E.g., in a single boosting iteration, the tree-building part was taking ~15 sec, and then ~50 sec was spent updating the cache for the training data.
I have also noticed a performance drop on my machine between xgboost versions 0.4.x and 0.6.x. In case anyone finds it useful, and maybe it can shed some light on the problem, I'm sharing my results. Here is the R source code which I used:
require(xgboost)
require(data.table)
require(dplyr)
#data: https://www.kaggle.com/c/otto-group-product-classification-challenge/data
train.csv <- fread('data/otto-train.csv', header = T, stringsAsFactors = F)
test.csv <- fread('data/otto-test.csv', header = T, stringsAsFactors = F)
x <- train.csv %>% select(-id, -target) %>% sapply(as.numeric) %>% as.matrix
y <- train.csv$target %>% factor() %>% as.integer() %>% as.matrix %>% -1
param <- list("objective" = "multi:softprob",
"eval_metric" = "mlogloss",
"num_class" = length(unique(y)) )
ptm <- proc.time()
bst = xgboost(param=param, data = x, label = y, nrounds=500, verbose = 0, nthread=8)
training.time <- proc.time() - ptm
packageVersion("xgboost")
training.time
and timing results; three tests for each combination of xgboost version and thread count:
nthread = 8
xgboost user system elapsed
0.4.3 656.376 0.380 83.414
646.200 0.264 81.534
653.816 0.300 82.670
0.4.4 648.556 0.256 81.853
647.780 0.292 81.760
665.864 0.548 84.708
0.6.0 1742.980 0.736 224.632
1743.568 0.804 223.994
1781.148 1.636 230.141
0.6.2 1725.600 1.740 222.746
1822.628 1.076 236.612
1591.916 0.792 201.286
0.6.3 1570.628 0.444 197.503
1577.364 0.636 198.558
1569.276 0.664 197.343
0.6.4 1614.984 1.196 205.332
1653.092 0.772 210.690
1688.656 0.584 215.723
nthread = 4
xgboost user system elapsed
0.4.3 403.780 0.168 101.074
407.468 0.276 102.014
406.576 0.272 101.821
0.4.4 407.160 0.208 101.904
411.816 0.196 103.105
403.968 0.156 101.117
0.6.0 989.728 1.448 247.938
1022.488 1.228 256.089
1000.572 0.848 250.482
0.6.2 975.492 0.856 244.246
963.076 0.304 240.976
960.272 0.312 240.273
0.6.3 955.372 0.300 239.016
951.704 0.276 238.093
955.288 0.372 239.042
0.6.4 984.248 0.392 246.288
1000.532 0.512 250.430
1002.920 0.664 251.060
nthread = 2
xgboost user system elapsed
0.4.3 372.436 0.288 186.365
371.176 0.520 185.841
370.948 0.300 185.659
0.4.4 370.364 0.112 185.235
367.212 0.084 183.659
368.008 0.056 184.017
0.6.0 905.196 0.188 452.740
901.268 0.260 450.790
912.220 0.332 456.431
0.6.2 911.660 0.480 456.299
909.212 0.400 454.944
923.560 0.572 462.409
0.6.3 915.692 0.708 458.546
915.380 0.488 458.135
912.880 0.428 456.947
0.6.4 894.456 0.296 447.414
898.240 0.088 449.218
902.904 0.196 451.674
nthread = 1
xgboost user system elapsed
0.4.3 336.664 0.112 336.508
336.468 0.184 336.331
326.532 0.060 326.457
0.4.4 331.836 0.264 331.925
338.108 0.928 338.887
338.040 0.076 337.935
0.6.0 856.844 0.336 856.973
857.880 0.780 858.603
856.348 0.684 857.017
0.6.2 854.888 0.480 855.295
854.688 0.156 854.716
850.516 0.100 850.725
0.6.3 849.796 0.492 850.107
842.792 0.048 842.709
846.576 0.296 846.734
0.6.4 841.324 0.056 841.157
848.792 0.288 848.899
849.016 0.132 848.922
environment:
hardware: i7-6700K CPU @ 4.00GHz; 32GB ddr3
software: Ubuntu 16.04.1 LTS 64bit; R version 3.3.2
During training, the xgboost CPU usage reported by top seems to be very close to optimal and independent of the xgboost version. Depending on the thread count, xgboost takes approximately 100%, 200%, 400%, and 795% CPU for 1, 2, 4, and 8 threads respectively.
I installed an updated version of xgboost 0.6 in Python 3.4 and version 0.4 in Python 2.7. Judging from the running performance of the same code, version 0.6 is much slower than 0.4. Has anyone else run into the same situation? Is it really the case that version 0.6 is slower than 0.4, or is something wrong with my installation?
Running result (version 0.6):
(1,199) (2,152) (3,100) (4,76)
Running result (version 0.4):
(1, 115) (2, 62) (3, 46) (4, 40)
The code I used to run is as follows: