dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Does Xgboost version 0.6 run slower than 0.4 in python? #1689

Closed kendu605 closed 6 years ago

kendu605 commented 7 years ago

I installed the updated version of xgboost (0.6) in Python 3.4 and version 0.4 in Python 2.7. Running the same code, version 0.6 is much slower than 0.4. Has anyone else run into the same situation? Is version 0.6 really slower than 0.4, or is something wrong with my installation?

Running result (version 0.6), printed as (nthread, elapsed seconds):

(1,199) (2,152) (3,100) (4,76)

Running result (version 0.4), printed as (nthread, elapsed seconds):

(1, 115) (2, 62) (3, 46) (4, 40)

The code I ran is as follows:

from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
import time

data = read_csv('train.csv')  # Kaggle Otto competition, train.csv
dataset = data.values
X = dataset[:,0:94]
y = dataset[:,94]
label_encoded_y = LabelEncoder().fit_transform(y)
results = []
num_threads = [1, 2, 3, 4]
for n in num_threads:
    start = time.time()
    model = XGBClassifier(nthread=n)
    model.fit(X, label_encoded_y)
    elapsed = time.time() - start
    print(n, elapsed)
phunterlau commented 7 years ago

@kendu605 can you tell us the operating system, the compiler, and the CPU usage (from top) while training?
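
For anyone else making the same comparison, here is a minimal sketch (not part of xgboost; just standard-library calls plus the package's version attribute) that gathers the environment details being asked for. The CPU usage itself still has to be read from top or Task Manager while the model is training:

import platform
import sys

import xgboost

# Print the environment details that are useful alongside a timing report.
print("xgboost version :", xgboost.__version__)
print("python version  :", sys.version.split()[0])
print("os / platform   :", platform.platform())
print("machine / cpu   :", platform.machine(), platform.processor())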

kendu605 commented 7 years ago

@phunterlau, I use Windows 7 64-bit and used Visual Studio 2013 to compile xgboost; my PC has an 8-core CPU.

phunterlau commented 7 years ago

@kendu605 what is your CPU usage while training the model?

kendu605 commented 7 years ago

@phunterlau, the CPU usage was around (20%, 35%, 48%, 60%) as the 'nthread' parameter changed from 1 to 4.

JohnStott commented 7 years ago

kendu605, I find that compiling xgboost with the MinGW64 compiler creates a DLL that is about 50% faster than when I compile directly from MS Visual Studio. Could this be what you are seeing? Could your initial 0.4 build have been produced with a different compiler?

I tried different settings in Visual Studio (Project -> Properties -> Configuration Properties -> C/C++), such as:

1) ensuring I was compiling in "Release" mode;
2) Optimization: set "Favor Size Or Speed" to "Speed";
3) Code Generation: played around with the "Floating Point Model" settings; unfortunately this made little difference;
4) Enable Enhanced Instruction Set: this is where I think my problem may lie. My CPUs support up to SSE4.1, but the maximum MS allows is SSE2. Unfortunately there is no AVX on my CPUs, so I can't use those settings.

I am wondering whether MinGW64 does support SSE4.1 and that is why it runs faster when compiled with it? The only alternative reason I can think of is that the MinGW compiler is simply far better than the MS compiler (for XGBoost). If that is true, I wonder how other compilers would benchmark. Unfortunately I can't really find much info on Google about compiler benchmarks; the consensus opinion (from forums) seems to be, anecdotally, that they should all have similar performance!

Perhaps you could try points 1-4 and see if they make any difference on your machine? And also a MinGW build. I'd be very interested in your results.

kendu605 commented 7 years ago

@JohnStott, thanks a lot for your detailed explanation and for trying so many tests. I used VS 2013 (win64) to compile both xgboost 0.4 and 0.6. I have tried the suggestions you provided, but got the same results as I posted before. From your description, I think it may be that the VS 2013 compiler can't support all the features of xgboost 0.6, so it is slower than 0.4. I will try to re-compile it with VS 2015 or MinGW64 and see if I can get better performance out of version 0.6. Thanks again for your advice.

JohnStott commented 7 years ago

I forgot to mention that I was using VS 2015 when I tried the above (and Python 2.7 - I've yet to look at Python 3+). I would suggest trying MinGW64 first, as you'll probably get the same results in VS 2015!?

kendu605 commented 7 years ago

@JohnStott, I just built xgboost 0.6 with VS 2015 using the settings you proposed, but unfortunately I got the same result when running the example code with version 0.6 on Python 3.4: it is still slower than running version 0.4 on Python 3.4. I have tried many times to compile it with MinGW64, but they all failed. I use TDM-MinGW, with the commands provided on the xgboost official website: "alias make='mingw32-make'; cp make/mingw64.mk config.mk; make -j4"

JohnStott commented 7 years ago

When you say it fails, I assume you mean there are errors during compilation? Have you tried compiling the other libraries first? i.e., this is what I use:

cd dmlc-core
make -j4
cd ../rabit
make lib/librabit_empty.a -j4
cd ..
cp make/mingw64.mk config.mk
make -j4

I also use this version: https://sourceforge.net/projects/mingw-w64/

(when installing, choose "Architecture": x86_64, assuming you're using a 64-bit machine)

kendu605 commented 7 years ago

I have downloaded the new mingw-w64, but got the errors below when compiling with cd dmlc-core; make -j4:

g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o io.o src/io.cc
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o data.o src/data.cc
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o config.o src/config.cc

src/io.cc:1:0: warning: -fPIC ignored for target (all code is position independent) [enabled by default]
src/data.cc:1:0: warning: -fPIC ignored for target (all code is position independent) [enabled by default]
src/config.cc:1:0: warning: -fPIC ignored for target (all code is position independent) [enabled by default]
In file included from src/data/././text_parser.h:11:0,
                 from src/data/./libsvm_parser.h:13,
                 from src/data/disk_row_iter.h:19,
                 from src/data.cc:12:
include/dmlc/omp.h:9:17: fatal error: omp.h: No such file or directory
compilation terminated.
make: *** [Makefile:83: data.o] Error 1
make: *** Waiting for unfinished jobs....
In file included from src/io/cached_input_split.h:16:0,
                 from src/io.cc:13:
include/dmlc/threadediter.h:210:3: error: 'thread' in namespace 'std' does not name a type
include/dmlc/threadediter.h:216:3: error: 'mutex' in namespace 'std' does not name a type
include/dmlc/threadediter.h:222:3: error: 'condition_variable' in namespace 'std' does not name a type
include/dmlc/threadediter.h:224:3: error: 'condition_variable' in namespace 'std' does not name a type
include/dmlc/threadediter.h: In constructor 'dmlc::ThreadedIter::ThreadedIter(size_t)':
include/dmlc/threadediter.h:82:9: error: class 'dmlc::ThreadedIter' does not have any field named 'producer_thread_'
include/dmlc/threadediter.h: In member function 'virtual void dmlc::ThreadedIter::BeforeFirst()':
include/dmlc/threadediter.h:168:5: error: 'unique_lock' is not a member of 'std'
include/dmlc/threadediter.h:168:22: error: 'mutex' is not a member of 'std'
include/dmlc/threadediter.h:168:39: error: 'mutex_' was not declared in this scope
include/dmlc/threadediter.h:168:45: error: there are no arguments to 'lock' that depend on a template parameter, so a declaration of 'lock' must be available [-fpermissive]
include/dmlc/threadediter.h:168:45: note: (if you use '-fpermissive', G++ will accept your code, but allowing the use of an undeclared name is deprecated)
include/dmlc/threadediter.h:178:7: error: 'producer_cond_' was not declared in this scope
include/dmlc/threadediter.h:182:5: error: 'consumer_cond_' was not declared in this scope
include/dmlc/threadediter.h:182:25: error: 'lock' was not declared in this scope
include/dmlc/threadediter.h:189:17: error: 'producer_cond_' was not declared in this scope
include/dmlc/threadediter.h: In member function 'void dmlc::ThreadedIter::Destroy()':
include/dmlc/threadediter.h:236:7: error: 'producer_thread_' was not declared in this scope
include/dmlc/threadediter.h:239:7: error: 'lock_guard' is not a member of 'std'
include/dmlc/threadediter.h:239:23: error: 'mutex' is not a member of 'std'
include/dmlc/threadediter.h:239:40: error: 'mutex_' was not declared in this scope
include/dmlc/threadediter.h:239:46: error: there are no arguments to 'lock' that depend on a template parameter, so a declaration of 'lock' must be available [-fpermissive]
include/dmlc/threadediter.h:243:9: error: 'producer_cond_' was not declared in this scope
include/dmlc/threadediter.h: In lambda function:
include/dmlc/threadediter.h:295:9: error: 'unique_lock' is not a member of 'std'
include/dmlc/threadediter.h:295:26: error: 'mutex' is not a member of 'std'
include/dmlc/threadediter.h:295:43: error: 'mutex_' was not declared in this scope
include/dmlc/threadediter.h:295:49: error: there are no arguments to 'lock' that depend on a template parameter, so a declaration of 'lock' must be available [-fpermissive]
include/dmlc/threadediter.h:297:9: error: 'producer_cond_' was not declared in this scope
include/dmlc/threadediter.h:297:29: error: 'lock' was not declared in this scope
include/dmlc/threadediter.h: In lambda function:
include/dmlc/threadediter.h:299:27: internal compiler error: Segmentation fault
Please submit a full bug report, with preprocessed source if appropriate.
See http://gcc.gnu.org/bugs.html for instructions.
make: *** [Makefile:83: io.o] Error 1

JohnStott commented 7 years ago

include/dmlc/omp.h:9:17: fatal error: omp.h: No such file or directory
compilation terminated.

^seems to be the culprit!? Maybe try again with a totally fresh clone of xgboost?

kendu605 commented 7 years ago

I cloned the new source, but it still didn't help. I don't know what's wrong.

JohnStott commented 7 years ago

It seems that you are still using an (old?) Makefile, because if you look towards the bottom of:

https://github.com/dmlc/dmlc-core/blob/f35f14f30835af238257b979cc1fac3e41ff3291/Makefile

you will see:

line_split.o: src/io/line_split.cc
recordio_split.o: src/io/recordio_split.cc
input_split_base.o: src/io/input_split_base.cc
hdfs_filesys.o: src/io/hdfs_filesys.cc
s3_filesys.o: src/io/s3_filesys.cc
azure_filesys.o: src/io/azure_filesys.cc
local_filesys.o: src/io/local_filesys.cc
io.o: src/io.cc
data.o: src/data.cc
recordio.o: src/recordio.cc
config.o: src/config.cc

when I compile I get:

$ mingw32-make -j4
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o line_split.o src/io/line_split.cc
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o recordio_split.o src/io/recordio_split.cc
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o input_split_base.o src/io/input_split_base.cc
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o io.o src/io.cc
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o local_filesys.o src/io/local_filesys.cc
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o data.o src/data.cc
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o recordio.o src/recordio.cc
g++ -c -O3 -Wall -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=0 -DDMLC_USE_S3=0 -DDMLC_USE_AZURE=0 -msse2 -o config.o src/config.cc
ar cr libdmlc.a line_split.o recordio_split.o input_split_base.o io.o local_filesys.o data.o recordio.o config.o

Notice the order is the same!

Yours starts with src/io.cc, which doesn't adhere to the Makefile.

That would be my guess anyway.

JohnStott commented 7 years ago

Actually, I only checked the first couple of lines; it seems the order of mine is not exactly the same either. Sorry, I'm not sure? Maybe someone else can help?

Maybe check that your Makefile is identical to the one at the link above, just to be sure?

JohnStott commented 7 years ago

...just had another quick look. At line 33 of the Makefile (see the link above) you will see:

OBJ=line_split.o recordio_split.o input_split_base.o io.o local_filesys.o data.o recordio.o config.o

which is the same order as my compile output; I think I understand the Makefile now. I assume your line 33 is different, hence your bad output!

kendu605 commented 7 years ago

Hi JohnStott, I copied the Makefile from the source and tried to make again, but still nothing better happened. Besides MinGW64, do I need to install other software to compile it successfully? As stated in my original post, I can compile xgboost successfully in both version 0.6 and 0.4 with VS 2013 or VS 2015; the confusing thing is that version 0.6 runs much slower than 0.4.

JohnStott commented 7 years ago

I originally thought that there is perhaps something in version 0.6 that VS doesn't like, or that at least hurts efficiency somewhere, given the speed difference I observed between MinGW64 and VS compilations even after manually optimising the VS settings.

I imagine this is pretty frustrating! If I were you and you haven't already tried the following, I'd give it a go:

1) rename the root directory of each of your existing xgboost directories;
2) re-download xgboost and place it into a directory with a new, never-used-before name;
3) try compiling as discussed previously.

If you get different errors, then I'd try re-downloading xgboost and placing it into a directory with the same name as the very first xgboost you ever used and compiled.

Basically, all of the above is to check that new clones don't end up pointing to old files (in case settings get stored in the registry etc.).

dana33 commented 7 years ago

I see a similar difference between versions building XGBoost models in R on Windows.

Details:
Classification model from a training set with 145 categories, 11K rows, 260 continuous features.
Model contains 100 trees.
Time to build with xgboost 0.4-3 on R 3.1.2: 10 minutes.
Time to build with xgboost 0.6-4 on R 3.3.2: 51 minutes.

I did not build the xgboost package myself but just installed the download from CRAN.

Has anyone yet come up with a good explanation for this or a workaround other than simply using the older version?

khotilov commented 7 years ago

I can confirm a similar speed degradation (~10x slowdown) when comparing the recent xgboost to 0.4-3 (on both Windows and Linux) using a simulated dataset with the same parameters as @dana33 has reported. Will need to take a deeper look...

tqchen commented 7 years ago

One difference might be the switch of the default missing value from 0 to NA, which makes the 0 entries get enumerated when the matrix is passed in as a dense matrix. Try the following to confirm:

dmat = xgb.DMatrix(data, missing=0)
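
For the Python script in the original report, a roughly equivalent experiment would be to pass missing=0 explicitly when building the booster input. This is a minimal sketch, reusing the X and label_encoded_y variables from the earlier script and assuming the Otto data's 9 classes; the other parameter values are arbitrary:

import time

import xgboost as xgb

# Treat 0 as the missing value so dense zeros are not enumerated as real
# feature values (mirrors the xgb.DMatrix(data, missing=0) suggestion above).
dtrain = xgb.DMatrix(X, label=label_encoded_y, missing=0)

params = {"objective": "multi:softprob", "num_class": 9, "nthread": 4}

start = time.time()
bst = xgb.train(params, dtrain, num_boost_round=100)
print("elapsed:", time.time() - start)

If the installed sklearn wrapper exposes the argument, passing missing=0 to XGBClassifier in the original loop should exercise the same code path.
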
dana33 commented 7 years ago

@tqchen, thank you for the suggestion. I tried it, but it had no effect on the speed.

In case it matters, here is how I construct my data matrix:

Before: x <- xgb.DMatrix(sparse.model.matrix(~.-1, data=xy[-1]))

After: x <- xgb.DMatrix(sparse.model.matrix(~.-1, data=xy[-1]), missing=0)

(The data come in as a dense data.frame xy containing both the features and the response.)

The time to build the model in R 3.3.2 with xgboost 0.6-4 is the same in both cases, and still much slower compared to R 3.1.2 with xgboost 0.4-3.

P.S. The reason for the call to sparse.model.matrix is to do one-hot encoding of any factors in the input data. In this particular case, all features are continuous, so I was able to try the following:

x <- xgb.DMatrix(data=as.matrix(xy[-1]), missing=0)

Unfortunately, this still did not improve the speed.
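
For reference, a Python analogue of that one-hot step is sketched below; the DataFrame here is a hypothetical stand-in for the R data.frame xy (response in the first column), not data from this thread:

import pandas as pd
import xgboost as xgb

# Tiny hypothetical frame: response in the first column, then one numeric
# and one categorical feature column.
df = pd.DataFrame({"target": [0, 1, 0, 1],
                   "f_num": [0.1, 0.0, 2.3, 1.5],
                   "f_cat": ["a", "b", "a", "c"]},
                  columns=["target", "f_num", "f_cat"])

features = pd.get_dummies(df.iloc[:, 1:])  # one-hot encode the factor column
dtrain = xgb.DMatrix(features.values, label=df["target"].values, missing=0)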

khotilov commented 7 years ago

@tqchen I think I see the issue: a large amount of time is spent in prediction cache updates at https://github.com/dmlc/xgboost/blob/9fb46e2c5efbb7ea7bb0cbb0f815dbdc9b720177/src/gbm/gbtree.cc#L475. This compounds to nclass^2 complexity, since it is done for each and every separately committed tree, and PredValue is called for all output groups within PredLoopSpecalize.
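
One way to see that effect from the Python side, without instrumenting the C++ code, is to time training on synthetic data while varying only the number of classes. A rough sketch under that assumption; the data shape is chosen only to mimic the roughly 11K x 260 case reported above, and the parameters are arbitrary:

import time

import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X = rng.rand(11000, 260)  # synthetic features, same order of magnitude as above

for num_class in (2, 10, 50, 145):
    y = rng.randint(num_class, size=X.shape[0])
    dtrain = xgb.DMatrix(X, label=y)
    params = {"objective": "multi:softprob", "num_class": num_class,
              "nthread": 4, "max_depth": 6}
    start = time.time()
    xgb.train(params, dtrain, num_boost_round=10)
    # If the per-tree cache update compounds with the class count, the elapsed
    # time should grow much faster than linearly in num_class.
    print(num_class, round(time.time() - start, 1))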

dana33 commented 7 years ago

@khotilov, thank you for the detective work. Do you know how this is done differently in version 0.4-3 to make it so much faster?

khotilov commented 7 years ago

@dana33 That caching mechanism was introduced in https://github.com/dmlc/xgboost/commit/ecec5f7959cbe37a14f0ef83c9736f9c2a9490dc#diff-36e32f8e52bbd405d8cc60e601c9ae41, which was way after 0.4-3.

tqchen commented 7 years ago

@khotilov Can you confirm this by doing a bit of timing around the UpdateCache, as well as the tree growing, between 0.4-3 and 0.6? The prediction cache was already in 0.4-3, except that it was updated lazily instead of eagerly.

khotilov commented 7 years ago

@tqchen I actually found it by narrowing down where the time was spent in the current code. E.g., in a single boosting iteration, building the trees took ~15 sec, and then ~50 sec was spent updating the cache for the training data.

stolorz commented 7 years ago

I have also noticed a performance drop on my machine between xgboost versions 0.4.x and 0.6.x. In case anyone finds it useful, and maybe it can shed some light on the problem, I'm sharing my results. Here is the R source code which I used:

require(xgboost)
require(data.table)
require(dplyr)

#data: https://www.kaggle.com/c/otto-group-product-classification-challenge/data
train.csv <- fread('data/otto-train.csv', header = T, stringsAsFactors = F)
 test.csv <- fread('data/otto-test.csv',  header = T, stringsAsFactors = F)

x <- train.csv %>% select(-id, -target) %>%  sapply(as.numeric) %>% as.matrix
y <- train.csv$target %>% factor() %>% as.integer() %>% as.matrix %>% -1

param <- list("objective" = "multi:softprob",
              "eval_metric" = "mlogloss",
              "num_class" =  length(unique(y)) )

ptm <- proc.time()
bst = xgboost(param=param, data = x, label = y, nrounds=500, verbose = 0, nthread=8)
training.time <- proc.time() - ptm

packageVersion("xgboost")
training.time

and timing results; three tests for each combination of xgboost version and thread count:

nthread = 8

xgboost    user   system   elapsed 
0.4.3   656.376    0.380    83.414 
        646.200    0.264    81.534 
        653.816    0.300    82.670

0.4.4   648.556    0.256    81.853 
        647.780    0.292    81.760
        665.864    0.548    84.708

0.6.0  1742.980    0.736   224.632 
       1743.568    0.804   223.994
       1781.148    1.636   230.141 

0.6.2  1725.600    1.740   222.746
       1822.628    1.076   236.612
       1591.916    0.792   201.286 

0.6.3  1570.628    0.444   197.503 
       1577.364    0.636   198.558 
       1569.276    0.664   197.343 

0.6.4  1614.984    1.196   205.332  
       1653.092    0.772   210.690 
       1688.656    0.584   215.723 

nthread = 4

xgboost    user   system   elapsed 
0.4.3   403.780    0.168   101.074 
        407.468    0.276   102.014 
        406.576    0.272   101.821 

0.4.4   407.160    0.208   101.904 
        411.816    0.196   103.105 
        403.968    0.156   101.117 

0.6.0   989.728    1.448   247.938 
       1022.488    1.228   256.089 
       1000.572    0.848   250.482 

0.6.2   975.492    0.856   244.246 
        963.076    0.304   240.976 
        960.272    0.312   240.273 

0.6.3   955.372    0.300   239.016 
        951.704    0.276   238.093 
        955.288    0.372   239.042 

0.6.4   984.248    0.392   246.288 
       1000.532    0.512   250.430 
       1002.920    0.664   251.060 

nthread = 2

xgboost    user   system   elapsed 
0.4.3   372.436    0.288   186.365 
        371.176    0.520   185.841 
        370.948    0.300   185.659 

0.4.4   370.364    0.112   185.235 
        367.212    0.084   183.659 
        368.008    0.056   184.017 

0.6.0   905.196    0.188   452.740 
        901.268    0.260   450.790 
        912.220    0.332   456.431 

0.6.2   911.660    0.480   456.299 
        909.212    0.400   454.944 
        923.560    0.572   462.409 

0.6.3   915.692    0.708   458.546 
        915.380    0.488   458.135 
        912.880    0.428   456.947

0.6.4   894.456    0.296   447.414 
        898.240    0.088   449.218 
        902.904    0.196   451.674 

nthread = 1

xgboost    user   system   elapsed 
0.4.3   336.664    0.112   336.508 
        336.468    0.184   336.331 
        326.532    0.060   326.457 

0.4.4   331.836    0.264   331.925 
        338.108    0.928   338.887 
        338.040    0.076   337.935 

0.6.0   856.844    0.336   856.973 
        857.880    0.780   858.603
        856.348    0.684   857.017 

0.6.2   854.888    0.480   855.295 
        854.688    0.156   854.716 
        850.516    0.100   850.725 

0.6.3   849.796    0.492   850.107 
        842.792    0.048   842.709 
        846.576    0.296   846.734 

0.6.4   841.324    0.056   841.157
        848.792    0.288   848.899 
        849.016    0.132   848.922 

environment:

hardware: i7-6700K CPU @ 4.00GHz; 32GB ddr3
software: Ubuntu 16.04.1 LTS 64bit; R version 3.3.2 

During training, the xgboost CPU usage reported by top seems to be very close to optimal and independent of the xgboost version. Depending on the thread count, xgboost uses approximately 100%, 200%, 400%, and 795% CPU for 1, 2, 4, and 8 threads respectively.