LightGBM in parallel: demo results (and with xgboost)

Laurae2 commented 5 years ago

I just ran this: https://github.com/Laurae2/ml-perf/issues/5#issuecomment-491969652

If you want to see the numbers, skip the conclusions below.

Conclusions for our scenario, CPU:

LightGBM CPU is about 1/3 faster than xgboost CPU
xgboost might use less RAM bandwidth than LightGBM, because with LightGBM hyperthreaded cores are not providing additional performance
LightGBM maintains correct efficiency for "small" data with too many threads, while xgboost literally takes forever to train a single model with a bad huge number of threads (50x slower, still probably way faster than Spark)
Multiprocessing scales nearly linearly (embarassingly/perfect parallel jobs), multithreading scaling is significantly poorer (way harder to parallelize)

Conclusions for our scenario, GPU:

LightGBM GPU is way faster than xgboost GPU (3x speed)
LightGBM GPU uses significantly less RAM than xgboost GPU (due to high dimensional sparse data), allowing more workers per GPU on LightGBM (up to 68 in LightGBM, only 4 in xgboost)
LightGBM uses few % GPU while xgboost was using 100% GPU (due to high dimensional sparse data)
Subcribing multiple jobs on GPU, when not the GPU is not fully busy, boosts the performance nearly linearly (note that there are parts of the timings which fully use CPU only (and forced singlethreaded), those are not easily compressible by adding "more threads")
More GPUs means more linear scaling
With LightGBM, you are likely to run out of cores/threads and more likely to hit R limits on communication sockets before you fully saturate the GPUs unless you use a mid-upper tier server (30+ physical cores) where you pack up many Quadro or Volta GPUs (you could fit 544 LightGBM jobs in a 32GB V100...)

General conclusion:

Don't speed up a specific computation to make it run as fast as possible... others will take a toll unless you know your critical path in the graph of the work thrown at the compute server
Throwing stuff in parallel (multiprocessing) while maintaining multithreading as low as possible (per process) yields the highest overall performance while yielding the worst single process performance - this seems counter intuitive but remember: "many lanes going slow will finish faster than one lane going fast"
Make sure you got enough RAM, otherwise your jobs will interrupt - this requires intensive testing to ensure RAM constraint is controlled appropriately

For information, I use the following hardware:

2x Xeon Gold 6154 (36 cores / 72 threads, 3.7 GHz all turbo)
4x 64GB RAM 2666 MHz (80 GBps)
4x NVIDIA Quadro P1000

Baselines:

xgboost CPU : 1 CPU thread model throughput = (11.389 x 25 + 11.383 x 50) / 75 = 11.385s
xgboost GPU : 1 GPU thread model throughput = 20.441s
LightGBM CPU : 1 CPU thread model throughput = (6.539 x 100 + 6.502 x 250) / 350 = 6.513s
LightGBM GPU : 1 GPU thread model throughput = 6.769s

For reference:

Parallel threads = processes/threads used in parallel to run R (multiprocessing through sockets) Model threads = threads used to run xgboost (multithreading) Parallel GPUs = number of GPUs used in parallel processes/threads in R Parallel GPU threads = number of processes running in a single GPU Models = number of models to train in total Seconds / Model = average throughput for 1 model, in seconds Boost vs Baseline = your performance gain if you were to do the mentioned row vs doing only 1 CPU (or 1 GPU if GPU) process/thread for your model

LightGBM CPU:

Run	Parallel Threads	Model Threads	Models	Seconds / Model	Boost vs Baseline
20	1	1	100	6.539	~1x
21	9	1	250	0.760	8.57x
22	18	1	500	0.400	16.28x
23	35	1	1000	0.252	25.85x
24	70	1	2500	0.295	22.08x
25	1	1	250	6.502	~1x
26	1	9	250	2.315	2.81x
27	1	18	250	2.269	2.87x
28	1	35	250	2.485	2.62x
29	1	70	250	3.051	2.13x

LightGBM GPU:

Run	Parallel Threads	Model Threads	Parallel GPUs	GPU Threads	Models	Seconds / Model	Boost vs Baseline
1	1	1	1	1	50	6.769	~1x
2	2	1	2	1	100	3.481	1.94x
3	3	1	3	1	250	2.354	2.88x
4	4	1	4	1	500	1.790	3.78x
5	4	1	1	4	100	2.166	3.13x
6	8	1	2	4	250	1.121	6.04x
7	12	1	3	4	500	0.772	8.77x
8	16	1	4	4	1000	0.586	11.55x
9	9	1	1	9	250	1.298	5.21x
10	18	1	2	9	500	0.709	9.55x
11	27	1	3	9	1000	0.496	13.65x
12	36	1	4	9	2500	0.400	16.92x
13	18	1	1	18	500	1.200	5.64x
14	36	1	2	18	1000	0.633	10.69x
15	54	1	3	18	2500	0.464	14.59x
16	72	1	4	18	5000	0.431	15.71x
17	35	1	1	35	1000	1.194	5.67x
18	35	1	2	35	2500	0.632	10.71x
19	58	1	1	58	2500	1.185	5.71x

I also refreshed xgboost hist results (re-ran them).

xgboost CPU:

Run	Parallel Threads	Model Threads	Models	Seconds / Model	Boost vs Baseline
9	1	1	25	11.389	~1x
10	9	1	50	1.456	7.82x
11	18	1	100	0.782	14.56x
12	35	1	250	0.489	23.28x
13	70	1	500	0.428	26.60x
14	1	1	50	11.383	~1x
15	1	9	50	6.565	1.73x
16	1	18	50	6.481	1.76x
17	1	35	50	24.601	0.46x
18	1	70	50	165.947	0.07x

xgboost GPU:

Run	Parallel Threads	Model Threads	Parallel GPUs	GPU Threads	Models	Seconds / Model	Boost vs Baseline
1	1	1	1	1	25	20.441	~1x
2	2	1	2	1	50	10.639	1.92x
3	3	1	3	1	100	6.978	2.93x
4	4	1	4	1	250	5.176	3.95x
5	4	1	1	4	50	20.556	0.99x
6	8	1	2	4	100	10.501	1.95x
7	12	1	3	4	250	6.914	2.96x
8	16	1	4	4	500	5.295	3.86x

szilard commented 5 years ago

I will include this (simplified results for xgboost CPU) in my talks (with credit to @Laurae2):

Parallel Threads	Model Threads	Models	Seconds / Model
1	1	25	11.39
9	1	50	1.46
18	1	100	0.78
35	1	250	0.49
70	1	500	0.43
1	1	50	11.4
1	9	50	6.6
1	18	50	6.6
1	35	50	25
1	70	50	165

(for easy ref: 2 socket system with 18+18HT cores each socket, total 72 cores; 0.1m dataset; 500 trees, depth 6, learn rate 0.05)

szilard commented 4 years ago

I will include this (simplified results for xgboost CPU) in my talks (with credit to @Laurae2):

Models Same Time	Threads per Model	Models	Seconds / Model
1	1	25	11.39
9	1	50	1.46
18	1	100	0.78
35	1	250	0.49
70	1	500	0.43
1	1	50	11.4
1	9	50	6.6
1	18	50	6.6
1	35	50	25
1	70	50	165

(for easy ref: 2 socket system with 18+18HT cores each socket, total 72 cores; 0.1m dataset; 500 trees, depth 6, learn rate 0.05)

Laurae2 / ml-perf

LightGBM in parallel: demo results (and with xgboost) #6