bytedance / byteps

A high performance and generic framework for distributed DNN training

one weird trick to reduce PULL time significantly #230

Closed jasperzhong closed 4 years ago

jasperzhong commented 4 years ago

I happened to read an article about OpenMP yesterday: https://zhuanlan.zhihu.com/p/118604153

I followed the suggestions in the article and set OMP_WAIT_POLICY=PASSIVE for both workers and servers, and the throughput surged by around 25%. The profile shows this comes from a huge reduction in the time taken by the PULL operation, so I suppose the flag mainly benefits the servers. I guess it eases the contention for CPUs between the server's threads and OpenMP's threads. The setup is 4 x p3.16xlarge as workers and 4 x c5dn.xlarge as servers. I wonder whether it works in other cases too.
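
For context, OMP_WAIT_POLICY=PASSIVE tells idle OpenMP threads to sleep or yield instead of spin-waiting, so they stop competing for cores with the server's other threads. A minimal sketch of the change (the full environment I used is posted further down in this thread):

# On every worker and server node, before launching BytePS as usual:
export OMP_WAIT_POLICY=PASSIVE   # idle OpenMP threads sleep instead of busy-waiting
export OMP_NUM_THREADS=4         # optional: bound the OpenMP pool on small server instances
nohup bpslaunch 2>&1 &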

bobzhuyb commented 4 years ago

Thank you. This is very interesting. OMP should only affect the server performance (and worker's local CPU reduce, which is much less used). We'll study this.

bobzhuyb commented 4 years ago

I suspect this can also help MXNet native PS, which also heavily relies on OpenMP.

bobzhuyb commented 4 years ago

@vycezhong Is the 25% surge you saw based on your compression PR or on the non-compression case? I thought our server implementation already achieves 100Gbps summation throughput, so it should be at least as fast as the NIC and never become the bottleneck. Maybe I was wrong...

jasperzhong commented 4 years ago

@bobzhuyb It's based on my compression PR, but I disabled the compression, so there should be no difference from master. The task is ImageNet training and the model is resnet50_v2.

jasperzhong commented 4 years ago

That also holds true in the compression case. Previously, PULL took the majority of the time whenever compression was enabled.

bobzhuyb commented 4 years ago

@ymjiang Can you reproduce?

@vycezhong ResNet-50 performance has always been close to linear speedup in our environment. I don't understand how we could get 25% faster. Maybe it's due to some CPU scheduling issue in VMs (are you using VMs)?

ymjiang commented 4 years ago

It is unlikely that we would get a 25% improvement on ResNet-50, since it is already very close to optimal. @vycezhong Can you provide a reproducible tutorial using any of our public examples?

jasperzhong commented 4 years ago

@bobzhuyb I did not use a VM or Docker; it runs natively on Ubuntu 18.04. I will check whether it achieves linear scaling in my setting. If not, there may be some problem I haven't noticed.

jasperzhong commented 4 years ago

@ymjiang I use the script from GluonCV with some changes to adapt it to BytePS: https://github.com/vycezhong/byteps/blob/gradient_compression/example/mxnet/train_gluon_imagenet_byteps_gc.py

HW settings:

HYPER-PARAMS:

ENV:

scheduler (in server node 0)

export DMLC_NUM_WORKER=4 
export DMLC_ROLE=scheduler 
export DMLC_NUM_SERVER=4
export DMLC_PS_ROOT_URI=xxx  # the scheduler IP 
export DMLC_PS_ROOT_PORT=1234  # the scheduler port
export PS_VERBOSE=2

servers

export DMLC_NUM_WORKER=4
export DMLC_ROLE=server  
export DMLC_NUM_SERVER=4
export DMLC_PS_ROOT_URI=xxx  # the scheduler IP 
export DMLC_PS_ROOT_PORT=1234  # the scheduler port
export OMP_NUM_THREADS=4 
export OMP_WAIT_POLICY=PASSIVE # This is the flag that makes a difference. 

workers

export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export DMLC_WORKER_ID=0 # from 0-3 for each worker  
export DMLC_NUM_WORKER=4 
export DMLC_ROLE=worker 
export DMLC_NUM_SERVER=4 
export DMLC_PS_ROOT_URI=xxx  # the scheduler IP 
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
export BYTEPS_LOG_LEVEL=WARNING
export OMP_NUM_THREADS=4
export OMP_WAIT_POLICY=PASSIVE # This is the flag that makes a difference. 
export BYTEPS_TRACE_ON=1
export BYTEPS_TRACE_END_STEP=110
export BYTEPS_TRACE_START_STEP=100
export BYTEPS_TRACE_DIR=./traces

command

scheduler

nohup bpslaunch 2>&1 &

servers

nohup bpslaunch >>/dev/null 2>&1 &

workers

nohup bpslaunch python3 ~/repos/byteps/example/mxnet/train_gluon_imagenet_byteps_gc.py --model resnet50_v2 --mode hybrid --rec-train ~/data/ILSVRC2012/train.rec --rec-train-idx ~/data/ILSVRC2012/train.idx --rec-val ~/data/ILSVRC2012/val.rec --rec-val-idx ~/data/ILSVRC2012/val.idx --use-rec --batch-size 64 --num-gpus 1 --num-epochs 120 -j 4 --warmup-epochs 5 --logging-file train_imagenet_exp0.log >>/dev/null 2>&1 &

Note that I do not enable compression in the above command.

Results

We compare runs with and without export OMP_WAIT_POLICY=PASSIVE.

Metric: throughput (images/s).


P.S. I have no idea why the first run's accuracy did not converge to 76%...

jasperzhong commented 4 years ago

One p3.16xlarge can achieve ~2800 img/s locally, so 4 x p3.16xlarge should reach ~11200 img/s. Even 8478 img/s is only 75.7% scaling efficiency. That can't be right... Am I doing something wrong?
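
For reference, the scaling-efficiency arithmetic above, just spelling out the numbers already quoted:

# scaling efficiency = measured aggregate throughput / (num_workers x single-node throughput)
echo "scale=4; 8478 / (4 * 2800)" | bc   # .7569 -> ~75.7%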


reproducible local training

script: https://gluon-cv.mxnet.io/_downloads/6ecf3a8b8036c8af2c65a18f473f1acb/train_imagenet.py

command

python train_imagenet.py --model resnet50_v2 --mode hybrid --rec-train ~/data/ILSVRC2012/train.rec --rec-train-idx ~/data/ILSVRC2012/train.idx --rec-val ~/data/ILSVRC2012/val.rec --rec-val-idx ~/data/ILSVRC2012/val.idx --use-rec --batch-size 64 --num-gpus 8 --num-epochs 120 -j 4 --warmup-epochs 5 --logging-file train_imagenet_exp.log
jasperzhong commented 4 years ago

Maybe it is due to the problem you mentioned in the doc: I should launch more servers to saturate the 25Gbps network. But I am a bit worried that there may not be enough CPU. Nevertheless, I will try it.
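
A sketch of what "more servers" might look like on each server node, assuming the usual ps-lite convention that DMLC_NUM_SERVER counts server processes rather than machines (so the new total would have to be exported on the scheduler and workers as well):

export DMLC_NUM_WORKER=4
export DMLC_NUM_SERVER=8         # was 4: now two server processes per node across 4 nodes
export DMLC_ROLE=server
export DMLC_PS_ROOT_URI=xxx      # the scheduler IP
export DMLC_PS_ROOT_PORT=1234    # the scheduler port
export OMP_WAIT_POLICY=PASSIVE

nohup bpslaunch >>/dev/null 2>&1 &   # first server instance
nohup bpslaunch >>/dev/null 2>&1 &   # second server instance on the same node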

bobzhuyb commented 4 years ago

Thanks for the results. They are reasonable. We said ~100% scaling efficiency, but that was on machines with a 100Gbps RDMA network (similar to p3dn.24xlarge); p3.16xlarge only has a 25Gbps traditional network. Also, BytePS's TCP implementation was mostly inherited from MXNet's native PS, which still needs some optimization.

It's still an interesting observation. Maybe we should suggest that users set this env variable, since I don't see any harm in it.

BTW: You said you did not use VM, but you are actually running in p3.16xlarge, which is a VM.

jasperzhong commented 4 years ago

Thanks for your explanation and for correcting my mistake! @bobzhuyb We are happy that gradient compression does have some suitable scenarios. 😃

ymjiang commented 4 years ago

@vycezhong Our benchmarks on 100Gbps RDMA showed negligible improvement. Perhaps you can try launching more TCP servers simultaneously.

jasperzhong commented 4 years ago

@ymjiang thanks for your results. I will try it soon.

jasperzhong commented 4 years ago

@bobzhuyb @ymjiang We conducted an experiment with the same settings mentioned above. We did not start measuring until the speed stabilized, because it fluctuates at the beginning of training. The speed is lower than before; we speculate that this is because new machines were launched for this experiment, leading to a different network topology. Anyway, there are some findings.

The first trick is setting OMP_WAIT_POLICY=PASSIVE. The second is launching more servers, as you suggested: we launch one extra server on each server node. As the table shows, both of them improve performance.

|                | Speed (images/s) | Performance Boost |
| -------------- | ---------------- | ----------------- |
| baseline       | 4337             | -                 |
| +PASSIVE       | 6057             | +40.0%            |
| +Extra Servers | 4578             | +5.6%             |
| +Both          | 6479             | +49.4%            |

But even with both tricks, we are still far from linear scaling, which leaves room for gradient compression.