Thank you. This is very interesting. OMP should only affect server performance (and the workers' local CPU reduction, which is used much less). We'll study this.
I suspect this can also help MXNet native PS, which also heavily relies on OpenMP.
@vycezhong Is the 25% surge you have seen based on your compression PR or on the non-compression case? I thought our server implementation already achieves 100Gbps summation throughput, so it is at least faster than the NIC and won't be a bottleneck. Maybe I was wrong..
@bobzhuyb It's based on my compression PR, but I disabled the compression, so it should be no different from master. The task is training ImageNet and the model is resnet50_v2.
That also holds true in the compression case: PULL previously took the majority of the time whenever compression was enabled.
@ymjiang Can you reproduce?
@vycezhong ResNet-50 performance has always been close to linear speedup in our environment. I can't understand how we could get 25% faster. Maybe it's due to some CPU scheduling issue in VMs (are you using VMs)?
It is unlikely that we would get a 25% improvement on ResNet-50, since it is already very close to optimal. @vycezhong Can you provide a reproducible tutorial using any of our public examples?
@bobzhuyb I did not use a VM or Docker; it runs natively on Ubuntu 18.04. I will check whether it achieves linear scaling in my setting. If not, there may be some problems I haven't noticed.
@ymjiang I use the script from GluonCV with some changes to adapt it to BytePS: https://github.com/vycezhong/byteps/blob/gradient_compression/example/mxnet/train_gluon_imagenet_byteps_gc.py
Scheduler (on server node 0):

```bash
export DMLC_NUM_WORKER=4
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=4
export DMLC_PS_ROOT_URI=xxx   # the scheduler IP
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
export PS_VERBOSE=2
```

Servers:

```bash
export DMLC_NUM_WORKER=4
export DMLC_ROLE=server
export DMLC_NUM_SERVER=4
export DMLC_PS_ROOT_URI=xxx   # the scheduler IP
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
export OMP_NUM_THREADS=4
export OMP_WAIT_POLICY=PASSIVE # This is the flag that makes a difference.
```

Workers:

```bash
export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export DMLC_WORKER_ID=0       # from 0-3 for each worker
export DMLC_NUM_WORKER=4
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=4
export DMLC_PS_ROOT_URI=xxx   # the scheduler IP
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
export BYTEPS_LOG_LEVEL=WARNING
export OMP_NUM_THREADS=4
export OMP_WAIT_POLICY=PASSIVE # This is the flag that makes a difference.
export BYTEPS_TRACE_ON=1
export BYTEPS_TRACE_END_STEP=110
export BYTEPS_TRACE_START_STEP=100
export BYTEPS_TRACE_DIR=./traces
```
Scheduler:

```bash
nohup bpslaunch 2>&1 &
```

Servers:

```bash
nohup bpslaunch >>/dev/null 2>&1 &
```

Workers:

```bash
nohup bpslaunch python3 ~/repos/byteps/example/mxnet/train_gluon_imagenet_byteps_gc.py --model resnet50_v2 --mode hybrid --rec-train ~/data/ILSVRC2012/train.rec --rec-train-idx ~/data/ILSVRC2012/train.idx --rec-val ~/data/ILSVRC2012/val.rec --rec-val-idx ~/data/ILSVRC2012/val.idx --use-rec --batch-size 64 --num-gpus 1 --num-epochs 120 -j 4 --warmup-epochs 5 --logging-file train_imagenet_exp0.log >>/dev/null 2>&1 &
```
Note that I do not enable compression in the above command.
[Chart: throughput with vs. without export OMP_WAIT_POLICY=PASSIVE]
P.S. I have no idea why the first run's accuracy did not converge to 76%...
One p3.16xlarge can achieve ~2800 img/s locally, so 4 x p3.16xlarge should run at ~11200 img/s. Even 8478 img/s is only 75.7% scaling efficiency. That seems impossible... Am I doing something wrong?
script: https://gluon-cv.mxnet.io/_downloads/6ecf3a8b8036c8af2c65a18f473f1acb/train_imagenet.py
Command:

```bash
python train_imagenet.py --model resnet50_v2 --mode hybrid --rec-train ~/data/ILSVRC2012/train.rec --rec-train-idx ~/data/ILSVRC2012/train.idx --rec-val ~/data/ILSVRC2012/val.rec --rec-val-idx ~/data/ILSVRC2012/val.idx --use-rec --batch-size 64 --num-gpus 8 --num-epochs 120 -j 4 --warmup-epochs 5 --logging-file train_imagenet_exp.log
```
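A quick sanity check of the 75.7% number above (just the arithmetic, using the ~2800 img/s single-machine throughput):

```bash
# scaling efficiency = measured throughput / (num_machines * single-machine throughput)
python3 -c "print(f'{8478 / (4 * 2800) * 100:.1f}%')"  # -> 75.7%
```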
Maybe it is due to the problem you mentioned in the doc: I should launch more servers to saturate the 25Gbps network. But I am a bit worried that the CPUs may not be enough. Nevertheless, I will try it.
Thanks for the results. It's reasonable. We said ~100% scaling efficiency, but that was on machines with a 100Gbps RDMA network (similar to p3dn.24xlarge); p3.16xlarge only has a 25Gbps traditional network. Also, BytePS's TCP implementation was mostly inherited from MXNet native PS, which also needs some optimization.
It's still an interesting observation. Maybe we should suggest users use this env setting, since I don't see any harm in it.
BTW: you said you did not use a VM, but you are actually running on p3.16xlarge, which is a VM.
Thanks for your explanation and for correcting my mistake! @bobzhuyb We are happy that gradient compression does have some suitable scenarios. 😃
@vycezhong Our benchmarks on 100Gbps RDMA showed negligible improvement. Perhaps you can try launching more TCP servers simultaneously.
@ymjiang Thanks for your results. I will try it soon.
@bobzhuyb @ymjiang We conducted an experiment with the same settings mentioned above. We did not start measuring until the speed stabilized, because it is unstable at the beginning of training. The speed is lower than before; we speculate this is because new machines were launched for this experiment, leading to a different network topology. Anyway, there are some findings.
The first trick is setting OMP_WAIT_POLICY=PASSIVE. The second is launching more servers, as you suggested: we launch one extra server process on each server node (a rough launch sketch is at the end of this comment). As the table shows, both tricks improve performance.
| Setting | Speed (images/s) | Performance Boost |
|---|---|---|
| Baseline | 4337 | - |
| +PASSIVE | 6057 | +40.0% |
| +Extra servers | 4578 | +5.6% |
| +Both | 6479 | +49.4% |
But even with both tricks, it is still far from linear scaling, leaving room for gradient compression.
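For completeness, the extra server on each server node was started roughly like this. This is a sketch rather than the exact script; my understanding is that DMLC_NUM_SERVER counts server processes rather than nodes, so it has to be raised to 8 on the scheduler, servers, and workers alike.

```bash
# On each of the 4 server nodes: same env as the "Servers" block above,
# except there are now 8 server processes in total.
export DMLC_NUM_SERVER=8
export OMP_NUM_THREADS=4
export OMP_WAIT_POLICY=PASSIVE

# Start two server instances instead of one; each registers itself
# with the scheduler independently.
nohup bpslaunch >>/dev/null 2>&1 &
nohup bpslaunch >>/dev/null 2>&1 &
```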
I happened to read an article about openmp yesterday. https://zhuanlan.zhihu.com/p/118604153
I followed the suggestions in the article and set
OMP_WAIT_POLICY=PASSIVE
for both workers and servers. and the throughput surged around 25%. the profile shows this is because a huge reduction in time taken by PULL operation. so i suppose the flag is beneficial to servers. i guess it eases the contention of cpus between server's threads and openmp's threads. the setting is 4 x p3.16xlarge as workers and 4 x c5dn.xlarge as servers. i wonder whether it works in other cases.
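If anyone wants to check the contention hypothesis, one rough way (just a sketch I have not run systematically; it assumes sysstat's pidstat is installed and that the server process can be found via pgrep -f bpslaunch) is to watch per-thread CPU usage of the server with and without the flag:

```bash
# Sample per-thread CPU usage of the BytePS server for 10 seconds.
# Without OMP_WAIT_POLICY=PASSIVE, the OpenMP threads tend to keep spinning
# between summation bursts and show near-100% CPU; with PASSIVE they should
# drop close to idle, leaving more cores for the networking threads.
pidstat -t -p "$(pgrep -f bpslaunch | head -n 1)" 1 10
```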