apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Weights are not updated in distributed training #8077

Open zzningxp opened 7 years ago

zzningxp commented 7 years ago

Following https://mxnet.incubator.apache.org/how_to/multi_devices.html, I used the cifar10 dataset and example/image-classification/train_cifar10.py to train a resnet50 model with two worker nodes, and ran into the problem below:

My environment is not a typical one. The storage is a shared distributed/network file system (Lustre), and tasks are submitted through SLURM. I wrote a custom launcher that replaces the ssh command with a slurm command. The launch environment variables are:

export DMLC_ROLE=worker; 
export DMLC_PS_ROOT_PORT=9091; 
export DMLC_PS_ROOT_URI=x.x.x.x; 
export DMLC_NUM_SERVER=2; 
export DMLC_NUM_WORKER=2;
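
For reference, here is a minimal sketch of the kind of substitution such a custom launcher makes, assuming srun is used in place of the ssh command from dmlc_tracker/ssh.py; the hostnames and the training command are placeholders, and the scheduler is still assumed to be started by the tracker itself:

import os
import subprocess

# placeholders: the allocated compute nodes and the training command
hosts = ['node01', 'node02']
scheduler_ip = 'x.x.x.x'                  # host where the tracker/scheduler runs
train_cmd = ['python', 'train_cifar10.py', '--kv-store', 'dist_sync']

def launch(role, host):
    # srun should export this environment to the task it starts on <host>
    env = dict(os.environ,
               DMLC_ROLE=role,
               DMLC_PS_ROOT_URI=scheduler_ip,
               DMLC_PS_ROOT_PORT='9091',
               DMLC_NUM_SERVER=str(len(hosts)),
               DMLC_NUM_WORKER=str(len(hosts)))
    # with DMLC_ROLE=server the same script is expected to act as a parameter
    # server as soon as mxnet is imported, so the command can be reused
    return subprocess.Popen(['srun', '-N1', '-n1', '-w', host] + train_cmd, env=env)

procs = [launch(role, h) for h in hosts for role in ('server', 'worker')]
for p in procs:
    p.wait()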

1) When using --kv-store dist_sync, the log is:

INFO:root:Epoch[0] Batch [20] Speed: 26.81 samples/sec accuracy=0.109747
INFO:root:Epoch[0] Batch [20] Speed: 26.60 samples/sec accuracy=0.095238
INFO:root:Epoch[0] Batch [40] Speed: 28.01 samples/sec accuracy=0.108203
INFO:root:Epoch[0] Batch [40] Speed: 27.41 samples/sec accuracy=0.107031
INFO:root:Epoch[0] Batch [60] Speed: 27.13 samples/sec accuracy=0.090625
INFO:root:Epoch[0] Batch [60] Speed: 27.61 samples/sec accuracy=0.100781
INFO:root:Epoch[0] Batch [80] Speed: 28.49 samples/sec accuracy=0.091797
INFO:root:Epoch[0] Batch [80] Speed: 27.47 samples/sec accuracy=0.094922
INFO:root:Epoch[0] Batch [100] Speed: 28.61 samples/sec accuracy=0.104688
INFO:root:Epoch[0] Batch [100] Speed: 27.46 samples/sec accuracy=0.103125

After 32 epochs, the log is:

INFO:root:Epoch[32] Batch [120] Speed: 27.99 samples/sec    accuracy=0.105078
INFO:root:Epoch[32] Batch [60]  Speed: 27.51 samples/sec    accuracy=0.091016
INFO:root:Epoch[32] Batch [140] Speed: 26.91 samples/sec    accuracy=0.108203
INFO:root:Epoch[32] Batch [80]  Speed: 27.41 samples/sec    accuracy=0.091016
INFO:root:Epoch[32] Batch [160] Speed: 27.89 samples/sec    accuracy=0.100391
INFO:root:Epoch[32] Batch [100] Speed: 27.50 samples/sec    accuracy=0.105859
INFO:root:Epoch[32] Batch [180] Speed: 27.90 samples/sec    accuracy=0.092578
INFO:root:Epoch[32] Batch [120] Speed: 27.49 samples/sec    accuracy=0.102734
INFO:root:Epoch[32] Train-accuracy=0.088021
INFO:root:Epoch[32] Time cost=896.106

2) When --kv-store dist_sync is removed, the log is:

INFO:root:Epoch[0] Batch [20] Speed: 26.51 samples/sec accuracy=0.133185
INFO:root:Epoch[0] Batch [20] Speed: 26.84 samples/sec accuracy=0.133185
INFO:root:Epoch[0] Batch [40] Speed: 27.01 samples/sec accuracy=0.191016
INFO:root:Epoch[0] Batch [40] Speed: 27.43 samples/sec accuracy=0.191016
INFO:root:Epoch[0] Batch [60] Speed: 27.28 samples/sec accuracy=0.241406
INFO:root:Epoch[0] Batch [60] Speed: 27.47 samples/sec accuracy=0.241406
INFO:root:Epoch[0] Batch [80] Speed: 27.43 samples/sec accuracy=0.252344
INFO:root:Epoch[0] Batch [80] Speed: 27.45 samples/sec accuracy=0.252344
INFO:root:Epoch[0] Batch [100] Speed: 27.51 samples/sec accuracy=0.261719
INFO:root:Epoch[0] Batch [100] Speed: 27.31 samples/sec accuracy=0.261719

This is the same as each worker training independently.

Obviously, the --kv-store dist_sync option is not taking effect.

I wonder whether this problem is caused by the shared storage, for example each worker's kvstore writing to the same location on the shared storage?
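
One way to check whether the distributed kvstore is doing anything at all, independently of the training script, is a small push/pull round trip executed on every worker; this is only a sketch and assumes the DMLC_* environment above is already set for each process:

import mxnet as mx

# minimal dist_sync check: each worker pushes into key 3 and pulls back the
# value held by the servers; the interesting part is whether the round trip
# completes and whether the pulled value reflects the pushes at all
kv = mx.kvstore.create('dist_sync')
shape = (2, 3)
kv.init(3, mx.nd.ones(shape))
kv.push(3, mx.nd.ones(shape) * (kv.rank + 1))
out = mx.nd.zeros(shape)
kv.pull(3, out=out)
print('rank %d of %d pulled:' % (kv.rank, kv.num_workers), out.asnumpy())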

eric-haibin-lin commented 7 years ago

Which version of MXNet are you using? Is it the master branch?

zzningxp commented 7 years ago

I cloned from the head of master, at commit 4562dd.

zzningxp commented 7 years ago

I found from dmlc-core/tracker/dmlc_tracker/ssh.py that one compute node is expected to hold both one worker and one server; I am not sure whether that is correct. However, my task management system launches each process on its own node exclusively, so a server and a worker end up on two different nodes. I wonder whether this is the problem, and I will try to fix it.

------

I managed to run one worker and one server on the same node; however, the problem is not solved. The accuracy is still about 0.1 after many batches.

zzningxp commented 7 years ago

I found that, in my environment, the submit node cannot run GPU scripts, so the scheduler role created in PSTracker __init__ should not be placed on the same host as the submit node. I rewrote the tracker and am running the training program now. I hope this is the reason...

------

oops. The accuracy is still about 0.1 after many batches.

zzningxp commented 7 years ago

I think I found the reason: the ps-lite dependencies, zeromq and protobuf, are not working. However, it still does not work after I re-downloaded the source packages and re-built the project.

eric-haibin-lin commented 7 years ago

what kind of error messages did you see?


zzningxp commented 7 years ago

1) There is no network traffic. 2) There are no error messages. 3) The results are the same as before, because the weights are not being updated at all.

I think the problem is that the message queue is not sending any data out.
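
Given that no network traffic shows up, one simple thing to rule out is basic reachability of the scheduler endpoint from each compute node; the sketch below only tests a raw TCP connection to DMLC_PS_ROOT_URI/DMLC_PS_ROOT_PORT, not ps-lite itself:

import os
import socket

uri = os.environ.get('DMLC_PS_ROOT_URI', 'x.x.x.x')
port = int(os.environ.get('DMLC_PS_ROOT_PORT', '9091'))
try:
    s = socket.create_connection((uri, port), timeout=5)
    s.close()
    print('scheduler endpoint %s:%d is reachable' % (uri, port))
except socket.error as exc:
    print('cannot reach %s:%d -> %r' % (uri, port, exc))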

zzningxp commented 7 years ago

My environment is a SLURM-managed HPC cluster with shared storage or NFS, and the OS is booted from NFS. It is a special setup.

zzningxp commented 7 years ago

Also, with PS_VERBOSE=2 set for debugging, I compared the buggy system and a normal system (a common environment with the same source version and the same configuration). On the normal system there are several messages between ImageRecordIOParser2 and the start of the first epoch, as below:

[07:58:59] src/van.cc:161: 9 => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=4 }
[07:58:59] src/van.cc:291: Barrier count for 4 : 1
[07:58:59] src/van.cc:161: 11 => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=4 }
[07:58:59] src/van.cc:291: Barrier count for 4 : 2
[07:58:59] src/van.cc:136: ? => 9. Meta: request=0, timestamp=11, control={ cmd=BARRIER, barrier_group=0 }
[07:58:59] src/van.cc:136: ? => 11. Meta: request=0, timestamp=12, control={ cmd=BARRIER, barrier_group=0 }
[07:58:59] src/van.cc:161: 11 => 1. Meta: request=1, timestamp=3, control={ cmd=BARRIER, barrier_group=4 }
[07:58:59] src/van.cc:291: Barrier count for 4 : 1
[07:58:59] src/van.cc:161: 9 => 1. Meta: request=1, timestamp=3, control={ cmd=BARRIER, barrier_group=4 }
[07:58:59] src/van.cc:291: Barrier count for 4 : 2
...
...
...

However, the buggy system does not produce any such messages; there is no output at all in this phase.

Furthermore, in the node-assignment and recycling phases both systems print the same debug messages.
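
For completeness, PS_VERBOSE has to be present in the environment of every worker and server process before the kvstore is created for these van.cc lines to appear; a minimal sketch, assuming it is set at the very top of the training script:

import os
os.environ['PS_VERBOSE'] = '2'   # 1 logs connection setup, 2 also logs data messages

import mxnet as mx
kv = mx.kvstore.create('dist_sync')   # van.cc output like the lines above goes to stderr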

zzningxp commented 7 years ago

Problem solved.

It is a version bug. The master branch at commit 4562ddd54917e611a47122bd502059e0889a76b9 has this bug; I switched to the v0.11 branch at commit 77c50791d4ee87544b04d8517941b437d8231f2f, which does not.

PS. Regarding my last message: the normal system had both a source-installed Python version and a pip-installed version, and the pip version (v0.11r3) takes higher priority. The bug also appears on the normal system once the pip version is uninstalled.
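
A quick way to confirm which of the two installations Python actually resolves (relevant when a pip-installed package shadows a source build, as described above):

import mxnet

print(mxnet.__version__)   # e.g. the pip package vs. the source build
print(mxnet.__file__)      # the path shows whether it comes from site-packages or the source tree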

This issue can be closed. Thank you.

feevos commented 6 years ago

Hi @zzningxp, would it be possible to share your SLURM launch file? I have a different problem, which is launching distributed training under the SLURM environment (any pointers/guidelines to documentation are most appreciated). See this for more details.

This is my SLURM launch file (and it doesn't work):

#!/bin/bash -l

#SBATCH --job-name="DSTR"
#SBATCH -t 00:03:30
#SBATCH --nodes=4
#SBATCH --cpus-per-task=28
#SBATCH --gres=gpu:4
#SBATCH --mem=128gb

./get_nodes_ip.sh > workers_ip.txt

srun python /data/dia021/Software/mxnet/tools/launch.py  -n $(wc -l < workers_ip.txt) -s  $(wc -l < workers_ip.txt) -H workers_ip.txt --sync-dst-dir /home/dia021/Projects/isprs_potsdam/distributed
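
For reference, tools/launch.py expects the actual training command to follow its own options, which appears to be missing from the srun line above; a sketch of how the full invocation is usually assembled (the training script name and its arguments are placeholders):

import subprocess

n = sum(1 for _ in open('workers_ip.txt'))   # one worker and one server per listed host
cmd = ['python', '/data/dia021/Software/mxnet/tools/launch.py',
       '-n', str(n), '-s', str(n),
       '-H', 'workers_ip.txt',
       '--sync-dst-dir', '/home/dia021/Projects/isprs_potsdam/distributed',
       # everything after launch.py's own options is the command it runs on each node
       'python', 'train_model.py', '--kv-store', 'dist_sync']   # train_model.py is a placeholder
subprocess.check_call(cmd)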

Thanks