zzningxp opened this issue 7 years ago
Which version of MXNet are you using? Is it the master branch?
git clone from the master head, with commit 4562dd
I found that, in dmlc-core/tracker/dmlc_tracker/ssh.py, one compute node is expected to hold both one worker and one server; I am not sure whether this is correct. However, my task management system launches one process per node exclusively, so one server and one worker end up on two different nodes. I wonder whether this is the problem; I will try to fix it.
------
I successfully placed one worker and one server on the same node, but the problem is not solved: the accuracy is still about 0.1 after many batches.
I found that, in my environment, the submit node cannot run GPU scripts, so the scheduler role created in PSTracker.__init__ should not be allocated on the same host as the submit node. I rewrote the tracker and am running the training program now; I hope this was the reason...
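For anyone rewriting the tracker along these lines: a ps-lite process learns its role and the scheduler's location purely from environment variables, so a custom launcher only needs to export them before starting each process. The host/port values below are placeholders, not the reporter's actual configuration:

```shell
# Placeholder values; the scheduler host must be reachable from all
# servers and workers -- it need not be the submit node.
export DMLC_ROLE=scheduler          # "scheduler", "server", or "worker"
export DMLC_PS_ROOT_URI=10.0.0.5    # IP of the scheduler host
export DMLC_PS_ROOT_PORT=9091       # a free port on that host
export DMLC_NUM_SERVER=1
export DMLC_NUM_WORKER=1
# then start the training script in this environment, e.g.:
# python example/image-classification/train_cifar10.py --kv-store dist_sync
```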
------
Oops. The accuracy is still about 0.1 after many batches.
I found a possible reason: the ps-lite dependencies, zeromq and protobuf, are not working. However, it still does not work after I re-downloaded the source packages and re-built the project.
What kind of error messages did you see?
On Tue, Oct 10, 2017 at 00:46, zzningxp notifications@github.com wrote:
-- Best Regards, Haibin Lin
Department of Computer Science School of Computer Science Carnegie Mellon University
1) There is no network traffic. 2) There are no error messages. 3) The results are the same as before, because the weights are never updated.
I think the problem is that the message queue fails to send any data out.
My environment is a SLURM-managed HPC cluster with shared storage over NFS, and the OS itself boots from NFS; it is an unusual setup.
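To narrow down whether the messaging layer itself is at fault, ps-lite's transport (ZeroMQ) can be exercised directly with a bare pyzmq round-trip, independent of MXNet. This is a sketch assuming pyzmq is installed; it runs both ends in one process on localhost, but across hosts you would run the REP side on one node and point the REQ side at its IP:

```python
# Minimal ZeroMQ round-trip check (assumes pyzmq is installed).
import threading
import zmq

def echo_server(ctx, port):
    sock = ctx.socket(zmq.REP)
    sock.bind(f"tcp://*:{port}")
    msg = sock.recv()          # wait for the probe
    sock.send(msg)             # echo it back
    sock.close()

ctx = zmq.Context()
t = threading.Thread(target=echo_server, args=(ctx, 5555))
t.start()

client = ctx.socket(zmq.REQ)
client.connect("tcp://127.0.0.1:5555")  # use the server node's IP across hosts
client.send(b"ping")
reply = client.recv()          # b"ping" if the transport works
t.join()
client.close()
ctx.term()
```

If this hangs between two cluster nodes, the problem is below MXNet (firewall, interface binding, or the zeromq build), not in the kv-store logic.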
Also, with PS_VERBOSE=2 set for debugging, I compared the buggy system with a normal system (a common environment with the same source version and the same configuration). On the normal system, there are several messages between ImageRecordIOParser2 and the start of the first epoch, as below:
[07:58:59] src/van.cc:161: 9 => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=4 }
[07:58:59] src/van.cc:291: Barrier count for 4 : 1
[07:58:59] src/van.cc:161: 11 => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=4 }
[07:58:59] src/van.cc:291: Barrier count for 4 : 2
[07:58:59] src/van.cc:136: ? => 9. Meta: request=0, timestamp=11, control={ cmd=BARRIER, barrier_group=0 }
[07:58:59] src/van.cc:136: ? => 11. Meta: request=0, timestamp=12, control={ cmd=BARRIER, barrier_group=0 }
[07:58:59] src/van.cc:161: 11 => 1. Meta: request=1, timestamp=3, control={ cmd=BARRIER, barrier_group=4 }
[07:58:59] src/van.cc:291: Barrier count for 4 : 1
[07:58:59] src/van.cc:161: 9 => 1. Meta: request=1, timestamp=3, control={ cmd=BARRIER, barrier_group=4 }
[07:58:59] src/van.cc:291: Barrier count for 4 : 2
...
However, the buggy system does not produce any such messages; there is no output at all in this phase.
Furthermore, during the node assignment and recycling phases, both systems produce the same debug messages.
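For anyone reproducing this comparison: PS_VERBOSE is just an environment variable read by ps-lite, so it can be enabled per run without rebuilding. The training command in the comment is an example, not the reporter's exact invocation:

```shell
# PS_VERBOSE=1 logs connection setup; PS_VERBOSE=2 additionally logs
# every message sent and received (the van.cc traces shown above).
export PS_VERBOSE=2
# then launch as usual, e.g.:
# python example/image-classification/train_cifar10.py --kv-store dist_sync
```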
Problem solved.
Version bug. The master branch at commit 4562ddd54917e611a47122bd502059e0889a76b9 has this bug. I switched to the v0.11 branch at commit 77c50791d4ee87544b04d8517941b437d8231f2f, which does not.
PS. Regarding my last message: the normal system had both a source-installed and a pip-installed Python version of MXNet, and the pip version (v0.11r3) took priority. The bug also appears on the normal system once the pip version is uninstalled.
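When a pip install and a source install coexist like this, checking which copy Python actually resolves is a quick sanity check. The snippet below uses `json` as a stand-in module name so it runs anywhere; substitute `"mxnet"` to see which installation shadows the other:

```python
import importlib.util

# Replace "json" with "mxnet" to see which installed copy wins.
spec = importlib.util.find_spec("json")
print(spec.origin)   # filesystem path of the module that would be imported
```

The printed path reveals whether the import comes from site-packages (pip) or from the source checkout.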
This issue can be closed. Thank you.
Hi @zzningxp, would it be possible to share your SLURM launch file? I have a different problem, which is launching distributed training under the SLURM environment (any pointers/guidelines to documentation would be most appreciated). See this for more details.
This is my SLURM launch file (and it doesn't work):
#!/bin/bash -l
#SBATCH --job-name="DSTR"
#SBATCH -t 00:03:30
#SBATCH --nodes=4
#SBATCH --cpus-per-task=28
#SBATCH --gres=gpu:4
#SBATCH --mem=128gb
./get_nodes_ip.sh > workers_ip.txt
srun python /data/dia021/Software/mxnet/tools/launch.py -n $(wc -l < workers_ip.txt) -s $(wc -l < workers_ip.txt) -H workers_ip.txt --sync-dst-dir /home/dia021/Projects/isprs_potsdam/distributed
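Since `get_nodes_ip.sh` is not shown, here is one way such a script can be sketched: `scontrol show hostnames` expands SLURM's compressed `$SLURM_JOB_NODELIST` (e.g. `node[01-04]`) into one hostname per line, and `resolve_ip` is a hypothetical helper, not part of SLURM:

```shell
#!/bin/bash
# Hypothetical helper: resolve a hostname to its first IPv4 address.
resolve_ip() {
    getent hosts "$1" | awk '{ print $1; exit }'
}

# Expand the allocated nodelist to hostnames, then emit one IP per
# line for launch.py's -H file.
scontrol show hostnames "$SLURM_JOB_NODELIST" | while read -r host; do
    resolve_ip "$host"
done
```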
Thanks
Following https://mxnet.incubator.apache.org/how_to/multi_devices.html and using the cifar10 dataset with example/image-classification/train_cifar10.py to train a resnet50 model on two worker nodes, I encountered the problem below:
The environment is not typical. The storage is a shared distributed file system (Lustre), and task submission uses SLURM. I wrote a custom launcher that replaces the ssh command with a SLURM command. The launch environment variables are:
1) When using --kv-store dist_sync, the log is:
After 32 epochs, the log is:
2) When I remove --kv-store dist_sync, the log is:
This is the same as the situation in which each worker trains independently.
Obviously, the --kv-store dist_sync option doesn't take effect.
I wonder whether this problem is caused by the shared storage, e.g. the kv store of each worker writing to the same location on the shared storage?
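For context on what a working dist_sync run should do: each worker pushes its gradients for a key to the servers, the servers aggregate them, and every worker then pulls the same updated value, so all replicas stay identical. When that silently fails, each worker effectively trains alone, matching the independent-worker behavior described above. A pure-Python toy model of the intended aggregation (no MXNet required; all names here are illustrative):

```python
# Toy model of one synchronous kv-store step for a single key:
# every worker pushes its gradient, the server sums them, and all
# workers pull back the same updated weights.
def dist_sync_step(weights, worker_grads, lr=0.1):
    # Server side: sum the gradients pushed by all workers.
    total = [sum(g) for g in zip(*worker_grads)]
    # All workers pull identical updated weights afterwards.
    return [w - lr * g for w, g in zip(weights, total)]

weights = [1.0, 2.0]
grads = [[0.1, 0.2],   # gradient from worker 0
         [0.3, 0.4]]   # gradient from worker 1
print(dist_sync_step(weights, grads))  # updated weights, same on every worker
```

If the aggregation step never happens, each worker just applies its own gradient to its own copy, which is exactly what logs from independent workers would show.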