apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Distributed training over Infiniband #1623

Closed. vcodreanu closed this issue 6 years ago.

vcodreanu commented 8 years ago

I am experiencing some issues when using distributed training in combination with DMLC_INTERFACE="ib0". I have successfully trained many models using the default DMLC_INTERFACE (eth0), but I've hit some bandwidth limits on some large models and thus tried the Infiniband option.

I am using the dmlc_mpi launcher and the output is as follows:

With DMLC_INTERFACE not set, mxnet starts successfully:

Currently Loaded Modulefiles:
 1) cudnn/7.0-v4-prod            4) cuda/7.5.18                 
 2) python/2.7.9                 5) opencv/gnu/2.4.10           
 3) mxnet/2016.02.23             6) mpi/mvapich2-gdr/2.1-cuda75 
mpirun -n 4  -env DMLC_ROLE server -env DMLC_PS_ROOT_PORT 9894 -env DMLC_PS_ROOT_URI 10.3.200.83 -env DMLC_NUM_SERVER 4 -env DMLC_NUM_WORKER 4 --hostfile /home/valeriuc/work/mxnet/example/image-classification/hosts python /home/valeriuc/work/mxnet/example/image-classification/train_imagenet_full.py --batch-size 96 --lr 0.05 --lr-factor .94 --gpus 0,1 --kv-store dist_sync --data-dir /projects/2/managed_datasets/imagenet-full --network inception-v3-full --model-prefix model/ilsvrc21k-8n
mpirun -n 4  -env DMLC_ROLE worker -env DMLC_PS_ROOT_PORT 9894 -env DMLC_PS_ROOT_URI 10.3.200.83 -env DMLC_NUM_SERVER 4 -env DMLC_NUM_WORKER 4 --hostfile /home/valeriuc/work/mxnet/example/image-classification/hosts python /home/valeriuc/work/mxnet/example/image-classification/train_imagenet_full.py --batch-size 96 --lr 0.05 --lr-factor .94 --gpus 0,1 --kv-store dist_sync --data-dir /projects/2/managed_datasets/imagenet-full --network inception-v3-full --model-prefix model/ilsvrc21k-8n
2016-03-11 10:48:16,220 Node[1] start with arguments Namespace(batch_size=96, clip_gradient=5.0, data_dir='/projects/2/managed_datasets/imagenet-full', data_shape=299, gpus='0,1', kv_store='dist_sync', load_epoch=None, log_dir='/tmp/', log_file=None, lr=0.05, lr_factor=0.94, lr_factor_epoch=1, model_prefix='model/ilsvrc21k-8n', network='inception-v3-full', num_classes=21841, num_epochs=20, num_examples=14192019, train_dataset='train.rec', val_dataset='val.rec')
2016-03-11 10:48:16,221 Node[3] start with arguments Namespace(batch_size=96, clip_gradient=5.0, data_dir='/projects/2/managed_datasets/imagenet-full', data_shape=299, gpus='0,1', kv_store='dist_sync', load_epoch=None, log_dir='/tmp/', log_file=None, lr=0.05, lr_factor=0.94, lr_factor_epoch=1, model_prefix='model/ilsvrc21k-8n', network='inception-v3-full', num_classes=21841, num_epochs=20, num_examples=14192019, train_dataset='train.rec', val_dataset='val.rec')
2016-03-11 10:48:16,220 Node[2] start with arguments Namespace(batch_size=96, clip_gradient=5.0, data_dir='/projects/2/managed_datasets/imagenet-full', data_shape=299, gpus='0,1', kv_store='dist_sync', load_epoch=None, log_dir='/tmp/', log_file=None, lr=0.05, lr_factor=0.94, lr_factor_epoch=1, model_prefix='model/ilsvrc21k-8n', network='inception-v3-full', num_classes=21841, num_epochs=20, num_examples=14192019, train_dataset='train.rec', val_dataset='val.rec')
2016-03-11 10:48:16,221 Node[0] start with arguments Namespace(batch_size=96, clip_gradient=5.0, data_dir='/projects/2/managed_datasets/imagenet-full', data_shape=299, gpus='0,1', kv_store='dist_sync', load_epoch=None, log_dir='/tmp/', log_file=None, lr=0.05, lr_factor=0.94, lr_factor_epoch=1, model_prefix='model/ilsvrc21k-8n', network='inception-v3-full', num_classes=21841, num_epochs=20, num_examples=14192019, train_dataset='train.rec', val_dataset='val.rec')
[10:48:16] src/io/iter_image_recordio.cc:212: ImageRecordIOParser: /projects/2/managed_datasets/imagenet-full/train.rec, use 1 threads for decoding..
[10:48:16] src/io/iter_image_recordio.cc:212: ImageRecordIOParser: /projects/2/managed_datasets/imagenet-full/train.rec, use 1 threads for decoding..
[10:48:16] src/io/iter_image_recordio.cc:212: ImageRecordIOParser: /projects/2/managed_datasets/imagenet-full/train.rec, use 1 threads for decoding..
[10:48:16] src/io/iter_image_recordio.cc:212: ImageRecordIOParser: /projects/2/managed_datasets/imagenet-full/train.rec, use 1 threads for decoding..
[10:48:17] src/io/iter_image_recordio.cc:212: ImageRecordIOParser: /projects/2/managed_datasets/imagenet-full/val.rec, use 1 threads for decoding..
[10:48:17] src/io/iter_image_recordio.cc:212: ImageRecordIOParser: /projects/2/managed_datasets/imagenet-full/val.rec, use 1 threads for decoding..
[10:48:17] src/io/iter_image_recordio.cc:212: ImageRecordIOParser: /projects/2/managed_datasets/imagenet-full/val.rec, use 1 threads for decoding..
[10:48:17] src/io/iter_image_recordio.cc:212: ImageRecordIOParser: /projects/2/managed_datasets/imagenet-full/val.rec, use 1 threads for decoding..
2016-03-11 10:48:18,321 Node[1] Start training with [gpu(0), gpu(1)]
2016-03-11 10:48:18,351 Node[2] Start training with [gpu(0), gpu(1)]
2016-03-11 10:48:18,359 Node[3] Start training with [gpu(0), gpu(1)]
2016-03-11 10:48:18,396 Node[0] Start training with [gpu(0), gpu(1)]

With DMLC_INTERFACE="ib0", mxnet freezes:

Currently Loaded Modulefiles:
 1) cudnn/7.0-v4-prod            4) cuda/7.5.18                 
 2) python/2.7.9                 5) opencv/gnu/2.4.10           
 3) mxnet/2016.02.23             6) mpi/mvapich2-gdr/2.1-cuda75 
mpirun -n 4  -env DMLC_ROLE server -env DMLC_PS_ROOT_PORT 9986 -env DMLC_PS_ROOT_URI 10.3.200.83 -env DMLC_NUM_SERVER 4 -env DMLC_NUM_WORKER 4 --hostfile /home/valeriuc/work/mxnet/example/image-classification/hosts python /home/valeriuc/work/mxnet/example/image-classification/train_imagenet_full.py --batch-size 96 --lr 0.05 --lr-factor .94 --gpus 0,1 --kv-store dist_sync --data-dir /projects/2/managed_datasets/imagenet-full --network inception-v3-full --model-prefix model/ilsvrc21k-8n
mpirun -n 4  -env DMLC_ROLE worker -env DMLC_PS_ROOT_PORT 9986 -env DMLC_PS_ROOT_URI 10.3.200.83 -env DMLC_NUM_SERVER 4 -env DMLC_NUM_WORKER 4 --hostfile /home/valeriuc/work/mxnet/example/image-classification/hosts python /home/valeriuc/work/mxnet/example/image-classification/train_imagenet_full.py --batch-size 96 --lr 0.05 --lr-factor .94 --gpus 0,1 --kv-store dist_sync --data-dir /projects/2/managed_datasets/imagenet-full --network inception-v3-full --model-prefix model/ilsvrc21k-8n

And on the computing nodes I see processes like:

/hpc/sw/mvapich2-gdr-2.1-cuda-7.5-intel/bin/hydra_pmi_proxy --control-port gcn2:51087 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
valeriuc  68642  68543  1 10:45 ?        00:00:00           python /home/valeriuc/work/mxnet/example/image-classification/train_imagenet_full.py --batch-size 96 --lr 0.05 --lr-factor .94 --gpus 0,1 --kv-store dist_sync --data-dir /projects/2/managed_datasets/imagenet-full --network inception-v3-full --model-prefix model/ilsvrc21k-8n

When I Ctrl-C the process I get:

Press Ctrl-C again to force abort
Traceback (most recent call last):
  File "dmlc_slurm.py", line 92, in <module>
    pscmd=(' '.join(args.command) + ' ' + ' '.join(unknown)))
  File "/hpc/sw/mxnet-2016.02.23/tracker/tracker.py", line 424, in submit
    pserver.join()
  File "/hpc/sw/mxnet-2016.02.23/tracker/tracker.py", line 358, in join
    self.thread.join(100)
  File "/hpc/sw/python-2.7.9/lib/python2.7/threading.py", line 960, in join
    self.__block.wait(delay)
  File "/hpc/sw/python-2.7.9/lib/python2.7/threading.py", line 359, in wait
    _sleep(delay)
KeyboardInterrupt

Our system has multiple IB NICs (ib0 and ib1) and I get the same behavior when setting either of them. Also, it happens with both kv-stores: dist_sync and dist_async.

Could you advise me on what to try? Or what can I do to get more verbose debug information from mxnet?

piiswrong commented 8 years ago

@mli

mli commented 8 years ago

can you check two things:

  1. Check that ib0 is working, i.e. get the IP of ib0 on one machine, then ping that IP from another machine.
  2. Check whether ps-lite gets ib0's IP correctly; you can print my_node_.hostname and my_node_.port at the line linked below (a sketch of such a print follows the link):

https://github.com/dmlc/ps-lite/blob/ca2a28e27a6d3b305d14222f5aa44d419a1a8c14/src/van.cc#L52
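For illustration, a minimal sketch of the kind of print statement being suggested, assuming dmlc-core's stream-style LOG macro (which the timestamped "src/van.cc:NN:" lines elsewhere in this thread already come from) and the my_node_ member named above; this is not the actual ps-lite code, only the shape of the debug line:

// Hedged sketch: drop into ps-lite's src/van.cc near the linked line,
// after my_node_ has been filled in with the bound interface address.
LOG(INFO) << "my_node_.hostname = " << my_node_.hostname
          << " my_node_.port = " << my_node_.port;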

vcodreanu commented 8 years ago

Thanks for the answer.

Please see the output below:

ifconfig output on the participating nodes:

1st node

[valeriuc@gcn8 ~]$ ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:38:3A:7F:D4
          inet addr:10.3.200.89  Bcast:10.3.207.255  Mask:255.255.248.0
          inet6 addr: fe80::a00:38ff:fe3a:7fd4/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:317638866 errors:0 dropped:0 overruns:184 frame:0
          TX packets:640769988 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:388633126378 (361.9 GiB)  TX bytes:910276194456 (847.7 GiB)
          Memory:92180000-921fffff

ib0       Link encap:InfiniBand  HWaddr A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:10.202.203.89  Bcast:10.202.255.255  Mask:255.255.0.0
          inet6 addr: fe80::a00:3800:13a:7fd7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:632300 errors:0 dropped:0 overruns:0 frame:0
          TX packets:631017 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1024
          RX bytes:37926660 (36.1 MiB)  TX bytes:27893020 (26.6 MiB)

2nd node

[valeriuc@gcn9 ~]$ ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:38:3A:7F:F8
          inet addr:10.3.200.90  Bcast:10.3.207.255  Mask:255.255.248.0
          inet6 addr: fe80::a00:38ff:fe3a:7ff8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:73720790 errors:0 dropped:0 overruns:3 frame:0
          TX packets:573607916 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:46065774320 (42.9 GiB)  TX bytes:854911119840 (796.1 GiB)
          Memory:92180000-921fffff

ib0       Link encap:InfiniBand  HWaddr A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:10.202.203.90  Bcast:10.202.255.255  Mask:255.255.0.0
          inet6 addr: fe80::a00:3800:13c:ed44/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:22670 errors:0 dropped:0 overruns:0 frame:0
          TX packets:21725 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1024
          RX bytes:1352460 (1.2 MiB)  TX bytes:1032140 (1007.9 KiB)

ping between the nodes:

[valeriuc@gcn9 ~]$ ping 10.202.203.89
PING 10.202.203.89 (10.202.203.89) 56(84) bytes of data.
64 bytes from 10.202.203.89: icmp_seq=1 ttl=64 time=5.67 ms
64 bytes from 10.202.203.89: icmp_seq=2 ttl=64 time=0.063 ms
64 bytes from 10.202.203.89: icmp_seq=3 ttl=64 time=0.048 ms
64 bytes from 10.202.203.89: icmp_seq=4 ttl=64 time=0.065 ms
64 bytes from 10.202.203.89: icmp_seq=5 ttl=64 time=0.048 ms

print from van.cc:

[00:02:19] src/van.cc:75: mynode.hostname = 10.3.200.82 mynode.port = 9573
[00:02:20] src/van.cc:52: mynode.hostname = 10.202.203.89 mynode.port = 55631
[00:02:20] src/van.cc:75: mynode.hostname = 10.202.203.89 mynode.port = 55631
[00:02:20] src/van.cc:319: my_node.hostname = 10.3.200.82 my_node.port = 9573 i= 0
[00:02:20] src/van.cc:321: node.hostname = 10.202.203.89 node.port = 55631 i= 0
[00:02:20] src/van.cc:52: mynode.hostname = 10.202.203.89 mynode.port = 38399
[00:02:20] src/van.cc:75: mynode.hostname = 10.202.203.89 mynode.port = 38399
[00:02:20] src/van.cc:319: my_node.hostname = 10.3.200.82 my_node.port = 9573 i= 0
[00:02:20] src/van.cc:321: node.hostname = 10.202.203.89 node.port = 38399 i= 0
[00:02:21] src/van.cc:52: mynode.hostname = 10.202.203.90 mynode.port = 38179
[00:02:21] src/van.cc:75: mynode.hostname = 10.202.203.90 mynode.port = 38179
[00:02:21] src/van.cc:319: my_node.hostname = 10.3.200.82 my_node.port = 9573 i= 0
[00:02:21] src/van.cc:321: node.hostname = 10.202.203.90 node.port = 38179 i= 0
[00:02:21] src/van.cc:52: mynode.hostname = 10.202.203.90 mynode.port = 40717
[00:02:21] src/van.cc:75: mynode.hostname = 10.202.203.90 mynode.port = 40717
[00:02:21] src/van.cc:319: my_node.hostname = 10.3.200.82 my_node.port = 9573 i= 0
[00:02:21] src/van.cc:321: node.hostname = 10.202.203.90 node.port = 40717 i= 0

I have printed mynode.hostname here: https://github.com/dmlc/ps-lite/blob/ca2a28e27a6d3b305d14222f5aa44d419a1a8c14/src/van.cc#L74 and, as you can see, it is different from the one printed at L52. Do you know why this is?

Also, I have printed the mynode.hostname and node.hostname in the loop here: https://github.com/dmlc/ps-lite/blob/ca2a28e27a6d3b305d14222f5aa44d419a1a8c14/src/van.cc#L314

Any ideas?

mli commented 8 years ago

it looks normal to me, so I think all the nodes are connected.

next, can you try to print, at the end of Send_() and Recv(), the number of bytes sent and received? we need to check whether a node can send data from its ib0 to another node's ib0
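For reference, a hedged sketch of that extra logging; send_bytes and recv_bytes below are illustrative names for whatever local counter in Van::Send_() and Van::Recv() holds the number of bytes actually moved by the underlying zmq calls, not the real variable names:

// Illustrative only: just before Send_() returns
LOG(INFO) << "send_bytes = " << send_bytes;
// ... and just before Recv() returns
LOG(INFO) << "recv_bytes = " << recv_bytes;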

vcodreanu commented 8 years ago

here it is:

[00:50:50] src/van.cc:75: mynode.hostname = 10.3.200.82 mynode.port = 9465 [00:50:52] src/van.cc:52: mynode.hostname = 10.202.203.105 mynode.port = 44039 [00:50:52] src/van.cc:75: mynode.hostname = 10.202.203.105 mynode.port = 44039 [00:50:52] src/van.cc:226: send_bytes = 52 [00:50:52] src/van.cc:290: recv_bytes = 57 [00:50:52] src/van.cc:325: my_node.hostname = 10.3.200.82 my_node.port = 9465 i= 0 [00:50:52] src/van.cc:327: node.hostname = 10.202.203.105 node.port = 44039 i= 0 [00:50:52] src/van.cc:52: mynode.hostname = 10.202.203.105 mynode.port = 34601 [00:50:52] src/van.cc:75: mynode.hostname = 10.202.203.105 mynode.port = 34601 [00:50:52] src/van.cc:226: send_bytes = 57 [00:50:52] src/van.cc:290: recv_bytes = 62 [00:50:52] src/van.cc:325: my_node.hostname = 10.3.200.82 my_node.port = 9465 i= 0 [00:50:52] src/van.cc:327: node.hostname = 10.202.203.105 node.port = 34601 i= 0 [00:50:58] src/van.cc:52: mynode.hostname = 10.202.203.106 mynode.port = 46964 [00:50:58] src/van.cc:75: mynode.hostname = 10.202.203.106 mynode.port = 46964 [00:50:58] src/van.cc:226: send_bytes = 52 [00:50:58] src/van.cc:290: recv_bytes = 57 [00:50:58] src/van.cc:325: my_node.hostname = 10.3.200.82 my_node.port = 9465 i= 0 [00:50:58] src/van.cc:327: node.hostname = 10.202.203.106 node.port = 46964 i= 0 [00:50:58] src/van.cc:52: mynode.hostname = 10.202.203.106 mynode.port = 47818 [00:50:58] src/van.cc:75: mynode.hostname = 10.202.203.106 mynode.port = 47818 [00:50:58] src/van.cc:226: send_bytes = 57 [00:50:58] src/van.cc:290: recv_bytes = 62 [00:50:58] src/van.cc:325: my_node.hostname = 10.3.200.82 my_node.port = 9465 i= 0 [00:50:58] src/van.cc:327: node.hostname = 10.202.203.106 node.port = 47818 i= 0 [00:50:58] src/van.cc:226: send_bytes = 145 [00:50:58] src/van.cc:226: send_bytes = 145 [00:50:58] src/van.cc:226: send_bytes = 145 [00:50:58] src/van.cc:226: send_bytes = 145 [00:50:58] src/postoffice.cc:59: Between van start and barrier [00:50:58] src/postoffice.cc:105: In Postoffice:: Barrier [00:50:58] src/van.cc:226: send_bytes = 18 [00:50:58] src/van.cc:290: recv_bytes = 21

and with DMLC_INTERFACE=eth0 until the same point (after hitting Barrier):

[00:54:08] src/van.cc:75: mynode.hostname = 10.3.200.82 mynode.port = 9890 [00:54:09] src/van.cc:52: mynode.hostname = 10.3.200.105 mynode.port = 49647 [00:54:09] src/van.cc:75: mynode.hostname = 10.3.200.105 mynode.port = 49647 [00:54:09] src/van.cc:226: send_bytes = 50 [00:54:09] src/van.cc:290: recv_bytes = 55 [00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.82 my_node.port = 9890 i= 0 [00:54:09] src/van.cc:327: node.hostname = 10.3.200.105 node.port = 49647 i= 0 [00:54:09] src/van.cc:52: mynode.hostname = 10.3.200.105 mynode.port = 44416 [00:54:09] src/van.cc:75: mynode.hostname = 10.3.200.105 mynode.port = 44416 [00:54:09] src/van.cc:226: send_bytes = 55 [00:54:09] src/van.cc:290: recv_bytes = 60 [00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.82 my_node.port = 9890 i= 0 [00:54:09] src/van.cc:327: node.hostname = 10.3.200.105 node.port = 44416 i= 0 [00:54:09] src/van.cc:52: mynode.hostname = 10.3.200.106 mynode.port = 47398 [00:54:09] src/van.cc:75: mynode.hostname = 10.3.200.106 mynode.port = 47398 [00:54:09] src/van.cc:226: send_bytes = 50 [00:54:09] src/van.cc:290: recv_bytes = 55 [00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.82 my_node.port = 9890 i= 0 [00:54:09] src/van.cc:327: node.hostname = 10.3.200.106 node.port = 47398 i= 0 [00:54:09] src/van.cc:52: mynode.hostname = 10.3.200.106 mynode.port = 48295 [00:54:09] src/van.cc:75: mynode.hostname = 10.3.200.106 mynode.port = 48295 [00:54:09] src/van.cc:226: send_bytes = 55 [00:54:09] src/van.cc:290: recv_bytes = 60 [00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.82 my_node.port = 9890 i= 0 [00:54:09] src/van.cc:327: node.hostname = 10.3.200.106 node.port = 48295 i= 0 [00:54:09] src/van.cc:226: send_bytes = 136 [00:54:09] src/van.cc:226: send_bytes = 136 [00:54:09] src/van.cc:226: send_bytes = 136 [00:54:09] src/van.cc:226: send_bytes = 136 [00:54:09[00:54:09] src/van.cc:290: recv_bytes = 139 ] src/van.cc:290: recv_bytes = 139 [00:54:09] src/van.cc:[00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.105 my_node.port = 49647 i= 0 [00:54:09] src/van.cc:327: node.hostname = 10.3.200.105 node.port = 49647 i= 0 [00:54:09] src/van.cc325: my_node.hostname = 10.3.200.106 my_node.port = 47398 i= 0 [00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.105 my_node.port = 49647 i= 1 [00:54:09] src/van.cc:327: node.hostname = :327: node.hostname = 10.3.200.105 node.port = 49647 i= 0 [00:54:09] src/van.cc10.3.200.105 node.port = 44416 i= 1 [00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.105 my_node.port = :325: my_node.hostname = 10.3.200.106 my_node.port = 47398 i= 1 [00:54:09] src/van.cc:49647 i= 2 [00:54:09] src/van.cc:327: node.hostname = 10.3.200.106 node.port = 47398[00:54:09] src/van.cc:290: recv_bytes = 139 [00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.105 my_node.port = 44416 i= 0 [00:54:09] src/van.cc:327: node.hostname = i= 2 [00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.105 my_node.port = 49647 i= 310.3.200.105 node.port = 49647 i= 0 [00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.105 my_node.port = 44416 i= 1 [00:54:09] src/van.cc:327: node.hostname = [00:54:09] src/van.cc:10.3.200.105 node.port = 44416 i= 1 [00:54:09327: node.hostname = 10.3.200.106 node.port = ] src/van.cc:325: my_node.hostname = 10.3.200.105 my_node.port = 44416 i= 2 [00:54:0948295 i= 3 ] src/van.cc:327: node.hostname = [00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.105 my_node.port = 4964710.3.200.106 node.port = 47398 i= 2 i= 4

[00:54:09] [00:54:09] src/van.cc:327: node.hostname = 10.3.200.82 node.port = src/van.cc:325: my_node.hostname = 327: node.hostname = 10.3.200.105 node.port = 44416 i= 1 [00:54:09] src/van.cc:32510.3.200.105 my_node.port = 44416 i= 39890 i= 4

[00:54:09] src/van.cc:327: node.hostname = 10.3.200.106: my_node.hostname = 10.3.200.106 my_node.port = 47398 i= 2 node.port = 48295 i= 3 [00:54:09] [00:54:09] src/van.cc:327: node.hostname = src/van.cc:325: my_node.hostname = 10.3.200.106 node.port = 47398 i= 210.3.200.105 my_node.port = 44416 i= 4

[00:54:09] [00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.106 my_node.port = 47398src/van.cc:327: i= 3 [00:54:09] src/van.cc:327: node.hostname = 10.3.200.82 node.port = 9890 i= 4 node.hostname = 10.3.200.106 node.port = 48295 i= 3 [00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.106 my_node.port = 47398 i= 4 [00:54:09] src/van.cc:327: node.hostname = 10.3.200.82 node.port = 9890 i= 4 [00:54:09] src/postoffice.cc:59: Between van start and barrier [00:54:09] src/postoffice.cc:105: In Postoffice:: Barrier [00:54:09] src/van.cc:226: send_bytes = 18 [00:54:09] src/postoffice.cc:59: Between van start and barrier [00:54:09] src/postoffice.cc:105: In Postoffice:: Barrier [00:54:09] src/van.cc:290: recv_bytes = 139 [00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.106 my_node.port = 48295 i= 0 [00:54:09] src/van.cc:327: node.hostname = 10.3.200.105 node.port = 49647 i= 0 [00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.106 my_node.port = 48295 i= 1 [00:54:09] src/van.cc:327: node.hostname = 10.3.200.105 node.port = 44416 i= 1 [00:54:09] src/van.cc:325: my_node.hostname = 10.3.200.106 my_node.port = 48295 i= 2[00:54:09] [00:54:09] src/van.cc:327: node.hostname = 10.3.200.106 node.port = 47398 i= 2 [src/van.cc:226: send_bytes = 18 .......

The run with eth0 works successfully.

vcodreanu commented 8 years ago

Any ideas where I should look further?

Thanks!

mli commented 8 years ago

i'll try to add a debug option there this weekend, so you will see all connection activity.

mli commented 8 years ago

can you try with PS_VERBOSE=1?

you need to update ps-lite to the newest version first

cd ps-lite; git pull; 

and then rebuild mxnet

make clean; make;

the documentation is at http://ps-lite.readthedocs.org/en/latest/how_to.html#debug-ps-lite

vcodreanu commented 8 years ago

I'm running now with PS_VERBOSE=2.

The run on Infiniband:

[22:17:30] src/van.cc:76: Node Info: role=schedulerid=1, ip=10.3.200.83, port=9177 [22:17:30] src/van.cc:76: Node Info: role=server, ip=10.202.203.97, port=38842 [22:17:30] src/van.cc:226: S => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.202.203.97, port=38842 } } [22:17:30] src/van.cc:76: Node Info: role=worker, ip=10.202.203.97, port=51576 [22:17:30] src/van.cc:226: W => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.202.203.97, port=51576 } } [22:17:30] src/van.cc:76: Node Info: role=server, ip=10.202.203.98, port=56751 [22:17:30] src/van.cc:226: S => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.202.203.98, port=56751 } } [22:17:30] src/van.cc:76: Node Info: role=worker, ip=10.202.203.98, port=47467 [22:17:30] src/van.cc:226: W => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.202.203.98, port=47467 } } [22:17:30] src/van.cc:344: assign rank=9 to node role=worker, ip=10.202.203.98, port=47467 [22:17:30] src/van.cc:344: assign rank=8 to node role=server, ip=10.202.203.98, port=56751 [22:17:30] src/van.cc:344: assign rank=10 to node role=server, ip=10.202.203.97, port=38842 [22:17:30] src/van.cc:344: assign rank=11 to node role=worker, ip=10.202.203.97, port=51576 [22:17:30] src/van.cc:226: H[1] => 9: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=workerid=9, ip=10.202.203.98, port=47467 role=serverid=8, ip=10.202.203.98, port=56751 role=serverid=10, ip=10.202.203.97, port=38842 role=workerid=11, ip=10.202.203.97, port=51576 role=schedulerid=1, ip=10.3.200.83, port=9177 } } [22:17:30] src/van.cc:226: H[1] => 11: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=workerid=9, ip=10.202.203.98, port=47467 role=serverid=8, ip=10.202.203.98, port=56751 role=serverid=10, ip=10.202.203.97, port=38842 role=workerid=11, ip=10.202.203.97, port=51576 role=schedulerid=1, ip=10.3.200.83, port=9177 } } [22:17:30] src/van.cc:226: H[1] => 8: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=workerid=9, ip=10.202.203.98, port=47467 role=serverid=8, ip=10.202.203.98, port=56751 role=serverid=10, ip=10.202.203.97, port=38842 role=workerid=11, ip=10.202.203.97, port=51576 role=schedulerid=1, ip=10.3.200.83, port=9177 } } [22:17:30] src/van.cc:226: H[1] => 10: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=workerid=9, ip=10.202.203.98, port=47467 role=serverid=8, ip=10.202.203.98, port=56751 role=serverid=10, ip=10.202.203.97, port=38842 role=workerid=11, ip=10.202.203.97, port=51576 role=schedulerid=1, ip=10.3.200.83, port=9177 } } [22:17:30] src/van.cc:355: the scheduler is connected to 2 workers and 2 servers [22:17:30] src/van.cc:226: H[1] => 1: Meta: request=1, push=42, simple_app=0, control={ cmd=BARRIER, barrier_group=7 }

The run on ethernet up to the same point:

[22:18:04] src/van.cc:76: Node Info: role=schedulerid=1, ip=10.3.200.83, port=9561 [22:18:05] src/van.cc:76: Node Info: role=server, ip=10.3.200.97, port=44366 [22:18:05] src/van.cc:226: S => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.3.200.97, port=44366 } } [22:18:05] src/van.cc:76: Node Info: role=worker, ip=10.3.200.97, port=47885 [22:18:05] src/van.cc:226: W => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.3.200.97, port=47885 } } [22:18:05] src/van.cc:76: Node Info: role=server, ip=10.3.200.98, port=49326 [22:18:05] src/van.cc:226: S => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.3.200.98, port=49326 } } [22:18:05] src/van.cc:76: Node Info: role=worker, ip=10.3.200.98, port=50623 [22:18:05] src/van.cc:226: W => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.3.200.98, port=50623 } } [22:18:05] src/van.cc:344: assign rank=8 to node role=server, ip=10.3.200.98, port=49326 [22:18:05] src/van.cc:344: assign rank=9 to node role=worker, ip=10.3.200.98, port=50623 [22:18:05] src/van.cc:344: assign rank=10 to node role=server, ip=10.3.200.97, port=44366 [22:18:05] src/van.cc:344: assign rank=11 to node role=worker, ip=10.3.200.97, port=47885 [22:18:05] src/van.cc:226: H[1] => 9: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.3.200.98, port=49326 role=workerid=9, ip=10.3.200.98, port=50623 role=serverid=10, ip=10.3.200.97, port=44366 role=workerid=11, ip=10.3.200.97, port=47885 role=schedulerid=1, ip=10.3.200.83, port=9561 } } [22:18:05] src/van.cc:226: H[1] => 11: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.3.200.98, port=49326 role=workerid=9, ip=10.3.200.98, port=50623 role=serverid=10, ip=10.3.200.97, port=44366 role=workerid=11, ip=10.3.200.97, port=47885 role=schedulerid=1, ip=10.3.200.83, port=9561 } } [22:18:05] src/van.cc:226: H[1] => 8: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.3.200.98, port=49326 role=workerid=9, ip=10.3.200.98, port=50623 role=serverid=10, ip=10.3.200.97, port=44366 role=workerid=11, ip=10.3.200.97, port=47885 role=schedulerid=1, ip=10.3.200.83, port=9561 } } [22:18:05] src/van.cc:226: H[1] => 10: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.3.200.98, port=49326 role=workerid=9, ip=10.3.200.98, port=50623 role=serverid=10, ip=10.3.200.97, port=44366 role=workerid=11, ip=10.3.200.97, port=47885 role=schedulerid=1, ip=10.3.200.83, port=9561 } } [22:18:05] src/van.cc:355: the scheduler is connected to 2 workers and 2 servers [22:18:05] src/van.cc:[22:18:05] src/van.cc:362362: W[9] is connected to others [22:18:05] src/van.cc:362: S[10] is connected to others : W[11] is connected to others [22:18:05] src/van.cc:362: S[8] is connected to others [22:18:05] src/van.cc:226: H[1] => 1: Meta: request=1, push=42, simple_app=0, control={ cmd=BARRIER, barrier_group=7 } [22:18:05] src/van.cc:226: W[9] => 1: Meta: request=1, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=7 } [22:18:05] src/van.cc:226: S[8] => 1: Meta: request=1, push=43, simple_app=0, control={ cmd=BARRIER, barrier_group=7 } [22:18:05] src/van.cc:226: W[11] => [22:18:05] src/van.cc:226: S[10] => 11: Meta: request=1, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=7 } : Meta: request=1, push=43, simple_app=0, 
control={ cmd=BARRIER, barrier_group=7 } [22:18:05] src/van.cc:226: H[1] => 9: Meta: request=0, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=0 } [22:18:05] src/van.cc:226: H[1] => 11: Meta: request=0, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=0 } [22:18:05] src/van.cc:226: H[1] => 8: Meta: request=0, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=0 } [22:18:05] src/van.cc:226: H[1] => 10: Meta: request=0, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=0 } [22:18:05] src/van.cc:226: H[1] => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=0 } [22:18:05] src/van.cc[22:18:05] src/van.cc:226: H[1] => 1: :226: W[9] => 8: Meta: request=1, push=0, simple_app=1, customer_id=0, timestamp=0, head=-2 Meta: request=1, push=42, simple_app=0, control={ cmd=BARRIER, barrier_group=7 }

thanks for the help!

mli commented 8 years ago

it seems that sending data over ib0 fails. you can double-check this by pulling the recent ps-lite, where receiving is also logged.

i also added a --host-ip option in the mpi tracker, see https://github.com/dmlc/ps-lite/commit/f2ab107e2d72123d653633e60be2202bd59a5432

if you start your job on gcn8 with ib0, can you add the option --host-ip 10.202.203.89 and try again?

vcodreanu commented 8 years ago

yes, it seems that this is the case. I started the job directly from a compute node (before, I was starting it from a different "scheduler" node) and it gets a bit further. But it only proceeds on the server/worker placed on the node that launches the job (10.3.200.92 in this case).

[00:27:41] src/van.cc:76: Bind to role=schedulerid=1, ip=10.3.200.92, port=9876 [00:27:41] src/van.cc:76: Bind to role=server, ip=10.202.203.92, port=40666 [00:27:41] src/van.cc:226: S => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.202.203.92, port=40666 } } [00:27:41] src/van.cc:291: H[1] <= 2147483647: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.202.203.92, port=40666 } } [00:27:41] src/van.cc:76: Bind to role=worker, ip=10.202.203.92, port=38127 [00:27:41] src/van.cc:226: W => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.202.203.92, port=38127 } } [00:27:41] src/van.cc:291: H[1] <= 2147483647: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.202.203.92, port=38127 } } [00:27:42] src/van.cc:76: Bind to role=server, ip=10.202.203.93, port=49706 [00:27:42] src/van.cc:226: S => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.202.203.93, port=49706 } } [00:27:42] src/van.cc:291: H[1] <= 2147483647: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.202.203.93, port=49706 } } [00:27:42] src/van.cc:76: Bind to role=worker, ip=10.202.203.93, port=58320 [00:27:42] src/van.cc:226: W => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.202.203.93, port=58320 } } [00:27:42] src/van.cc:291: H[1] <= 2147483647: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.202.203.93, port=58320 } } [00:27:42] src/van.cc:348: assign rank=8 to node role=server, ip=10.202.203.93, port=49706 [00:27:42] src/van.cc:348: assign rank=9 to node role=worker, ip=10.202.203.93, port=58320 [00:27:42] src/van.cc:348: assign rank=11 to node role=worker, ip=10.202.203.92, port=38127 [00:27:42] src/van.cc:348: assign rank=10 to node role=server, ip=10.202.203.92, port=40666 [00:27:42] src/van.cc:226: H[1] => 9: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.202.203.93, port=49706 role=workerid=9, ip=10.202.203.93, port=58320 role=workerid=11, ip=10.202.203.92, port=38127 role=serverid=10, ip=10.202.203.92, port=40666 role=schedulerid=1, ip=10.3.200.92, port=9876 } } [00:27:42] src/van.cc:226: H[1] => 11: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.202.203.93, port=49706 role=workerid=9, ip=10.202.203.93, port=58320 role=workerid=11, ip=10.202.203.92, port=38127 role=serverid=10, ip=10.202.203.92, port=40666 role=schedulerid=1, ip=10.3.200.92, port=9876 } } [00:27:42] src/van.cc:226: H[1] => 8: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.202.203.93, port=49706 role=workerid=9, ip=10.202.203.93, port=58320 role=workerid=11, ip=10.202.203.92, port=38127 role=serverid=10, ip=10.202.203.92, port=40666 role=schedulerid=1, ip=10.3.200.92, port=9876 } } [00:27:42] src/van.cc:226: H[1] => 10: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.202.203.93, port=49706 role=workerid=9, ip=10.202.203.93, port=58320 role=workerid=11, ip=10.202.203.92, port=38127 role=serverid=10, ip=10.202.203.92, port=40666 role=schedulerid=1, ip=10.3.200.92, port=9876 } } [00:27:42] src/van.cc:359: the scheduler is connected to 2 workers and 2 servers [00:27:42] src/van.cc:291: W <= 1: Meta: request=0, push=0, simple_app=0, control={ 
cmd=ADD_NODE, node={ role=serverid=8, ip=10.202.203.93, port=49706 role=workerid=9, ip=10.202.203.93, port=58320 role=workerid=11, ip=10.202.203.92, port=38127 role=serverid=10, ip=10.202.203.92, port=40666 role=schedulerid=1, ip=10.3.200.92, port=9876 } } [00:27:42] src/van.cc:291: S <= 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=serverid=8, ip=10.202.203.93, port=49706 role=workerid=9, ip=10.202.203.93, port=58320 role=workerid=11, ip=10.202.203.92, port=38127 role=serverid=10, ip=10.202.203.92, port=40666 role=schedulerid=1, ip=10.3.200.92, port=9876 } } [00:27:42] src/van.cc:366: W[11] is connected to others [00:27:42] src/van.cc:366: S[10] is connected to others [00:27:42] src/van.cc:226: S[10] => 1: Meta: request=1, push=42, simple_app=0, control={ cmd=BARRIER, barrier_group=7 } [00:27:42] src/van.cc:226: W[11] => 1: Meta: request=1, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=7 } [00:27:42] src/van.cc:226: H[1] => 1: Meta: request=1, push=42, simple_app=0, control={ cmd=BARRIER, barrier_group=7 } [00:27:42] src/van.cc:291: H[1] <= 1: Meta: request=1, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=7 } [00:27:42] src/van.cc:291: H[1] <= 10: Meta: request=1, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=7 } [00:27:42] src/van.cc:291: H[1] <= 11: Meta: request=1, push=0, simple_app=0, control={ cmd=BARRIER, barrier_group=7 }

but it seems strange... we have many MPI programs running over ib0. Any ideas?

mli commented 8 years ago

the problem seems to be that the scheduler, which uses eth0, fails to send data to another machine's ib0 interface.

can you try the --host-ip way, namely letting the scheduler also use the ib0 interface?

vcodreanu commented 8 years ago

but in the last test the scheduler is on the compute node, so on ib0, and it says:

[00:27:42] src/van.cc:366: W[11] is connected to others
[00:27:42] src/van.cc:366: S[10] is connected to others

so I suppose it misses S[8] and W[9], which should sit on the other IP. But why?

IB bandwidth/latency tests between nodes work without any problems.

mli commented 8 years ago

the scheduler still uses eth0, whose IP is obtained by tracker/tracker.py and ignores the DMLC_INTERFACE option... namely:

[00:27:41] src/van.cc:76: Bind to role=scheduler id=1, ip=10.3.200.92, port=9876

vcodreanu commented 8 years ago

yes, now that it uses ib0 it freezes earlier:

[01:13:56] src/van.cc:76: Bind to role=schedulerid=1, ip=10.202.203.92, port=9984
[01:13:56] src/van.cc:76: Bind to role=server, ip=10.202.203.92, port=35198
[01:13:56] src/van.cc:226: S => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.202.203.92, port=35198 } }
[01:13:56] src/van.cc:291: H[1] <= 2147483647: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.202.203.92, port=35198 } }
[01:13:56] src/van.cc:76: Bind to role=worker, ip=10.202.203.92, port=53948
[01:13:56] src/van.cc:226: W => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.202.203.92, port=53948 } }
[01:13:56] src/van.cc:291: H[1] <= 2147483647: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.202.203.92, port=53948 } }
[01:13:57] src/van.cc:76: Bind to role=server, ip=10.202.203.93, port=37327
[01:13:57] src/van.cc:226: S => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.202.203.93, port=37327 } }
[01:13:57] src/van.cc:76: Bind to role=worker, ip=10.202.203.93, port=50676
[01:13:57] src/van.cc:226: W => 1: Meta: request=0, push=0, simple_app=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.202.203.93, port=50676 } }

mli commented 8 years ago

that means zmq fails to send data on InfiniBand with the tcp protocol.

can you try sdp instead? to use it, you need to hack ps-lite a little bit:

  1. Go to ps-lite/src/van.cc, then replace the two tcp: occurrences with sdp: (a sketch of this change follows the list).
  2. Run make clean; make -j8 in mxnet's root; it should recompile ps-lite.
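As a rough, hedged sketch of what step 1 amounts to: ps-lite hands zmq an endpoint string of the form "tcp://<ip>:<port>" when it binds and connects, and the hack simply swaps the transport prefix so zmq attempts Sockets Direct Protocol instead of plain TCP (this assumes the libzmq build on the cluster accepts sdp:// endpoints). The helper below is hypothetical, written only to show the shape of the change, not the actual van.cc code:

#include <string>

// Hypothetical helper mirroring how van.cc assembles the endpoint passed to
// zmq_bind()/zmq_connect(); the suggested hack only changes the prefix.
std::string MakeEndpoint(const std::string& ip, int port, bool use_sdp) {
  const std::string transport = use_sdp ? "sdp" : "tcp";
  return transport + "://" + ip + ":" + std::to_string(port);
}

// e.g. MakeEndpoint("10.202.203.92", 35198, true) -> "sdp://10.202.203.92:35198"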

i don't have InfiniBand at hand, so I'm not sure if the above solution works...

chongyang-xu commented 6 years ago

Hi, I met the same issue. Did you succeed in training over IB? @vcodreanu I replaced the two tcp: with sdp: in ps-lite/src/van.cc, but it didn't work in my environment.