BelhalK opened this issue 4 years ago
Hi! Could you please let us know what version of PyTorch you're using? Also, if you have a minimal example to reproduce the issue that would be helpful.
Absolutely. I am using PyTorch 1.4.0 and running the gossip_sgd.py script exactly as it is in the repo. I use MPI on Kubernetes, running:
#!/bin/bash
export LD_LIBRARY_PATH=/home/work/cuda-9.0/lib64/:/home/work/cuda-9.0/lib/:/home/work/cuda-9.0/extras/CUPTI/lib64:/home/work/cudnn/\
cudnn_v7/cuda/lib64:$LD_LIBRARY_PATH
export CPATH=/home/work/cudnn/cudnn_v7/cuda/include:$CPATH
export LIBRARY_PATH=/home/work/cudnn/cudnn_v7/cuda/lib64:$LIBRARY_PATH
rank_0_ip=${POD_0_IP}
free_port=${TRAINER_PORTS}
echo "rank 0 ip: ${rank_0_ip}"
mpirun /opt/conda/envs/py36/bin/pip install mpi4py
set -x
mpirun /opt/conda/envs/py36/bin/python -u gossip_sgd.py --backend nccl --checkpoint_dir ./checkpoints\
--batch_size 256 --lr 0.1 --num_dataloader_workers 10 \
--num_epochs 5 --nesterov True --warmup True --push_sum False \
--graph_type 1 --schedule 30 0.1 60 0.1 80 0.1 \
--train_fast False \
--tag 'DPSGD_IB' --print_freq 100 --verbose False \
--all_reduce False --seed 1 \
--network_interface_type 'infiniband' --master_addr ${rank_0_ip} --master_port ${free_port} > runing_log.txt
After overcoming all the environment setup issues, I realized the function _make_graph(self) in graph_manager.py is not defined. All graph types depend on this function (and on the class GraphManager(object)).
Is this normal?
The _make_graph method is defined in subclasses of GraphManager, so that should not be an issue. The superclass is not meant to be used by itself.
It does seem to be an issue, at least in my run, where line 68, i.e. args.graph = graph_class(...), is blocking the run.
In order to create an object of the child class DynamicDirectedExponentialGraph(GraphManager), it must create an object of the superclass GraphManager. The latter cannot be instantiated since _make_graph is not implemented.
I would suppose the _make_graph method should be defined inside the child class, and the init of the superclass should be done inside the child class as well. What do you think?
Hi @BelhalK, which version of Python are you using? In Python 3, I believe that if a method is implemented in the child class it should all be fine. For example, if you try running the little script below, it should print "hello world" and not raise an error. It seems the issue you mentioned earlier was in the dist.barrier() step, is that correct? Can you confirm that the PyTorch backend is correctly initialized when you're running it on your cluster? (See the quick sanity check after the script below.)
Unfortunately the PyTorch backend doesn't use mpi4py, so you would need to make sure that your MPI installation has the available primitives. Which MPI distribution are you using?
class MasterGraph(object):
    def __init__(self):
        self._private_init()

    def _private_init(self):
        raise NotImplementedError


class ChildGraph(MasterGraph):
    def _private_init(self):
        print('hello world')


if __name__ == '__main__':
    c = ChildGraph()
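To confirm that the backend really initialized on your cluster, a quick sanity check like the one below (just a sketch, not something from the repo) could be dropped in right after the dist.init_process_group(...) call in gossip_sgd.py:

import torch.distributed as dist

# Fails fast if the process group was never set up; otherwise reports
# which backend, rank, and world size each worker ended up with.
assert dist.is_initialized(), "process group was never initialized"
print("backend:", dist.get_backend(),
      "rank:", dist.get_rank(),
      "world size:", dist.get_world_size())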
The version of the code in this repo currently only supports PyTorch 1.0. We're working on updating it, so stay tuned. There are some other errors or warnings I would expect to see if you are running with a version of PyTorch > 1.0, but we haven't previously seen the dist.barrier() error.
Which file are you referring to when you mention line 68? (Or do you mean line 683 in gossip_sgd.py?) The method DynamicDirectedExponentialGraph._make_graph() is implemented at line 151 of gossip/graph_manager.py.
In an earlier comment you mentioned "After overcoming all the environment setup issues...". Are you still seeing the same error message as in your original post? If not, could you please update here with the error you're getting?
Thank you all for the useful information.
Indeed, my error is no longer with dist.barrier() but rather with the _make_graph step.
I run the same script but with the following argument for the graph: --graph_type -1.
Then, at line 107 of gossip/distributed.py, the graph is supposed to be an NPDDEGraph.
In graph_manager.py I've made the following modifications (printing) to the NPeerDynamicDirectedExponentialGraph class's _make_graph method:
def _make_graph(self):
    print("world size:", self.world_size)
    print("peer per itr:", self._peers_per_itr)
    print(int(mlog(self.world_size - 1, self._peers_per_itr + 1)) + 1)
    for rank in range(self.world_size):
        print("rank:", rank)
        for i in range(0, int(mlog(self.world_size - 1, self._peers_per_itr + 1)) + 1):
            print("i:", i)
            for j in range(1, self._peers_per_itr + 1):
                print("j:", j)
                distance_to_neighbor = j * ((self._peers_per_itr + 1) ** i)
                f_peer = self._rotate_forward(rank, distance_to_neighbor)
                self._add_peers(rank, [f_peer])
The run is blocking at this point and here is the log I get:
Wed Sep 9 15:46:11 2020[1,2]<stdout>:world size: 8
Wed Sep 9 15:46:11 2020[1,2]<stdout>:peer per itr: 1
Wed Sep 9 15:46:11 2020[1,2]<stdout>:3
Wed Sep 9 15:46:11 2020[1,2]<stdout>:rank: 0
Wed Sep 9 15:46:11 2020[1,2]<stdout>:i: 0
Wed Sep 9 15:46:11 2020[1,2]<stdout>:j: 1
Wed Sep 9 15:46:11 2020[1,2]<stdout>:i: 1
Wed Sep 9 15:46:11 2020[1,2]<stdout>:j: 1
Wed Sep 9 15:46:11 2020[1,3]<stdout>:3
Wed Sep 9 15:46:11 2020[1,3]<stdout>:rank: 0
Wed Sep 9 15:46:11 2020[1,3]<stdout>:i: 0
Wed Sep 9 15:46:11 2020[1,3]<stdout>:j: 1
Wed Sep 9 15:46:11 2020[1,3]<stdout>:i: 1
Wed Sep 9 15:46:11 2020[1,3]<stdout>:j: 1
Wed Sep 9 15:46:11 2020[1,3]<stdout>:i: 2
Wed Sep 9 15:46:11 2020[1,3]<stdout>:j: 1
Wed Sep 9 15:46:11 2020[1,3]<stdout>:rank: 1
Wed Sep 9 15:46:11 2020[1,3]<stdout>:i: 0
Wed Sep 9 15:46:11 2020[1,3]<stdout>:j: 1
Wed Sep 9 15:46:11 2020[1,3]<stdout>:i: 1
Wed Sep 9 15:46:11 2020[1,3]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,5]<stdout>:world size: 8
Wed Sep 9 15:46:12 2020[1,5]<stdout>:peer per itr: 1
Wed Sep 9 15:46:12 2020[1,5]<stdout>:3
Wed Sep 9 15:46:12 2020[1,5]<stdout>:rank: 0
Wed Sep 9 15:46:12 2020[1,5]<stdout>:i: 0
Wed Sep 9 15:46:12 2020[1,5]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,5]<stdout>:i: 1
Wed Sep 9 15:46:12 2020[1,5]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,5]<stdout>:i: 2
Wed Sep 9 15:46:12 2020[1,5]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,5]<stdout>:rank: 1
Wed Sep 9 15:46:12 2020[1,5]<stdout>:i: 0
Wed Sep 9 15:46:12 2020[1,5]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,5]<stdout>:i: 1
Wed Sep 9 15:46:12 2020[1,5]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,5]<stdout>:i: 2
Wed Sep 9 15:46:12 2020[1,5]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,4]<stdout>:world size: 8
Wed Sep 9 15:46:12 2020[1,4]<stdout>:peer per itr: 1
Wed Sep 9 15:46:12 2020[1,4]<stdout>:3
Wed Sep 9 15:46:12 2020[1,4]<stdout>:rank: 0
Wed Sep 9 15:46:12 2020[1,4]<stdout>:i: 0
Wed Sep 9 15:46:12 2020[1,4]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,4]<stdout>:i: 1
Wed Sep 9 15:46:12 2020[1,4]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,4]<stdout>:i: 2
Wed Sep 9 15:46:12 2020[1,4]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,6]<stdout>:Model Init DONE OK
Wed Sep 9 15:46:12 2020[1,6]<stdout>:Optimizer Init DONE OK
Wed Sep 9 15:46:12 2020[1,6]<stdout>:world size: 8
Wed Sep 9 15:46:12 2020[1,6]<stdout>:peer per itr: 1
Wed Sep 9 15:46:12 2020[1,6]<stdout>:3
Wed Sep 9 15:46:12 2020[1,6]<stdout>:rank: 0
Wed Sep 9 15:46:12 2020[1,6]<stdout>:i: 0
Wed Sep 9 15:46:12 2020[1,6]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,6]<stdout>:i: 1
Wed Sep 9 15:46:12 2020[1,6]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,6]<stdout>:i: 2
Wed Sep 9 15:46:12 2020[1,6]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,6]<stdout>:rank: 1
Wed Sep 9 15:46:12 2020[1,6]<stdout>:i: 0
Wed Sep 9 15:46:12 2020[1,6]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,6]<stdout>:i: 1
Wed Sep 9 15:46:12 2020[1,6]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,6]<stdout>:i: 2
Wed Sep 9 15:46:12 2020[1,6]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,6]<stdout>:rank: 2
Wed Sep 9 15:46:12 2020[1,6]<stdout>:i: 0
Wed Sep 9 15:46:12 2020[1,6]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,6]<stdout>:i: 1
Wed Sep 9 15:46:12 2020[1,6]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,6]<stdout>:i: 2
Wed Sep 9 15:46:12 2020[1,6]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,7]<stdout>:world size: 8
Wed Sep 9 15:46:12 2020[1,7]<stdout>:peer per itr: 1
Wed Sep 9 15:46:12 2020[1,7]<stdout>:3
Wed Sep 9 15:46:12 2020[1,7]<stdout>:rank: 0
Wed Sep 9 15:46:12 2020[1,7]<stdout>:i: 0
Wed Sep 9 15:46:12 2020[1,7]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,7]<stdout>:i: 1
Wed Sep 9 15:46:12 2020[1,7]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,7]<stdout>:i: 2
Wed Sep 9 15:46:12 2020[1,7]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,7]<stdout>:rank: 1
Wed Sep 9 15:46:12 2020[1,7]<stdout>:i: 0
Wed Sep 9 15:46:12 2020[1,7]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,7]<stdout>:i: 1
Wed Sep 9 15:46:12 2020[1,7]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,7]<stdout>:i: 2
Wed Sep 9 15:46:12 2020[1,7]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,7]<stdout>:rank: 2
Wed Sep 9 15:46:12 2020[1,7]<stdout>:i: 0
Wed Sep 9 15:46:12 2020[1,7]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,7]<stdout>:i: 1
Wed Sep 9 15:46:12 2020[1,7]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,7]<stdout>:i: 2
Wed Sep 9 15:46:12 2020[1,7]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,7]<stdout>:rank: 3
Wed Sep 9 15:46:12 2020[1,7]<stdout>:i: 0
Wed Sep 9 15:46:12 2020[1,7]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,7]<stdout>:i: 1
Wed Sep 9 15:46:12 2020[1,7]<stdout>:j: 1
Wed Sep 9 15:46:12 2020[1,7]<stdout>:i: 2
Wed Sep 9 15:46:12 2020[1,7]<stdout>:j: 1
Wed Sep 9 15:46:13 2020[1,0]<stdout>:world size: 8
Wed Sep 9 15:46:13 2020[1,0]<stdout>:peer per itr: 1
Wed Sep 9 15:46:13 2020[1,0]<stdout>:3
Wed Sep 9 15:46:13 2020[1,0]<stdout>:rank: 0
Wed Sep 9 15:46:13 2020[1,0]<stdout>:i: 0
Wed Sep 9 15:46:13 2020[1,0]<stdout>:j: 1
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1460 [0] NCCL INFO Bootstrap : Using [0]eth0:10.127.6.24<0>
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1460 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1460 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1460 [0] NCCL INFO NET/Socket : Using [0]eth0:10.127.6.24<0>
Wed Sep 9 15:46:13 2020[1,0]<stdout>:NCCL version 2.4.8+cuda10.1
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1486 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff
Wed Sep 9 15:46:13 2020[1,1]<stdout>:world size: 8
Wed Sep 9 15:46:13 2020[1,1]<stdout>:peer per itr: 1
Wed Sep 9 15:46:13 2020[1,1]<stdout>:3
Wed Sep 9 15:46:13 2020[1,1]<stdout>:rank: 0
Wed Sep 9 15:46:13 2020[1,1]<stdout>:i: 0
Wed Sep 9 15:46:13 2020[1,1]<stdout>:j: 1
Wed Sep 9 15:46:13 2020[1,1]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1461:1461 [0] NCCL INFO Bootstrap : Using [0]eth0:10.127.6.24<0>
Wed Sep 9 15:46:13 2020[1,1]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1461:1461 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
Wed Sep 9 15:46:13 2020[1,1]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1461:1461 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
Wed Sep 9 15:46:13 2020[1,1]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1461:1461 [0] NCCL INFO NET/Socket : Using [0]eth0:10.127.6.24<0>
Wed Sep 9 15:46:13 2020[1,1]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1461:1488 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1486 [0] NCCL INFO NCCL_P2P_DISABLE set by environment to 0.
Wed Sep 9 15:46:13 2020[1,1]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1461:1488 [0] NCCL INFO NCCL_P2P_DISABLE set by environment to 0.
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1486 [0] NCCL INFO Channel 00 : 0 1
Wed Sep 9 15:46:13 2020[1,1]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1461:1488 [0] NCCL INFO Ring 00 : 1[0] -> 0[0] via P2P/IPC
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1486 [0] NCCL INFO Ring 00 : 0[0] -> 1[0] via P2P/IPC
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1486 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
Wed Sep 9 15:46:13 2020[1,1]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1461:1488 [0] NCCL INFO comm 0x7f6e3496af40 rank 1 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1486 [0] NCCL INFO comm 0x7f2cae2585c0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1460 [0] NCCL INFO Launch mode Parallel
Wed Sep 9 15:46:13 2020[1,0]<stdout>:i: 1
Wed Sep 9 15:46:13 2020[1,0]<stdout>:j: 1
Wed Sep 9 15:46:13 2020[1,1]<stdout>:i: 1
Wed Sep 9 15:46:13 2020[1,1]<stdout>:j: 1
Wed Sep 9 15:46:13 2020[1,1]<stdout>:i: 2
Wed Sep 9 15:46:13 2020[1,1]<stdout>:j: 1
Wed Sep 9 15:46:13 2020[1,1]<stdout>:rank: 1
Wed Sep 9 15:46:13 2020[1,1]<stdout>:i: 0
Wed Sep 9 15:46:13 2020[1,1]<stdout>:j: 1
Wed Sep 9 15:46:13 2020[1,1]<stdout>:NCCL version 2.4.8+cuda10.1
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1492 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff
Wed Sep 9 15:46:13 2020[1,1]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1461:1494 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1492 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance : NODE
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1492 [0] NCCL INFO Channel 00 : 0 1
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1492 [0] NCCL INFO Channel 01 : 0 1
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1492 [0] NCCL INFO Ring 00 : 1 -> 0 [receive] via NET/Socket/0
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1492 [0] NCCL INFO NET/Socket: Using 1 threads and 1 sockets per thread
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1492 [0] NCCL INFO Ring 00 : 0 -> 1 [send] via NET/Socket/0
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1492 [0] NCCL INFO Ring 01 : 1 -> 0 [receive] via NET/Socket/0
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1492 [0] NCCL INFO NET/Socket: Using 1 threads and 1 sockets per thread
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1492 [0] NCCL INFO Ring 01 : 0 -> 1 [send] via NET/Socket/0
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1492 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1492 [0] NCCL INFO comm 0x7f2cae2612c0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
Wed Sep 9 15:46:13 2020[1,0]<stdout>:yq01-sys-hic-k8s-v100-box-a225-0338:1460:1460 [0] NCCL INFO Launch mode Parallel
and it blocks there. It seems that, for some ranks, the graph-building process does not go through all the iterations. I'm really not sure why.
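For reference, here is a minimal standalone sketch of the same loop logic (assuming mlog is math.log and that _rotate_forward(rank, d) is just (rank + d) % world_size), showing what a complete pass should produce for world_size = 8 and peers_per_itr = 1:

from math import log as mlog

world_size, peers_per_itr = 8, 1
for rank in range(world_size):
    peers = []
    for i in range(0, int(mlog(world_size - 1, peers_per_itr + 1)) + 1):
        for j in range(1, peers_per_itr + 1):
            distance_to_neighbor = j * ((peers_per_itr + 1) ** i)
            peers.append((rank + distance_to_neighbor) % world_size)
    print(rank, peers)  # e.g. rank 0 -> [1, 2, 4]

So every rank should visit i = 0, 1, 2 for all 8 ranks; a log that stops at, say, rank 0, i = 1 suggests the hang is inside _add_peers rather than in the loop bounds themselves.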
The full job command is:
mpirun /opt/conda/envs/py36/bin/python -u gossip_sgd.py --backend nccl --checkpoint_dir ./checkpoints\
--batch_size 256 --lr 0.1 --num_dataloader_workers 10 \
--num_epochs 5 --nesterov True --warmup True --push_sum False \
--graph_type -1 --schedule 30 0.1 60 0.1 80 0.1 \
--train_fast False \
--tag 'DPSGD_IB' --print_freq 100 --verbose False \
--all_reduce False --seed 1 \
--network_interface_type 'infiniband' --master_addr ${rank_0_ip} --master_port ${free_port} > runing_log.txt
I'm also adding the following error message I get:
Wed Sep 9 21:30:55 2020[1,3]<stderr>: File "gossip_sgd.py", line 989, in <module>
Wed Sep 9 21:30:55 2020[1,3]<stderr>: main()
Wed Sep 9 21:30:55 2020[1,3]<stderr>: File "gossip_sgd.py", line 357, in main
Wed Sep 9 21:30:55 2020[1,3]<stderr>: use_streams=not args.no_cuda_streams,
Wed Sep 9 21:30:55 2020[1,3]<stderr>: File "/workspace/env_run/gossip/distributed.py", line 130, in __init__
Wed Sep 9 21:30:55 2020[1,3]<stderr>: rank, world_size, self.nprocs_per_node, self.local_rank)
Wed Sep 9 21:30:55 2020[1,3]<stderr>: File "/workspace/env_run/gossip/graph_manager.py", line 168, in __init__
Wed Sep 9 21:30:55 2020[1,3]<stderr>: self._make_graph()
Wed Sep 9 21:30:55 2020[1,3]<stderr>: File "/workspace/env_run/gossip/graph_manager.py", line 193, in _make_graph
Wed Sep 9 21:30:55 2020[1,3]<stderr>: self._add_peers(rank, [f_peer])
Wed Sep 9 21:30:55 2020[1,3]<stderr>: File "/workspace/env_run/gossip/graph_manager.py", line 202, in _add_peers
Wed Sep 9 21:30:55 2020[1,3]<stderr>: local_rank=self.local_rank))
Wed Sep 9 21:30:55 2020[1,3]<stderr>: File "/workspace/env_run/gossip/graph_manager.py", line 30, in __init__
Wed Sep 9 21:30:55 2020[1,3]<stderr>: dist.all_reduce(initializer_tensor, group=self.process_group)
Wed Sep 9 21:30:55 2020[1,3]<stderr>: File "/opt/conda/envs/py36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 904, in all_reduce
Wed Sep 9 21:30:55 2020[1,3]<stderr>: work = group.allreduce([tensor], opts)
_make_graph() seems to be the source of the issue, but I don't see how to interpret that error message. If someone could help, it would be appreciated :)
Bests,
_make_graph() creates the process groups that will be used for communicating while the algorithm is run (one process group per edge). There is one Edge object created for each peer, and in the __init__() method a call is made to dist.all_reduce() to initialize the process group. It seems that things could be hanging because some peers try to initialize before others are ready.
The call to dist.barrier() on line 682 in gossip_sgd.py is there to make sure that all workers are up and running, to avoid this issue. Is it possible that you commented out that line?
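To illustrate the pattern (just a sketch of the idea, not the repo's actual Edge class): every rank has to create every edge's process group, in the same order, and a small all_reduce initializes each group, so if one peer reaches this point before the others are ready, the collective can hang.

import torch
import torch.distributed as dist

def make_edge_groups(edge_list, my_rank, device):
    # All workers must be up before any group is created.
    dist.barrier()
    groups = {}
    for src, dst in edge_list:  # every rank iterates over the SAME edge list
        group = dist.new_group([src, dst])
        if my_rank in (src, dst):
            # Throwaway all_reduce to force the group to initialize now.
            initializer_tensor = torch.ones(1, device=device)
            dist.all_reduce(initializer_tensor, group=group)
        groups[(src, dst)] = group
    return groups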
What is the setup you are running on? You mentioned you're using kubernetes. I see you're using the NCCL backend, so MPI is really just being used to launch the jobs (and to determine the rank of each node). World size is 8. Is each kubernetes instance a server with 1 GPU or multiple GPUs?
I also noticed that you're passing --network_interface_type 'infiniband', but from the NCCL debug output it doesn't look like the system you are using supports InfiniBand (NCCL INFO NCCL_IB_DISABLE set by environment to 1), so it is reverting back to Ethernet (NCCL INFO NET/Socket : Using [0]eth0:10.127.6.24<0>).
Yes, you are right; that's a mistake on my end. I am using Ethernet as the interface type. Here is the latest job I've launched:
mpirun /opt/conda/envs/py36/bin/python -u gossip_sgd.py --backend nccl --checkpoint_dir ./checkpoints/\
--batch_size 256 --lr 0.1 --num_dataloader_workers 4 \
--num_epochs 5 --nesterov True --warmup True --push_sum False \
--graph_type 1 --schedule 30 0.1 60 0.1 80 0.1 \
--train_fast False --master_port 40100 \
--tag 'DPSGD_IB' --print_freq 100 --verbose False \
--all_reduce False --seed 1 \
--network_interface_type 'ethernet' --master_addr ${rank_0_ip} --master_port ${free_port} > runing_log.txt
I am indeed using MPI to launch jobs on a Kubernetes cluster, with the NCCL backend.
I am using mpi_slots_num = 2 (the number of MPI tasks per node) and k8strainers = 2 (the number of GPU servers my job is assigned to), with k8s_gpu_cards = 1 (the number of cards per GPU server).
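So, if I have the math right (my reading of the launcher parameters, not something from the repo), the world size for this job works out as:

# MPI tasks per node times number of GPU servers in the job.
mpi_slots_num = 2
k8strainers = 2
world_size = mpi_slots_num * k8strainers
print(world_size)  # 4, matching the world_size mentioned just below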
The latest run showed progress (recall that in the job above the world_size is 4):
Thu Sep 10 15:12:28 2020[1,1]<stdout>:Model Init DONE OK
Thu Sep 10 15:12:28 2020[1,1]<stdout>:ok 13
Thu Sep 10 15:12:28 2020[1,2]<stdout>:Model Init DONE OK
Thu Sep 10 15:12:28 2020[1,2]<stdout>:ok 13
Thu Sep 10 15:12:28 2020[1,0]<stdout>:Model Init DONE OK
Thu Sep 10 15:12:28 2020[1,0]<stdout>:ok 13
Thu Sep 10 15:12:28 2020[1,3]<stdout>:Model Init DONE OK
Thu Sep 10 15:12:28 2020[1,3]<stdout>:ok 13
where Model Init DONE OK is printed at line 178, after model = init_model() in gossip_sgd.py, and ok 13 is printed before line 185, self.gossip_flag.wait(), in gossip/distributed.py.
I'm not sure if the following error message is related to the wait() method:
Thu Sep 10 15:12:28 2020[1,0]<stderr>:Exception in thread Gossip-Thread:
Thu Sep 10 15:12:28 2020[1,0]<stderr>:Traceback (most recent call last):
Thu Sep 10 15:12:28 2020[1,0]<stderr>: File "/opt/conda/envs/py36/lib/python3.6/threading.py", line 916, in _bootstrap_inner
Thu Sep 10 15:12:28 2020[1,0]<stderr>: self.run()
Thu Sep 10 15:12:28 2020[1,0]<stderr>: File "/opt/conda/envs/py36/lib/python3.6/threading.py", line 864, in run
Thu Sep 10 15:12:28 2020[1,0]<stderr>: self._target(*self._args, **self._kwargs)
Thu Sep 10 15:12:28 2020[1,0]<stderr>: File "/root/paddlejob/workspace/env_run/gossip/distributed.py", line 544, in _gossip_target
Thu Sep 10 15:12:28 2020[1,0]<stderr>: flatten_tensors(gossip_params_by_dtype[dtype]),
Thu Sep 10 15:12:28 2020[1,0]<stderr>: File "/root/paddlejob/workspace/env_run/gossip/utils/helpers.py", line 35, in flatten_tensors
Thu Sep 10 15:12:28 2020[1,0]<stderr>: flat = torch.cat([t.view(-1) for t in tensors], dim=0)
Thu Sep 10 15:12:28 2020[1,0]<stderr>:RuntimeError: Error in dlopen or dlsym: libcaffe2_nvrtc.so: cannot open shared object file: No such file or directory
(And I've checked that dist.barrier() is not commented out. It is no longer a problem, since the argument-parsing step now executes fine.)
The latest issue was resolved with export LD_LIBRARY_PATH=/opt/conda/envs/py36/lib/python3.6/site-packages/torch/lib:$LD_LIBRARY_PATH
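As a generic double-check (not specific to this repo) that the missing library is resolvable after the export:

import ctypes

# Raises OSError if the loader still can't find the shared object.
ctypes.CDLL("libcaffe2_nvrtc.so")
print("libcaffe2_nvrtc.so found")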
New issue here:
Thu Sep 10 16:58:50 2020[1,1]<stderr>:Exception in thread Gossip-Thread:
Thu Sep 10 16:58:50 2020[1,1]<stderr>:Traceback (most recent call last):
Thu Sep 10 16:58:50 2020[1,1]<stderr>: File "/opt/conda/envs/py36/lib/python3.6/threading.py", line 916, in _bootstrap_inner
Thu Sep 10 16:58:50 2020[1,1]<stderr>: self.run()
Thu Sep 10 16:58:50 2020[1,1]<stderr>: File "/opt/conda/envs/py36/lib/python3.6/threading.py", line 864, in run
Thu Sep 10 16:58:50 2020[1,1]<stderr>: self._target(*self._args, **self._kwargs)
Thu Sep 10 16:58:50 2020[1,1]<stderr>: File "/root/paddlejob/workspace/env_run/gossip/distributed.py", line 550, in _gossip_target
Thu Sep 10 16:58:50 2020[1,1]<stderr>: logger=logger)
Thu Sep 10 16:58:50 2020[1,1]<stderr>: File "/root/paddlejob/workspace/env_run/gossip/gossiper.py", line 58, in __init__
Thu Sep 10 16:58:50 2020[1,1]<stderr>: assert isinstance(graph, GraphManager)
Thu Sep 10 16:58:50 2020[1,1]<stderr>:AssertionError
Hi there,
I am encountering this issue while using an NCCL backend. It seems like dist.barrier() is the problem.
Should I remove it?
Bests.