You may need to force each worker onto its own GPU, e.g. by setting CUDA_VISIBLE_DEVICES=$RANK in your for loop. Otherwise, I suspect they all start on GPU 0; you can confirm this by running nvidia-smi once the job starts.
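Roughly what I have in mind (just a sketch, with the remaining train.py flags elided; assumes 8 GPUs and one worker process per GPU):
for i in $(seq 0 7); do
    # Pin worker $i to GPU $i so the workers don't all land on GPU 0
    CUDA_VISIBLE_DEVICES=$i python train.py data-bin/wmt14_en_de_joined_dict ... --distributed-rank $i &
done
# Once the jobs are up, nvidia-smi should show one train.py process per GPU
nvidia-smi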
@edunov Appreciate the quick response. It got past the OOM error, but now it runs into RuntimeError: NCCL error in: /pytorch/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:322, unhandled cuda error.
Are there any environment variables I need to set? I'm currently on CUDA 9.2, NCCL 2.2.13, driver version 396.37.
@edunov An update: with NCCL debugging turned on, I can see this error now:
ip-10-0-0-168:55603:55931 [0] transport/p2p.cu:526 WARN failed to open CUDA IPC handle : 11 invalid argument
ip-10-0-0-168:55603:55931 [0] INFO init.cu:475 -> 1
ip-10-0-0-168:55603:55931 [0] INFO init.cu:536 -> 1
ip-10-0-0-168:55603:55931 [0] INFO misc/group.cu:70 -> 1 [Async thread]
ip-10-0-0-168:55604:55932 [0] transport/p2p.cu:526 WARN failed to open CUDA IPC handle : 11 invalid argument
ip-10-0-0-168:55604:55932 [0] INFO init.cu:475 -> 1
ip-10-0-0-168:55604:55932 [0] INFO init.cu:536 -> 1
ip-10-0-0-168:55604:55932 [0] INFO misc/group.cu:70 -> 1 [Async thread]
Ouch, sorry, I think I was wrong about CUDA_VISIBLE_DEVICES. Can you try this instead: 1) remove CUDA_VISIBLE_DEVICES, and 2) add --device-id $RANK to the train.py invocation?
Sorry for the confusion
@edunov that seems to have done the trick. If I could request the getting_started.rst link be updated, that'd be great :)
So it seems the next issue is that throughput (wps) has dropped significantly. Running on a single node with 8 GPUs, I can get close to the published 143k wps. Running the job on two nodes with 16 GPUs, the wps is just over 10k.
The two scripts used to launch are below:
#!/bin/bash
HOST_PORT="tcp://10.0.0.168:13333"
kill_children() {
for PID in ${PIDS[*]}; do
kill -TERM $PID
done
}
NODE=0
RANKS_PER_NODE=8
for i in $(seq 0 7); do
LOCAL_RANK=$i
DISTRIBUTED_RANK=$((RANKS_PER_NODE * NODE + LOCAL_RANK))
python train.py data-bin/wmt14_en_de_joined_dict \
--arch transformer_vaswani_wmt_en_de_big \
--share-all-embeddings \
--optimizer adam \
--adam-betas '(0.9, 0.98)' \
--clip-norm 0.0 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 1e-07 \
--warmup-updates 4000 \
--lr 0.0005 \
--min-lr 1e-09 \
--dropout 0.3 \
--weight-decay 0.0 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 3584 --fp16 \
--distributed-world-size 16 \
--distributed-init-method $HOST_PORT \
--device-id $LOCAL_RANK \
--distributed-rank $DISTRIBUTED_RANK &
PIDS[$LOCAL_RANK]=$!
done
trap kill_children SIGTERM SIGINT
for PID in ${PIDS[*]}; do
wait $PID
done
The second script, run on node 1, is identical except that NODE=1.
I've increased NCCL_MIN_NRINGS to 5; this helps just a tiny bit.
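For reference, the NCCL environment variables in play on my side (NCCL_SOCKET_IFNAME is an extra knob I haven't confirmed helps, and ens3 is only a placeholder for the node's actual interface name):
export NCCL_DEBUG=INFO           # verbose NCCL logging (how I captured the IPC warnings above)
export NCCL_MIN_NRINGS=5         # more communication rings for the inter-node all-reduce
export NCCL_SOCKET_IFNAME=ens3   # placeholder: the NIC NCCL should use for inter-node traffic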
@edunov I'm also wondering if there is a simpler way to start multi-node distributed training than what I've done so far. I don't have Slurm installed on my cluster, so I've been resorting to starting every process manually.
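In the meantime, a rough consolidation of the two launch scripts into one, parameterized by node index (a sketch only; assumes the code and data-bin are present on both nodes):
#!/bin/bash
# Usage: bash launch.sh 0   (on the first node)
#        bash launch.sh 1   (on the second node)
NODE=$1
RANKS_PER_NODE=8

# Flags shared by every rank, same values as in the scripts above
ARGS=(
  data-bin/wmt14_en_de_joined_dict
  --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0
  --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000
  --lr 0.0005 --min-lr 1e-09 --dropout 0.3 --weight-decay 0.0
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1
  --max-tokens 3584 --fp16
  --distributed-world-size 16 --distributed-init-method tcp://10.0.0.168:13333
)

for i in $(seq 0 $((RANKS_PER_NODE - 1))); do
  python train.py "${ARGS[@]}" \
    --device-id "$i" \
    --distributed-rank $((RANKS_PER_NODE * NODE + i)) &
  PIDS[$i]=$!
done

trap 'kill -TERM "${PIDS[@]}"' SIGTERM SIGINT
wait "${PIDS[@]}"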
10k is pretty small. Do you know what the network speed between the two nodes on AWS is? The results we reported in the paper were obtained on an InfiniBand-connected cluster, so it was very fast. On Ethernet, we also observe a slowdown when two machines are used compared to one.
We're working on a new version at the moment that should help a bit with network latency. And yes, let me update the wiki and the start script. In theory, we should be able to use multiprocessing_train.py to launch distributed jobs.
@edunov 25 Gbps between the two nodes, 10 Gbps per connection. I can look at bandwidth usage, but I'm not convinced this is the problem. Dropping from 143k wps (1 node / 8 GPUs) to 10k wps (2 nodes / 8 GPUs each) is a bit much; I would have expected to see something greater than 143k.
Ultimately I'm trying to reproduce the results published in the recent paper: https://arxiv.org/pdf/1806.00187.pdf
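For what it's worth, one way I can sanity-check the raw node-to-node bandwidth outside of NCCL (assumes iperf3 is installed on both nodes; 10.0.0.168 is the first node's private IP from the scripts above):
# On node 0 (10.0.0.168): start an iperf3 server
iperf3 -s

# On node 1: measure throughput to node 0 with 4 parallel streams for 30 seconds
iperf3 -c 10.0.0.168 -P 4 -t 30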
Hmm, we haven't experimented on AWS infrastructure before, but we get much better results than this over Ethernet. For example, using a batch size of 5120, FP16, and Ethernet, we get 167k wps on a single node and 237k wps on two nodes. Can you confirm that your AWS instances are in the same placement group?
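One way to check that from the AWS CLI, if it helps (a sketch; the instance IDs below are placeholders for your two instances):
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 i-0fedcba9876543210 \
  --query 'Reservations[].Instances[].[InstanceId,Placement.GroupName]' \
  --output table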
The README was updated; closing for now.
I'm attempting to do distributed training of a big Transformer model in FP16 using the following script, but I receive CUDA out-of-memory errors. I'm using a p3.16xlarge on AWS: 8 Volta V100 GPUs with 16 GB each, on a single node. I know I can do the same training with a different distributed training technique by spawning child processes through multiprocessing, but my end goal is to benchmark this on multiple nodes. I don't have Slurm set up for this, so I'm following the instructions laid out at the end here, manually starting one process per GPU: https://github.com/pytorch/fairseq/blob/master/docs/getting_started.rst
This is the output: