Closed drethage closed 4 years ago
For multi-GPU training you need to use python -m torch.distributed.launch --nproc_per_node=$NGPUS, where $NGPUS is the number of GPUs in your machine. But the current Docker image needs to be initialized with multi-GPU support as well. Make sure your Docker container is able to detect all 4 GPUs; you can quickly test it using CUDA_VISIBLE_DEVICES. I was able to train on multiple GPUs with the same version of PyTorch.
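A quick way to sanity-check that point is to inspect CUDA_VISIBLE_DEVICES directly. Here is a minimal sketch; the helper name visible_gpu_count is mine, not from this repo:

```python
import os

def visible_gpu_count(env=None):
    """Count GPUs exposed through CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (all devices visible,
    actual count unknown without querying the driver).
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    # An empty string hides every GPU; otherwise count the device ids.
    return len([d for d in value.split(",") if d.strip()])

# For example, start the container with all 4 GPUs exposed:
#   docker run --gpus all -e CUDA_VISIBLE_DEVICES=0,1,2,3 ...
```

Inside the container, torch.cuda.device_count() should then report 4; if it reports fewer, the container was not started with access to all the GPUs.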
Thanks for your response. I added these flags to train.sh, but still get the error "__init__() got an unexpected keyword argument 'find_unused_parameters'" when initializing torch.nn.parallel.DistributedDataParallel.
This seems like a PyTorch version error, since in 1.0.1.post2 torch.nn.parallel.DistributedDataParallel doesn't have an optional find_unused_parameters argument.
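One way to stay compatible with both versions is to pass find_unused_parameters only when the installed torch is new enough. A rough sketch; the helper and the version-string parsing are my assumptions, not code from this repo:

```python
def supports_find_unused_parameters(torch_version):
    """True if this torch version's DistributedDataParallel accepts
    find_unused_parameters (the argument was added in PyTorch 1.1.0)."""
    # "1.0.1.post2" -> (1, 0); "1.1.0" -> (1, 1)
    major, minor = (int(p) for p in torch_version.split(".")[:2])
    return (major, minor) >= (1, 1)

# Hypothetical usage at model-wrapping time:
#   kwargs = {}
#   if supports_find_unused_parameters(torch.__version__):
#       kwargs["find_unused_parameters"] = True
#   model = torch.nn.parallel.DistributedDataParallel(model, **kwargs)
```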
Yes, updating it to 1.1.0 should fix the error. Can you check by changing line 36 of the Dockerfile to:
RUN conda install -y pytorch==1.1.0 cudatoolkit=${CUDA} -c pytorch
If it works I will change the code. Thanks for letting me know.
It works if you change both lines to:
RUN conda install -y pytorch==1.1.0 cudatoolkit=${CUDA} -c pytorch && conda clean -ya
RUN pip install https://download.pytorch.org/whl/cu100/torch-1.1.0-cp36-cp36m-linux_x86_64.whl
Thanks for the update; I've updated the Dockerfile.
Great work! I'm attempting to train the model across 4 GPUs on a single machine via python -m torch.distributed.launch, but get: TypeError: __init__() got an unexpected keyword argument 'find_unused_parameters'. Is it necessary to use a different version of PyTorch for multi-GPU training? Thanks!