dineshreddy91 / Occlusion_Net

[CVPR 2019] Occlusion-Net: 2D/3D Occluded Keypoint Localization Using Graph Networks

Multi-gpu training #8

Closed drethage closed 4 years ago

drethage commented 4 years ago

Great work! I'm attempting to train the model across 4 GPUs on a single machine via python -m torch.distributed.launch, but I get: TypeError: __init__() got an unexpected keyword argument 'find_unused_parameters'. Is a different version of PyTorch necessary for multi-GPU training? Thanks!

dineshreddy91 commented 4 years ago

For multi-GPU training you need to use python -m torch.distributed.launch --nproc_per_node=$NGPUS, where $NGPUS is the number of GPUs in your machine. The current Docker image also needs to be started with multi-GPU support, so make sure your container can detect all 4 GPUs; you can quickly test this using CUDA_VISIBLE_DEVICES. I was able to train on multiple GPUs with the same version of PyTorch.
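As a quick sanity check (a generic snippet, not part of the Occlusion_Net code; the filename is just for illustration), the following can be run inside the container to confirm PyTorch actually sees all the GPUs before launching distributed training:

```python
# check_gpus.py (hypothetical): verify GPU visibility inside the container
import os
import torch

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
print("CUDA available:", torch.cuda.is_available())
print("GPU count seen by PyTorch:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print("  device %d: %s" % (i, torch.cuda.get_device_name(i)))
```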

drethage commented 4 years ago

Thanks for your response. I added these flags to train.sh, but I still get the error "__init__() got an unexpected keyword argument" when initializing torch.nn.parallel.DistributedDataParallel. This looks like a PyTorch version issue, since in 1.0.1.post2 torch.nn.parallel.DistributedDataParallel does not accept an optional find_unused_parameters argument.
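For context, the version difference can be reproduced with a small standalone script (a sketch, not the repo's training code; the filename and the plain Linear model are placeholders). The find_unused_parameters keyword was added to DistributedDataParallel in PyTorch 1.1.0, so this raises the TypeError above on 1.0.1 and succeeds on 1.1.0:

```python
# ddp_check.py (hypothetical): launch with
#   python -m torch.distributed.launch --nproc_per_node=$NGPUS ddp_check.py
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # set by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(10, 2).cuda()  # placeholder model
model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[args.local_rank],
    output_device=args.local_rank,
    find_unused_parameters=True,  # TypeError on PyTorch 1.0.1, accepted on >= 1.1.0
)
print("rank %d: DDP wrapped OK on GPU %d" % (dist.get_rank(), args.local_rank))
```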

dineshreddy91 commented 4 years ago

Yes, updating to PyTorch 1.1.0 should fix the error. Can you check by changing line 36 of the Dockerfile to:

RUN conda install -y pytorch==1.1.0 cudatoolkit=${CUDA} -c pytorch

If it works, I will update the code.

Thanks for letting me know

drethage commented 4 years ago

It works if you change both PyTorch install lines in the Dockerfile to:

RUN conda install -y pytorch==1.1.0 cudatoolkit=${CUDA} -c pytorch && conda clean -ya

RUN pip install https://download.pytorch.org/whl/cu100/torch-1.1.0-cp36-cp36m-linux_x86_64.whl
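After rebuilding the image, a quick way to confirm the upgrade took effect inside the container (expected values assume the cu100 / PyTorch 1.1.0 wheel above):

```python
import torch
print(torch.__version__)          # expect 1.1.0
print(torch.version.cuda)         # expect 10.0.x for the cu100 build
print(torch.cuda.is_available())  # expect True inside a GPU-enabled container
```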

dineshreddy91 commented 4 years ago

Thanks for the update... I've updated the Dockerfile.