Multiple gpus training - Githubissues

AliaksandrSiarohin / first-order-model

This repository contains the source code for the paper First Order Motion Model for Image Animation

https://aliaksandrsiarohin.github.io/first-order-model-website/

MIT License

Multiple gpus training #338

Open alessiapacca opened 3 years ago

alessiapacca commented 3 years ago

One question: when I ran the training, I always used a single GPU, because whenever I tried to use more than one, the GPU usage stayed at 0%. Does the code work when all the GPUs are set to "EXCLUSIVE" mode?
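For anyone who wants to check the same thing: the compute mode of each GPU can be read with `nvidia-smi --query-gpu=index,compute_mode --format=csv,noheader`. Below is a minimal sketch that parses that output. The helper names are made up for illustration, and the sample string is an assumption about what the CSV looks like, not output captured from the server in question.

```python
import subprocess

# nvidia-smi query that prints one "index, compute_mode" line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"]


def parse_compute_modes(csv_text):
    """Map GPU index -> compute mode string from nvidia-smi CSV text."""
    modes = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, mode = [field.strip() for field in line.split(",", 1)]
        modes[int(index)] = mode
    return modes


def exclusive_gpus(modes):
    """Return the sorted indices of GPUs whose mode starts with 'Exclusive'."""
    return sorted(i for i, m in modes.items() if m.lower().startswith("exclusive"))


def query_compute_modes():
    """Run nvidia-smi (must be on PATH) and parse its output."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_modes(out)


if __name__ == "__main__":
    # Hypothetical sample text; on a real machine call query_compute_modes().
    sample = "0, Default\n1, Exclusive_Process\n"
    print(exclusive_gpus(parse_compute_modes(sample)))
```

In exclusive-process mode a GPU accepts a context from only one process at a time, so a job that already holds a GPU would keep a second process from using it. Whether that explains the 0% usage here is not something this check alone can confirm.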

AliaksandrSiarohin commented 3 years ago

Have you specified device_ids?

alessiapacca commented 3 years ago

@AliaksandrSiarohin yes, I have always used the command from the README. It may be a problem with the server I run it on, but I just wanted to understand whether the code works when the GPUs are in exclusive mode, as that may be an issue with the server.

AliaksandrSiarohin commented 3 years ago

Sorry, I have no idea what that exclusive mode means. You could check whether a simple multi-GPU CIFAR example works for you, and whether a simple CIFAR example with synchronous BN works.
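A check along those lines could look roughly like the sketch below. It is only an assumption of what "simple multi-GPU" might mean: it uses plain `nn.DataParallel` and `nn.BatchNorm2d` (not the synchronized BN used in this repository), a toy network, and random CIFAR-sized tensors instead of the real dataset. The function names are hypothetical.

```python
import torch
import torch.nn as nn


def build_model():
    # Tiny CIFAR-sized classifier, only meant to put some load on the GPUs.
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.BatchNorm2d(8),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(8, 10),
    )


def run_sanity_check(steps=5, batch_size=16):
    """Train for a few steps on random data; returns (GPU count, losses)."""
    n_gpus = torch.cuda.device_count()
    device = torch.device("cuda:0" if n_gpus > 0 else "cpu")
    model = build_model().to(device)
    if n_gpus > 1:
        # Split each batch across all visible GPUs.
        model = nn.DataParallel(model, device_ids=list(range(n_gpus)))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    losses = []
    for _ in range(steps):
        # Random 32x32 RGB images and labels stand in for CIFAR.
        images = torch.randn(batch_size, 3, 32, 32, device=device)
        labels = torch.randint(0, 10, (batch_size,), device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return n_gpus, losses
```

Running `run_sanity_check(steps=200, batch_size=256)` while watching `nvidia-smi` in another terminal should show whether every GPU gets any load. If this toy loop also leaves the GPUs idle, the problem is more likely in the server setup than in the training code.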

Mathilda88 commented 3 years ago

@alessiapacca could you please tell us which dataset you managed to train the network on? Was it with Python 3.7.5?

alessiapacca commented 3 years ago

@Eliot04 hey, I trained on the Vox dataset and used Python 3.6.4.

Mathilda88 commented 3 years ago

@alessiapacca Super helpful. Thanks.

SystemErrorWang commented 3 years ago

@alessiapacca I tried using distributed data parallel to accelerate the training, and it seems to be working. Maybe you can try this too. (Synchronized BatchNorm may have problems when distributed data parallel is used, though; I did not test it.)

Qia98 commented 1 year ago

@SystemErrorWang How do you use distributed data parallel to accelerate the training? I wrapped the model and the datasets for DDP, but it does not seem to be working: the GPU usage was always at 0%.

SystemErrorWang commented 1 year ago

I modified the code using this repo: https://github.com/rosinality/stylegan2-pytorch. I adopted the DDP part of the stylegan2 code and combined it with the First-Order Motion Model training code. It will take some time to read the code, and unfortunately my previous code is missing because I have changed jobs. I believe it's practical and not difficult. Wish you good luck!
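Since that code is lost, here is only a generic sketch of the usual PyTorch DDP pattern, not a reconstruction of it and not taken from the stylegan2 repository. It assumes one process per GPU, a toy model and random data in place of the real generator and dataset, and a file-based rendezvous. All function names are hypothetical.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def make_dataset(n=64):
    # Seeded random tensors standing in for the real video dataset.
    g = torch.Generator().manual_seed(0)
    return TensorDataset(torch.randn(n, 16, generator=g), torch.randn(n, 1, generator=g))


def make_sampler(dataset, rank, world_size):
    # Each rank iterates over its own disjoint set of dataset indices.
    return DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)


def make_model():
    # Toy network with a BatchNorm layer, not the first-order-model generator.
    return nn.Sequential(nn.Linear(16, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Linear(32, 1))


def train_worker(rank, world_size, init_file, epochs=1):
    use_cuda = torch.cuda.is_available()
    dist.init_process_group(
        backend="nccl" if use_cuda else "gloo",
        init_method=f"file://{init_file}",  # file must not exist beforehand
        rank=rank,
        world_size=world_size,
    )
    try:
        device = torch.device(f"cuda:{rank}" if use_cuda else "cpu")
        model = make_model()
        if use_cuda:
            torch.cuda.set_device(device)
            # Swap BatchNorm layers for SyncBatchNorm (applied on GPU only here).
            model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
        model = DDP(model.to(device), device_ids=[rank] if use_cuda else None)

        dataset = make_dataset()
        sampler = make_sampler(dataset, rank, world_size)
        loader = DataLoader(dataset, batch_size=8, sampler=sampler)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = nn.MSELoss()

        for epoch in range(epochs):
            sampler.set_epoch(epoch)  # reshuffle differently each epoch
            for x, y in loader:
                optimizer.zero_grad()
                # Inputs must be moved to this rank's device explicitly.
                loss = loss_fn(model(x.to(device)), y.to(device))
                loss.backward()  # DDP averages gradients across ranks here
                optimizer.step()
    finally:
        dist.destroy_process_group()
```

One process per GPU can be started with `torch.multiprocessing.spawn(train_worker, args=(world_size, init_file), nprocs=world_size)`, where `init_file` is a path to a file that does not exist yet and that all processes can reach. Whether the synchronized BatchNorm layers in this repository behave correctly under DDP remains untested, as noted earlier in the thread.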