Open alessiapacca opened 3 years ago
Have you specified device_ids?
@AliaksandrSiarohin yes, I have always used the command from the README. It may be a problem with the server I run it on, but I wanted to understand whether the code works when the GPUs are in exclusive mode, since that may be the issue with the server.
Sorry, I have no idea what that exclusive mode means. You could check whether a simple multi-GPU CIFAR example works for you, and whether a simple CIFAR example with synchronized BN works.
@alessiapacca could you please share which dataset you managed to train the network on? Was it with Python 3.7.5?
@Eliot04 Hey, I trained on the Vox dataset and used Python 3.6.4.
@alessiapacca Super helpful. Thanks.
@alessiapacca I tried using distributed data parallel to accelerate the training, and it seems to be working. Maybe you can try this too. (Synchronized BatchNorm may have problems when distributed data parallel is used, though; I did not test it.)
@SystemErrorWang How do you use distributed data parallel to accelerate the training? I wrapped the model and datasets with DDP, but it does not seem to be working. The GPU usage was always at 0%.
I modified the code based on this repo: https://github.com/rosinality/stylegan2-pytorch. I adopted the DDP part of the stylegan2 code and combined it with the First-Order Motion Model training code. It takes some time to read the code, and unfortunately my previous code is missing because I have changed jobs. I believe it's practical and not difficult. Good luck!
One question: when I ran the training, I always used a single GPU, because when I tried to use more than one the usage was always at 0%. Does the code work when all the GPUs are set to "EXCLUSIVE" mode?
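To check which compute mode each GPU is actually in, something like the following sketch may help. It assumes `nvidia-smi` is on the PATH and queries its `compute_mode` field; the helper names `parse_compute_modes` and `query_compute_modes` are made up for illustration. In exclusive-process mode only one process can hold a context on a given GPU, so a GPU already claimed by another job would be unavailable to the training process.

```python
import subprocess


def parse_compute_modes(csv_text):
    """Parse 'index, mode' lines from nvidia-smi CSV output into a dict."""
    modes = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, mode = [part.strip() for part in line.split(",", 1)]
        modes[int(index)] = mode
    return modes


def query_compute_modes():
    """Ask nvidia-smi for the compute mode of every GPU (requires the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_modes(result.stdout)
```

Calling `query_compute_modes()` on the server returns a mapping from GPU index to the mode string reported by the driver, which can then be compared against what the cluster administrators say is configured.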