ariel415el / SimplePytorch-ALAE

Implementation of Adversarial autoencoders

Distributed version #1

Open baleksey opened 3 years ago

baleksey commented 3 years ago

First of all, thank you a lot for such a great ALAE implementation; it works great. Could you please help me make the code work on a multi-GPU system? I'm new to PyTorch and could not get your code to work with DistributedDataParallel. Can you help?

ariel415el commented 3 years ago

Hi, I'm not sure I can help you: I have no access to a multi-GPU machine, nor have I ever trained in such a setup. Maybe you can write down the problems you encountered here, and I or other users will be able to help.

BTW did you manage to train on a single GPU?

baleksey commented 3 years ago

Yeah, it works great on a single-GPU machine. The overall training cycle (with default params) for 64x64 FFHQ took me about 22 hours.

But I'm having a hard time making it work on two GPUs. There is a simple distributed template I've tried to follow (attached), but for it to work, the main model class must properly inherit from PyTorch's nn.Module.

Actually, you don't need a multi-GPU system to run this distributed template (it will just use one GPU). So if you want to adapt your current and future code (starting with ALAE :) ), just try to make it work with one GPU on your system using my distributed template. Multi-GPU support will come for free after that.

P.S. The template should work on its own by default, so you can try it out and see if it works on your side at all.

distibuted.zip
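
For reference, the core idea is the standard single-node DistributedDataParallel pattern, roughly like this (a hedged sketch of the usual setup, not the exact contents of the attached zip; the toy model and hyperparameters are just placeholders):

```python
# Minimal single-node DistributedDataParallel sketch (illustrative only).
# Launch with: torchrun --nproc_per_node=2 train_ddp.py
# (or torch.distributed.launch on older PyTorch versions)
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for each spawned process
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Any nn.Module works here; a toy model stands in for the ALAE networks
    model = nn.Linear(512, 512).to(f"cuda:{local_rank}")
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(8, 512, device=f"cuda:{local_rank}")
    loss = model(x).pow(2).mean()
    loss.backward()   # gradients are all-reduced across processes here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```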

ariel415el commented 3 years ago

Hi, I managed to wrap the models in these lines https://github.com/ariel415el/SimplePytorch-ALAE/blob/9b4374fe2967c947565aa71d6e92ef027cb37fd7/dnn/models/ALAE.py#L32-L36 with nn.DataParallel and run the training, but I have no idea about the impact since, again, I have only one GPU.
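
Roughly, the change amounts to the usual conditional nn.DataParallel wrap (a sketch; the helper name is mine and the actual sub-network attributes in ALAE.py may differ):

```python
import torch
import torch.nn as nn

def maybe_parallelize(module: nn.Module) -> nn.Module:
    """Wrap a module with nn.DataParallel when more than one GPU is visible.

    nn.DataParallel splits each input batch across the visible GPUs and
    gathers the outputs back on the default device; unlike DDP it needs
    no launcher or process group.
    """
    if torch.cuda.device_count() > 1:
        return nn.DataParallel(module)
    return module

# Each ALAE sub-network (mapping, generator, encoder, discriminator) would
# be wrapped right after construction, e.g.:
#   self.G = maybe_parallelize(self.G)
```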

Is this what you tried? Please tell me if you manage to train on multiple GPUs.

Can you share the training results you had with a single GPU? I don't think I've trained the models for that long, and I'd sure like to see the impact of longer training.

baleksey commented 3 years ago

My 22-hour run was just a test, and the results were erased afterwards. But visually they were pretty similar to yours in the description.

Actually, I originally tried DistributedDataParallel (which is different from plain nn.DataParallel and is the more recommended approach; you can see a working example of it in my attached file above). But I've now just tried wrapping the mentioned models with nn.DataParallel, and it seems like it also speeds things up. Current results for the 128x128 run, per resolution:

| Resolution | 1 GPU | 2 GPUs |
| --- | --- | --- |
| 4x4 | 13 min | 20 min |
| 8x8 | 26 min | 25 min |
| 16x16 | 48 min | 34 min |
| 32x32 | 1h 50m | 1h 7m |

It looks like the speedup will approach 2x with 2 GPUs at higher resolutions, which is nice.

I have 2x GeForce RTX 2080 Ti.