shatz01 closed this 1 year ago
Confirmed it works multi-node with exactly the same performance as an equivalent single-node multi-GPU config :) (e.g. 2 nodes x 1 GPU == 1 node x 2 GPUs)
@smartdanny is it ready?
Basically this commit can be considered ready.
I also wanted to make some of the examples, like MNIST and Knight, DDP-friendly "plug and play", but I'm having a lot of trouble with the batch sampler. PyTorch Lightning seems to have an extremely hacky way of allowing a custom sampler to work in DDP (see https://github.com/Lightning-AI/lightning/pull/13640).
I can get it working, but it would probably take a bit of work, and I'm not sure it's worth it at the moment. The options are:
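For context on why the batch sampler is the sticking point: DDP expects each rank to see a disjoint shard of the data, but a custom batch sampler has no notion of ranks. A minimal rank-aware wrapper, in plain Python with hypothetical names (this is an illustrative sketch, not the fuse or Lightning API), might look like:

```python
import math


class DistributedBatchSamplerWrapper:
    """Illustrative sketch: shard an arbitrary batch sampler across DDP ranks.

    Each rank keeps every `num_replicas`-th batch, so together the ranks
    cover every batch exactly once per epoch. All names here are
    hypothetical, not part of fuse or PyTorch Lightning.
    """

    def __init__(self, batch_sampler, num_replicas, rank):
        self.batch_sampler = batch_sampler  # any iterable of index lists
        self.num_replicas = num_replicas    # world size
        self.rank = rank                    # this process's rank

    def __iter__(self):
        # Round-robin assignment: rank r yields batches r, r+N, r+2N, ...
        for i, batch in enumerate(self.batch_sampler):
            if i % self.num_replicas == self.rank:
                yield batch

    def __len__(self):
        return math.ceil(len(self.batch_sampler) / self.num_replicas)
```

A real implementation would also need to handle epoch-dependent shuffling and uneven batch counts across ranks, which is roughly where Lightning's workaround in the linked PR gets hacky.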
Let me know what you think is best @mosheraboh @SagiPolaczek
@smartdanny , we want one example with our batch sampler that works - I guess that it means to use datamodule.
The last thing is to make the Knight example DDP-ready with the sampler by making a datamodule. Feel free to merge, or I can also add it to this PR. @mosheraboh
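To make the plan concrete, here is a framework-free sketch of the shape such a datamodule could take: it owns the dataset and the batching logic, and hands each DDP rank a disjoint shard of batches. In the actual PR this would subclass `pytorch_lightning.LightningDataModule`, construct the fuse batch sampler in `setup()`, and return a torch `DataLoader`; everything below is a hypothetical stand-in, not the fuse API.

```python
class KnightDataModuleSketch:
    """Illustrative, framework-free sketch of a DDP-ready datamodule.

    All names are hypothetical; the real version would subclass
    pytorch_lightning.LightningDataModule and use the fuse batch sampler.
    """

    def __init__(self, dataset, batch_size, num_replicas=1, rank=0):
        self.dataset = dataset
        self.batch_size = batch_size
        self.num_replicas = num_replicas  # DDP world size
        self.rank = rank                  # this process's rank

    def setup(self, stage=None):
        # In the real datamodule, this is where the fuse batch sampler
        # (e.g. a class-balancing sampler) would be constructed.
        indices = list(range(len(self.dataset)))
        self._batches = [
            indices[i:i + self.batch_size]
            for i in range(0, len(indices), self.batch_size)
        ]

    def train_dataloader(self):
        # Yield only this rank's shard so ranks see disjoint batches;
        # the real version would return a torch DataLoader instead.
        return (
            [self.dataset[i] for i in batch]
            for k, batch in enumerate(self._batches)
            if k % self.num_replicas == self.rank
        )
```

The design choice this illustrates: keeping the sharding inside the datamodule lets the Trainer stay oblivious to the custom sampler, which sidesteps Lightning's automatic sampler replacement.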
Single-node multi-GPU works (tested with 2 and 4 GPUs)