BiomedSciAI / fuse-med-ml

A python framework accelerating ML based discovery in the medical field by encouraging code reuse. Batteries included :)
Apache License 2.0
137 stars 34 forks source link

ddp_fix #270

Closed shatz01 closed 1 year ago

shatz01 commented 1 year ago

Single node multi gpu works (tested with 2 and 4 gpu)

shatz01 commented 1 year ago

Confirmed it works multi-node with exactly the same performance as an equivalent multi-gpu single node config :) (e.g. 2 node 1 gpu == 1 node 2 gpu)

mosheraboh commented 1 year ago

@smartdanny is it ready?

shatz01 commented 1 year ago

Basically this commit can be considered ready.

I wanted to also make some of the examples like mnist and knight ddp friendly "plug and play" but I am having a lot of trouble with the batch sampler. It seems pytorch lightning has an extremely hacky way to allow a sampler to work in ddp (see https://github.com/Lightning-AI/lightning/pull/13640).

I can get it working, but it would probably take a bit of work, and im not sure if its worth it at the moment. The options are:

  1. Leave examples the way they are (User would have to know to remove batch sampler)
  2. Remove batch sampler from examples
  3. Make all the examples use lightning datamodule (which has built in support for additional samplers, albeit sketchy)

Let me know what you think is best @mosheraboh @SagiPolaczek

mosheraboh commented 1 year ago

@smartdanny , we want one example with our batch sampler that works - I guess that it means to use datamodule.

shatz01 commented 1 year ago

Last thing is just to make a Knight ddp ready with sampler my making an datamodule. Feel free to merge or I can also add it to this PR. @mosheraboh