Closed Nic-Ma closed 1 year ago
Hi @yiheng-wang-nv ,
Let's try to fix this and verify the multi-node training / evaluation on our cloud platform. CC @SachidanandAlle @tangy5 .
Thanks.
I have this in a bundle I've been working on that yields a valid rank whether or not distributed training is active:

```yaml
is_dist: '$dist.is_initialized()'
rank: '$dist.get_rank() if @is_dist else 0'
is_not_rank0: '$@rank > 0'  # used to disable saving and logging on other ranks
device: '$torch.device(f"cuda:{@rank}" if torch.cuda.is_available() else "cpu")'
```
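For multi-node jobs, the same pattern could be adapted so that the CUDA index comes from the local rank rather than the global rank. This is only a sketch, assuming torchrun sets the `LOCAL_RANK` environment variable and that `os` is available to the bundle's expressions:

```yaml
is_dist: '$dist.is_initialized()'
rank: '$dist.get_rank() if @is_dist else 0'
# LOCAL_RANK restarts at 0 on every node, unlike the global rank
local_rank: '$int(os.environ.get("LOCAL_RANK", 0))'
device: '$torch.device(f"cuda:{@local_rank}" if torch.cuda.is_available() else "cpu")'
```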
Hi @ericspod ,
Your example is the same as in all the existing bundles; the current problem is specifically with the `device` variable in multi-node training. Processes on different nodes should use the same local device indices, since each node's GPUs are numbered from 0.
Thanks.
**Is your feature request related to a problem? Please describe.**
There is an error in the bundle config:

```yaml
"device": "$torch.device(f'cuda:{dist.get_rank()}')"
```

It should be changed; otherwise, `dist.get_rank()` will return a rank > 7 for node 1, 2, 3..., which is not a valid CUDA device index on those nodes. The same issue exists in all the existing bundles. Refer to: https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py#L62 CC @SachidanandAlle
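To make the failure concrete: under torchrun, a process's global rank is `node_rank * nproc_per_node + local_rank`, so only the local part is a valid CUDA index on its node. A minimal stdlib sketch of that arithmetic (the function name is illustrative, not from the bundles):

```python
def cuda_index(global_rank: int, gpus_per_node: int) -> int:
    # The global rank keeps growing across nodes (0..world_size-1);
    # only its remainder modulo the per-node process count maps to a
    # GPU that actually exists on the local node. In practice torchrun
    # exports this value directly as the LOCAL_RANK environment variable.
    return global_rank % gpus_per_node

# On node 1 of a 2-node x 8-GPU job, global ranks 8..15 run locally:
print(cuda_index(11, 8))  # 3 -> use torch.device("cuda:3"), not "cuda:11"
```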