Closed Nic-Ma closed 1 year ago
Hi @yiheng-wang-nv ,
Let's try to fix this and verify the multi-node training / evaluation on our cloud platform. CC @SachidanandAlle @tangy5 .
Thanks.
I have this in a bundle I've been working on that yields a valid rank whether or not distributed training is active:

```yaml
is_dist: '$dist.is_initialized()'
rank: '$dist.get_rank() if @is_dist else 0'
is_not_rank0: '$@rank > 0'  # used to disable saving and logging on other ranks
device: '$torch.device(f"cuda:{@rank}" if torch.cuda.is_available() else "cpu")'
```
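For multi-node jobs, the same pattern could be adapted so that the CUDA index comes from the local rank rather than the global rank. This is only a sketch, assuming torchrun sets the `LOCAL_RANK` environment variable and that `os` is available to the bundle's expressions:

```yaml
is_dist: '$dist.is_initialized()'
rank: '$dist.get_rank() if @is_dist else 0'
# LOCAL_RANK restarts at 0 on every node, unlike the global rank
local_rank: '$int(os.environ.get("LOCAL_RANK", 0))'
device: '$torch.device(f"cuda:{@local_rank}" if torch.cuda.is_available() else "cpu")'
```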
Hi @ericspod ,
Your example is the same as in all the existing bundles; the current problem is specifically with the `device` variable in multi-node training. Processes on different nodes should use the same local device indices, since each node's GPUs are numbered from 0.
Thanks.
**Is your feature request related to a problem? Please describe.**
There is an error in the bundle config:

```yaml
"device": "$torch.device(f'cuda:{dist.get_rank()}')"
```

It should be changed; otherwise, `dist.get_rank()` will return a rank > 7 for node 1, 2, 3..., which is not a valid CUDA device index on those nodes. The same issue exists in all the existing bundles. Refer to: https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py#L62 CC @SachidanandAlle
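To make the failure concrete: under torchrun, a process's global rank is `node_rank * nproc_per_node + local_rank`, so only the local part is a valid CUDA index on its node. A minimal stdlib sketch of that arithmetic (the function name is illustrative, not from the bundles):

```python
def cuda_index(global_rank: int, gpus_per_node: int) -> int:
    # The global rank keeps growing across nodes (0..world_size-1);
    # only its remainder modulo the per-node process count maps to a
    # GPU that actually exists on the local node. In practice torchrun
    # exports this value directly as the LOCAL_RANK environment variable.
    return global_rank % gpus_per_node

# On node 1 of a 2-node x 8-GPU job, global ranks 8..15 run locally:
print(cuda_index(11, 8))  # 3 -> use torch.device("cuda:3"), not "cuda:11"
```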