Open TheShadow29 opened 2 years ago
Hi
Yes, this is a known limitation currently. While it was a true limitation in the past, today it is somewhat artificial. I opened a proposal #14078 which should pave the way to remove this limitation eventually.
After #14078, you would simply set `devices="auto"` or `devices=-1`, and then the actual number of devices can differ per node.
I'm removing the bug label because this can't really be delivered as a bug fix, and depends on the decision in #14078.
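To illustrate the difference, here is a minimal sketch (not the Lightning implementation, all function names are illustrative): today the launcher assumes every node has the same device count, whereas per-node detection would derive the world size by summing each node's actual count.

```python
# Hypothetical sketch: uniform devices-per-node vs. per-node detection.
# Neither function is part of the Lightning API.

def uniform_world_size(num_nodes: int, devices_per_node: int) -> int:
    # Current model: every node is assumed to have the same device count.
    return num_nodes * devices_per_node

def heterogeneous_world_size(devices_by_node: list) -> int:
    # After per-node detection (e.g. devices="auto"), each node
    # contributes its actual device count.
    return sum(devices_by_node)

# SLURM example from the issue: 1 node with 6 GPUs, 2 nodes with 1 GPU each.
print(uniform_world_size(3, 6))             # 18 ranks assumed (wrong)
print(heterogeneous_world_size([6, 1, 1]))  # 8 ranks (correct)
```

The mismatch (18 assumed vs. 8 actual ranks) is why a single `devices` value cannot describe a heterogeneous allocation.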
🐛 Bug
Currently, `Trainer` requires `num_nodes` and `devices`, but the device count may differ across nodes. For instance, SLURM may provide 1 node with 6 GPUs and 2 other nodes with 1 GPU each, for a total of 8 GPUs across 3 nodes. Right now, this gives the following error:

To Reproduce
Note: `SL_NUM_NODES` is being set externally.
And here is the SLURM script (need to add , ):
Expected behavior
Ideally, the world size should be provided by the cluster environment, and the Trainer should create subprocesses based only on the number of GPUs available on the current node.
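The expected behavior above can be sketched in plain Python (names are illustrative, not the Lightning API): given the per-node GPU counts reported by the cluster, each node would spawn only its own local processes, and global ranks would be assigned from cumulative offsets.

```python
# Hedged sketch: deriving global ranks when the world size comes from the
# cluster environment and each node spawns one process per local GPU.
from itertools import accumulate

def global_ranks(devices_by_node: list) -> list:
    """Map each node to the global ranks of its local processes."""
    # Offset of each node = total number of devices on preceding nodes.
    offsets = [0] + list(accumulate(devices_by_node))[:-1]
    return [
        list(range(off, off + n))
        for off, n in zip(offsets, devices_by_node)
    ]

# 1 node with 6 GPUs + 2 nodes with 1 GPU each -> world size 8.
ranks = global_ranks([6, 1, 1])
print(ranks)                        # [[0, 1, 2, 3, 4, 5], [6], [7]]
print(sum(len(r) for r in ranks))   # world size: 8
```

Each node only needs its own entry of `ranks` to launch subprocesses, while the flattened total gives the correct world size of 8.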
Environment
cc @awaelchli @tchaton @rohitgr7 @justusschock @kaushikb11 @akihironitta