leezu opened this issue 3 years ago (status: Open)
@leezu Could you describe why num_nodes being passed through the trainer isn't preferable vs. setting it on the cluster environment? Using the argument from the trainer is the direction https://github.com/PyTorchLightning/pytorch-lightning/pull/7026/ is going in.
Because trainer.num_nodes needs to be set up explicitly, whereas the cluster environment can provide the information automatically. I don't think there's a scenario in which setting trainer.num_nodes to a value different from the true number of nodes provided by the ClusterEnvironment would be meaningful. Perhaps trainer.num_nodes could be inferred from the ClusterEnvironment instead of doing the inference inside the training type plugin?
I worry about having two sources of truth here. num_nodes on the trainer works regardless of the cluster environment used. Only specific training type plugins currently require the cluster environment to be set, so I personally prefer keeping the trainer as the source of truth vs. having overrides and also setting it on the cluster environment. @awaelchli, what do you think?
> so I personally prefer keeping the trainer as the source of truth vs. having overrides and also setting it on the cluster environment
That's fine. But that doesn't preclude making num_nodes in the trainer optional and inferring it from the cluster environment when it isn't specified explicitly. A warning should probably also be raised if the cluster environment and the trainer disagree on the number of nodes.
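As a minimal sketch of that fallback (resolve_num_nodes is a hypothetical helper, not an existing Lightning function):

```python
import warnings
from typing import Optional


def resolve_num_nodes(trainer_num_nodes: Optional[int], env_num_nodes: Optional[int]) -> int:
    # Hypothetical helper: prefer an explicitly set Trainer value, fall back
    # to the cluster environment, and warn when the two disagree.
    if trainer_num_nodes is None:
        return env_num_nodes if env_num_nodes is not None else 1
    if env_num_nodes is not None and env_num_nodes != trainer_num_nodes:
        warnings.warn(
            f"Trainer(num_nodes={trainer_num_nodes}) disagrees with the cluster "
            f"environment ({env_num_nodes} nodes); using the Trainer value."
        )
    return trainer_num_nodes
```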
I have a refactor pending for SLURM in #6303. There, the behaviour is: if num_nodes doesn't match the environment, it goes into interactive mode; otherwise it launches batch mode.
For this I had planned:
from typing import Optional

class SomeClusterEnv:
    def __init__(self, num_nodes: Optional[int] = None, **kwargs):  # other args elided
        # the cluster env can decide what to do here:
        # 1) it can warn or error if num_nodes is not set correctly
        # 2) it can infer num_nodes from the environment automatically
        # 3) something else (e.g. SLURM logic)
        ...

    def num_nodes(self) -> int:
        return ...
And SLURM implements the above as described.
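As a rough illustration only, assuming the standard SLURM_NNODES environment variable (this is not the actual SLURMEnvironment implementation, and the interactive-mode handling is only gestured at per #6303):

```python
import os
from typing import Optional


class SLURMEnvironment:
    # Rough sketch only; the real SLURMEnvironment in pytorch_lightning differs.
    def __init__(self, num_nodes: Optional[int] = None):
        raw = os.environ.get("SLURM_NNODES")
        detected = int(raw) if raw is not None else None
        # A mismatch between the user-set value and the scheduler's value could,
        # e.g., trigger interactive mode as proposed in #6303.
        self.interactive = (
            num_nodes is not None and detected is not None and num_nodes != detected
        )
        self._num_nodes = num_nodes if num_nodes is not None else (detected or 1)

    def num_nodes(self) -> int:
        return self._num_nodes
```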
For the rest of the Lightning code base, num_nodes is derived from the cluster environment. This also addresses @ananthsub's concern about how many places the information comes from.
Let me know what you think.
Thank you @awaelchli. Can you elaborate more on "2) it can infer num_nodes from the environment automatically"? Is this expected to make trainer.num_nodes optional and infer it from the environment when it isn't specified? Or is this local to the cluster environment object?
The idea would be to make Trainer(num_nodes=...) optional depending on the plugin, yes, and then trainer.num_nodes will always be what the environment determines. I'm not sure what that means for SLURM interactive mode, but it should be fine.
🚀 Feature
num_nodes must currently be specified manually by the user. However, the number of nodes is generally known in a cluster environment [1] and could be provided by, and initialized from, the ClusterEnvironment.

[1] For example, $AWS_BATCH_JOB_NUM_NODES in https://docs.aws.amazon.com/batch/latest/userguide/multi-node-parallel-jobs.html
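As an illustration, a hypothetical AWSBatchEnvironment (not an existing Lightning class) could expose that variable like this:

```python
import os


class AWSBatchEnvironment:
    # Hypothetical example: AWS Batch multi-node parallel jobs export this variable.
    def num_nodes(self) -> int:
        return int(os.environ.get("AWS_BATCH_JOB_NUM_NODES", "1"))
```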
Pitch
https://github.com/PyTorchLightning/pytorch-lightning/blob/763a9a9495977b23cbd6a57f10253b662fd592a5/pytorch_lightning/plugins/training_type/ddp.py#L62-L75
could be updated to initialize num_nodes from the ClusterEnvironment if one is provided and it implements a num_nodes method (see the sketch below).

cc @borda @awaelchli @ananthsub
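A simplified sketch of that change (the signature is abbreviated; the actual ddp.py constructor takes more arguments, and this is not the real implementation):

```python
from typing import Optional


class DDPPlugin:
    # Simplified sketch of the proposed __init__ fallback; not the actual signature.
    def __init__(self, num_nodes: Optional[int] = None, cluster_environment=None):
        if num_nodes is None and cluster_environment is not None:
            # Fall back to the environment's node count when the user didn't set one.
            num_nodes = cluster_environment.num_nodes()
        self.num_nodes = num_nodes if num_nodes is not None else 1
        self.cluster_environment = cluster_environment
```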