Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Allow obtaining num_nodes from ClusterEnvironment #7361

Open leezu opened 3 years ago

leezu commented 3 years ago

🚀 Feature

num_nodes currently must be specified manually by the user. However, the number of nodes is generally known in a cluster environment [1] and could be provided by the ClusterEnvironment and used to initialize num_nodes.

[1] For example $AWS_BATCH_JOB_NUM_NODES in https://docs.aws.amazon.com/batch/latest/userguide/multi-node-parallel-jobs.html
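
For illustration, a ClusterEnvironment subclass could read the node count from such a variable. A minimal sketch, assuming a hypothetical AWSBatchEnvironment class and a num_nodes() hook that does not exist in the current API:

```python
import os

from pytorch_lightning.plugins.environments import ClusterEnvironment


class AWSBatchEnvironment(ClusterEnvironment):
    """Hypothetical environment for AWS Batch multi-node-parallel jobs (sketch only)."""

    def num_nodes(self) -> int:
        # AWS Batch exposes the node count of a multi-node-parallel job in this variable
        return int(os.environ["AWS_BATCH_JOB_NUM_NODES"])
```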

Pitch

https://github.com/PyTorchLightning/pytorch-lightning/blob/763a9a9495977b23cbd6a57f10253b662fd592a5/pytorch_lightning/plugins/training_type/ddp.py#L62-L75

could be updated to initialize num_nodes from the ClusterEnvironment, if a ClusterEnvironment is provided and implements a num_nodes method.
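
Concretely, the constructor could fall back to the environment when no explicit value is given. A rough sketch of the idea, not the actual ddp.py signature:

```python
class DDPPluginSketch:
    """Sketch of only the relevant part of the constructor."""

    def __init__(self, num_nodes=None, cluster_environment=None, **kwargs):
        if num_nodes is None and cluster_environment is not None and hasattr(cluster_environment, "num_nodes"):
            # prefer the value reported by the cluster environment when the user passed none
            num_nodes = cluster_environment.num_nodes()
        self.num_nodes = num_nodes if num_nodes is not None else 1
        self.cluster_environment = cluster_environment
```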

cc @borda @awaelchli @ananthsub

ananthsub commented 3 years ago

@leezu Could you describe why passing num_nodes through the Trainer isn't preferable to setting it on the cluster environment? Using the argument from the trainer is the direction https://github.com/PyTorchLightning/pytorch-lightning/pull/7026/ is going in.

leezu commented 3 years ago

Because trainer.num_nodes needs to be set up explicitly, whereas the cluster environment can provide the information automatically. I don't think there is a scenario where setting trainer.num_nodes to a value different from the true number of nodes reported by the ClusterEnvironment would be meaningful. Perhaps trainer.num_nodes could be inferred from the ClusterEnvironment instead of doing that inference inside the training type plugin?

ananthsub commented 3 years ago

I worry about having two sources of truth here. num_nodes on the trainer works regardless of the cluster environment used. Only specific training type plugins currently require the cluster environment to be set, so I personally prefer keeping the trainer as the source of truth rather than having overrides and also setting it on the cluster environment. @awaelchli what do you think?

leezu commented 3 years ago

> so I personally prefer keeping the trainer as the source of truth vs having overrides and also setting it on the cluster environment

That's fine, but it doesn't preclude making num_nodes in the trainer optional and inferring it from the cluster environment when it isn't specified explicitly. A warning should probably also be raised if the cluster environment and the trainer disagree on the number of nodes.
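
The resolution logic could look roughly like this. An illustrative helper, not existing Lightning code; rank_zero_warn is Lightning's standard warning utility:

```python
from pytorch_lightning.utilities import rank_zero_warn


def resolve_num_nodes(trainer_num_nodes, cluster_environment):
    """Illustrative only: prefer the explicit Trainer value, else ask the environment."""
    env_num_nodes = getattr(cluster_environment, "num_nodes", lambda: None)()
    if trainer_num_nodes is None:
        # infer from the cluster environment when the user did not specify a value
        return env_num_nodes if env_num_nodes is not None else 1
    if env_num_nodes is not None and env_num_nodes != trainer_num_nodes:
        rank_zero_warn(
            f"Trainer(num_nodes={trainer_num_nodes}) does not match the cluster "
            f"environment, which reports {env_num_nodes} nodes."
        )
    return trainer_num_nodes
```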

awaelchli commented 3 years ago

I have a refactor pending for SLURM in #6303. There, the behaviour is: if num_nodes doesn't match the environment, it goes into interactive mode; otherwise it launches batch mode.

For this I had planned:

```python
from typing import Optional


class SomeClusterEnv:

    def __init__(self, num_nodes: Optional[int] = None, ...):
        # the cluster env can decide what to do here:
        # 1) it can warn or error if num_nodes is not set correctly
        # 2) it can infer num_nodes from the environment automatically
        # 3) something else (e.g. SLURM logic)
        ...

    def num_nodes(self) -> int:
        return ...
```

And SLURM implements the above as described.
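
For SLURM specifically, the hook could presumably be as simple as reading the node count the scheduler already exports. A sketch only; the real SLURMEnvironment implementation may differ:

```python
import os


class SLURMEnvironmentSketch:
    def num_nodes(self) -> int:
        # SLURM sets SLURM_NNODES to the number of nodes allocated to the job
        return int(os.environ["SLURM_NNODES"])
```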

For the rest of the Lightning code base, num_nodes would be derived from the cluster environment. This also addresses @ananthsub's concern about how many places the information comes from.

Let me know what you think.

leezu commented 3 years ago

Thank you @awaelchli. Can you elaborate on 2) "It can infer num nodes from environment automatically"? Is this expected to make trainer.num_nodes optional and infer it from the environment when it isn't specified? Or is this local to the cluster environment object?

awaelchli commented 3 years ago

The idea would be to make Trainer(num_nodes=...) optional depending on the plugin, yes, and then trainer.num_nodes will always be what the environment determines. I'm not sure what that means for SLURM interactive mode, but it should be fine.