Open awaelchli opened 2 years ago
+1000 for this :D I think if there are specifications where this doesn't work with the cluster env, we should rather think about implementing these missing cluster envs :)
Noob q:
Is it possible that num_nodes < world_size/gpus_per_device, where the user has access to more nodes in their cluster env than they actually need for their experiment?
@rohitgr7 The ClusterEnvironment we have here means just the environment the current "job" or "experiment" runs in. This is not providing information about the entire cluster. When we talk about world size or number of nodes, it is always just about how many of these resources are used for the current job, not the entirety of resources globally in the cluster.
got it!
haven't trained on multi-node, still learning!! :)
Why should devices equal --ntasks-per-node? devices is the number of GPUs per node, while --ntasks-per-node determines the number of processes on the current node. So how is data parallelism achieved when each GPU has only one process, and that one process has to handle a batch size like 32 or 64, especially for image datasets?
@x423xu That's how data-parallel training works. One process per GPU. Each process training the model with different data.
This is different from data loading. In each process, you can set DataLoader(num_workers=x) to whatever value you find works well. These two settings, devices=x and num_workers=y, are independent of each other. Two different concepts. Sorry if this is confusing!
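To make the distinction concrete, here is a minimal pure-Python sketch (the helper `shard_indices` is hypothetical, not the PyTorch API) of how data-parallel training shards a dataset so that each of the `devices` processes sees different samples. This sharding happens per process and is independent of how many `num_workers` each process's DataLoader uses for loading.

```python
def shard_indices(num_samples: int, world_size: int, rank: int) -> list[int]:
    """Return the dataset indices assigned to one process.

    Mimics the round-robin strategy a distributed sampler uses:
    process `rank` out of `world_size` gets every world_size-th sample.
    """
    return list(range(rank, num_samples, world_size))

# With devices=4 (one process per GPU), a 12-sample dataset splits as:
for rank in range(4):
    print(rank, shard_indices(12, 4, rank))
# rank 0 gets [0, 4, 8], rank 1 gets [1, 5, 9], and so on:
# every sample is seen exactly once per epoch across all processes.
```

Inside each process you would then wrap its shard in a DataLoader, where num_workers only controls how many subprocesses prefetch and decode batches for that one shard.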
I think this topic is a bit off from what is discussed here on the issue. But if you have any further questions, feel free to ping me on our community slack or in our forum.
@awaelchli Thanks, I do have some questions and would move to Slack.
hi @awaelchli! how is it going with this issue? I'm interested in training on multiple nodes with different number of gpus
🚀 Feature

Remove the redundant num_nodes Trainer argument. Knowing the number of nodes is not required, and the world size is provided by the cluster environment anyway.

Motivation
Users have always struggled to get multi-node training to work because there are many points of failure in the configuration. One of them is setting the number of devices and nodes: in Lightning, you are required to set devices=n and num_nodes=m to match the world_size=n*m set by the cluster or launcher (torchrun, SLURM, etc.). When num_nodes is not set correctly, too few processes join the group and the program gets stuck.

We periodically receive issues and questions where it becomes clear that users forget to set these values correctly (#13804, #14022, #10098, #8707, #7429, #6206, to name a few). But technically, Lightning never needs to know the number of nodes, only the world size, and that value can be read from the ClusterEnvironment directly.
The conclusion: num_nodes in the Trainer is redundant.

Another benefit of not having to specify the number of nodes is that the cluster can dynamically allocate nodes and devices based on available resources. In the end, a user who wants to train on 8 GPUs does not care whether that is 2 nodes with 4 GPUs each or 4 nodes with 2 GPUs each.
Historically, num_nodes was introduced before the ClusterEnvironment abstraction became the primary way of collecting information about the cluster.

Pitch
Deprecate the redundant num_nodes Trainer argument.

Pros:
Cons:
Alternatives
Keep it, but validate num_nodes setting against cluster environment (#10107)
Additional context
#13506, #7361, #13605
If you enjoy Lightning, check out our other projects! ⚡
Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.
Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.
Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.
Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.
cc @tchaton @rohitgr7 @justusschock @kaushikb11 @awaelchli @akihironitta @ananthsub @borda