Open waynemystir opened 9 months ago
Did you set a different NODE_RANK on each node? I currently run multi-node training with Lightning v2.2.0 + DeepSpeed on Azure's GPU cluster successfully, without manually setting any env variable (maybe it's set by the cluster system).
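For context, if no launcher or cluster exports these for you, a rough sketch of what each node would need before the Trainer starts could look like this (the addresses and values below are placeholders, not from this thread):

```python
# Hedged sketch: only needed when no launcher/cluster sets these for you.
# Every node uses the same MASTER_ADDR/MASTER_PORT but a different NODE_RANK.
import os

os.environ.setdefault("MASTER_ADDR", "10.0.0.4")  # placeholder: IP of the rank-0 node
os.environ.setdefault("MASTER_PORT", "29500")     # placeholder: any free port, identical on all nodes
os.environ.setdefault("NODE_RANK", "0")           # 0 on the first node, 1 on the second, ...
```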
@p208p2002 can you share how you setup Azure to work with lightning across multiple nodes? I have Azure working on multiple GPUs on a single node with DDP, but not across multiple nodes. Would love to see how you did it!
Sure, but please note that what I use is a compute cluster from Azure ML Studio.
First you should create the compute cluster under AML.
Then create a project-specific environment; it's recommended to reference DeepSpeed's official Docker image.
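A rough sketch of registering such an environment with the AML v2 Python SDK might look like this; the environment name, image tag, and conda file below are placeholders, not the author's actual setup:

```python
# Sketch of registering a custom environment with the AML v2 SDK.
# The name, image tag, and conda file are assumptions; point `image` at
# DeepSpeed's official Docker image (or a Dockerfile derived from it).
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Environment
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(DefaultAzureCredential())

env = Environment(
    name="deepspeed-training",           # hypothetical name
    image="deepspeed/deepspeed:latest",  # placeholder: pick the official DeepSpeed image/tag you need
    conda_file="environment.yml",        # extra Python deps (lightning, transformers, ...)
    description="DeepSpeed-based training environment",
)
ml_client.environments.create_or_update(env)
```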
Next, write a job submit script with the AML Python SDK. Through the SDK, you can specify which runtime to use and how to start the training program.
I can show you part of my job submit script:
```python
import os  # needed below for os.environ

from azure.ai.ml import command, Output, MLClient, Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.identity import DefaultAzureCredential
...

job = command(
    display_name=...,
    code="./",  # local path where the code is stored
    environment_variables={
        "TRAINING_CONFIG": TRAINING_CONFIG,
        "WANDB_API_KEY": WANDB_API_KEY,
        "HF_TOKEN": HF_TOKEN,
    },
    command="python sft_trainer.py --num_nodes ${{inputs.num_nodes}} --gpus_per_node ${{inputs.gpus_per_node}} --log_dir ${{outputs.log_dir}} fit",
    inputs=inputs,
    outputs=outputs,
    environment=os.environ["AZURE_ENVIRONMENT"],
    compute=os.environ["AZURE_COUPUTE"],
    instance_count=inputs["num_nodes"],
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": inputs["gpus_per_node"],
    },
)

# submit the command
ml_client.jobs.create_or_update(job)
```
The code above is not complete; you will need to finish it yourself.
You can see that in `command` I pass the arg `--num_nodes` to sft_trainer.py, which is later passed to Lightning's Trainer. In `distribution`, `type` and `process_count_per_instance` are set; also set `instance_count` greater than 1 for multi-node training.
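On the training-script side, a minimal sketch of how sft_trainer.py could pick up those flags might be (only `--num_nodes`, `--gpus_per_node`, and `--log_dir` come from the command above; the rest is assumed boilerplate):

```python
# Hedged sketch of the CLI parsing in sft_trainer.py.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--num_nodes", type=int, default=1)
parser.add_argument("--gpus_per_node", type=int, default=1)
parser.add_argument("--log_dir", type=str, default="./logs")
args, _ = parser.parse_known_args()  # ignore the trailing "fit" sub-command here
```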
Finally, make some small modifications to the Lightning Trainer:
```python
# trainer
trainer = Trainer(
    num_nodes=args.num_nodes,
    devices=args.gpus_per_node,
    ...
)
```
The last thing you should know is how the distributed training system works. In short, the system provides some environment variables to identify the master and worker nodes so that they can communicate with each other.
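To see what the cluster actually provides, a quick check at the top of the training script could print the usual torch.distributed rendezvous variables; which of these are set depends on the cluster and the distribution type, so treat this only as a diagnostic sketch:

```python
# Diagnostic sketch: dump the environment variables a managed cluster is
# typically expected to set for distributed training.
import os

for var in ("MASTER_ADDR", "MASTER_PORT", "NODE_RANK", "WORLD_SIZE", "RANK", "LOCAL_RANK"):
    print(f"{var}={os.environ.get(var, '<unset>')}")
```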
This article may help: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-distributed-gpu?view=azureml-api-2
Thanks
Thanks @p208p2002, I got it all working on my cluster. I haven't experimented with DeepSpeed yet, but that looks like an interesting avenue for speed-ups. Not sure my model is large enough to warrant it though (millions, not billions, of params, for mobile).
Bug description
I am trying to run a very simple training script for 2 nodes and I always get this error:
Output:
What version are you seeing the problem on?
v2.2
How to reproduce the bug
Error messages and logs
Environment
Current environment
```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
```
More info
No response