schheda1 opened this issue 1 week ago
No, `srun` was used to launch the example. The resources are controlled by `sbatch` during submission and read accordingly by modifying `initialize_distributed()` in the example script. An excerpt for reference:
```python
import os
import torch

# Rank/topology information comes from the Slurm environment set up by srun.
rank = int(os.getenv("SLURM_PROCID"))
local_rank = int(os.getenv("SLURM_LOCALID"))
world_size = int(os.getenv("SLURM_NTASKS"))

# The launch node serves as the rendezvous host for torch.distributed.
address = os.getenv("SLURM_LAUNCH_NODE_IPADDR")
port = "29500"
os.environ["MASTER_ADDR"] = address
os.environ["MASTER_PORT"] = port

# Bind each process to its GPU before initializing NCCL.
torch.cuda.set_device(local_rank)
torch.distributed.init_process_group(backend="nccl", init_method="env://",
                                     world_size=world_size, rank=rank)
```
The indexing errors have been fixed locally; however, the hangs remain when TP=1, PP=2, and one node with 2 GPUs is allocated to the job.
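A minimal point-to-point check, sketched below, can rule out basic NCCL connectivity between the two ranks before suspecting the schedule itself; the file name `p2p_check.py` and the single send/recv are hypothetical, not part of the example script:

```python
# p2p_check.py -- hypothetical standalone sanity check, launched the same way:
#   srun -N1 -n2 python -u p2p_check.py
import os
import torch
import torch.distributed as dist

rank = int(os.getenv("SLURM_PROCID"))
world_size = int(os.getenv("SLURM_NTASKS"))
os.environ["MASTER_ADDR"] = os.getenv("SLURM_LAUNCH_NODE_IPADDR")
os.environ["MASTER_PORT"] = "29500"
torch.cuda.set_device(int(os.getenv("SLURM_LOCALID")))
dist.init_process_group(backend="nccl", world_size=world_size, rank=rank)

# Mimic the warmup handoff: rank 0 sends one tensor to rank 1.
t = torch.full((4,), float(rank), device="cuda")
if rank == 0:
    dist.send(t, dst=1)
else:
    dist.recv(t, src=0)
    print(f"rank {rank} received {t.tolist()}", flush=True)

dist.barrier()
dist.destroy_process_group()
```

If this completes, plain NCCL send/recv between the two GPUs works and the problem is more likely in how the pipeline groups or schedule are set up; if it hangs the same way, the issue is below Megatron-Core.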
**Describe the bug**
When the provided example script is configured to use pipeline parallelism, two different behaviours are observed.
Running with `NCCL_DEBUG` enabled does not surface any errors.

**To Reproduce**
PP and TP are modified manually with arguments to `initialize_distributed()`. The script is run as `srun python -u run_simple_mcore_train_loop.py`.
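For context, this is roughly how those arguments flow into Megatron-Core's model-parallel setup; the wiring below is a sketch assuming the public `megatron.core.parallel_state` API, not a verbatim copy of the example script:

```python
import torch
from megatron.core import parallel_state

def initialize_distributed(tensor_model_parallel_size=1,
                           pipeline_model_parallel_size=1):
    # ... torch.distributed init via the Slurm variables shown above ...

    # Sanity check: the Slurm task count must be divisible by TP x PP,
    # otherwise Megatron-Core cannot form complete parallel groups.
    world_size = torch.distributed.get_world_size()
    assert world_size % (tensor_model_parallel_size *
                         pipeline_model_parallel_size) == 0

    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=tensor_model_parallel_size,
        pipeline_model_parallel_size=pipeline_model_parallel_size,
    )
```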
**Expected behavior**
The example runs without throwing any errors.
**Stack trace/logs**
Truncated stack trace for Case 1:

Stderr for Case 2:
**Environment (please complete the following information):**
PyTorch version: 2.5.0a0+git153362f
**Proposed fix**
N/A
**Additional context**
For the case where the hangs were observed, the script seems to run into a problem right around the pipeline warmup phase, before the 1F1B schedule, in `forward_backward_pipelining_without_interleaving()` in `megatron/core/pipeline_parallel/schedules.py`. Any resolution would be greatly appreciated!
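One generic way to pinpoint where each rank is stuck during such a hang (a standard-library debugging sketch, not something the example already does) is to arm a delayed traceback dump before entering the training loop:

```python
import sys
import faulthandler

# If the process is still running after 120 s, dump every thread's stack
# to stderr, and keep repeating so each srun task reports where it hangs.
faulthandler.dump_traceback_later(120, repeat=True, file=sys.stderr, exit=False)
```

Comparing the dumps from the two ranks should show whether both are blocked on the warmup send/recv inside `forward_backward_pipelining_without_interleaving()` or whether one rank never reaches it.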