Azure / azureml-examples

Official community-driven Azure Machine Learning examples, tested with GitHub Actions.
https://docs.microsoft.com/azure/machine-learning
MIT License

train/pytorch/cifar-distributed not working #675

Open · ManojBableshwar opened this issue 3 years ago

ManojBableshwar commented 3 years ago

```
[2021-08-17T22:56:28.664111] Starting Linux command : python train.py --epochs 1 --data-dir /mnt/batch/tasks/shared/LS_root/jobs/opendatasetspmworkspace/azureml/6215701e-b1ef-42d0-91d1-864583d0dbbd/wd/cifar_65d20ecd-eef7-471c-b271-3b2cfa019269

8dbe9e7102e4446db27ef6fb1a2533e9000001:176:176 [0] NCCL INFO Bootstrap : Using [0]eth0:10.0.0.5<0>
8dbe9e7102e4446db27ef6fb1a2533e9000001:176:176 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
8dbe9e7102e4446db27ef6fb1a2533e9000001:176:176 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
8dbe9e7102e4446db27ef6fb1a2533e9000001:176:176 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>
8dbe9e7102e4446db27ef6fb1a2533e9000001:176:176 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.0

8dbe9e7102e4446db27ef6fb1a2533e9000001:176:194 [0] init.cc:573 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1cb00000
8dbe9e7102e4446db27ef6fb1a2533e9000001:176:194 [0] NCCL INFO init.cc:840 -> 5
8dbe9e7102e4446db27ef6fb1a2533e9000001:176:194 [0] NCCL INFO group.cc:73 -> 5 [Async thread]
Traceback (most recent call last):
  File "train.py", line 252, in <module>
    main(args)
  File "train.py", line 132, in main
    torch.distributed.init_process_group(backend="nccl")
  File "/azureml-envs/pytorch-1.7/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/azureml-envs/pytorch-1.7/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370156314/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
[2021-08-17T22:56:33.764585] Command finished with return code 1

[2021-08-17T22:56:33.765429] The experiment failed with exit code: 1. Finalizing run...
Cleaning up all outstanding Run operations, waiting 900.0 seconds
1 items cleaning up...
Cleanup took 0.04428815841674805 seconds
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/jobs/opendatasetspmworkspace/azureml/6215701e-b1ef-42d0-91d1-864583d0dbbd/wd/azureml/6215701e-b1ef-42d0-91d1-864583d0dbbd/azureml-setup/context_manager_injector.py", line 454, in <module>
    execute_with_context(cm_objects, options.invocation)
  File "/mnt/batch/tasks/shared/LS_root/jobs/opendatasetspmworkspace/azureml/6215701e-b1ef-42d0-91d1-864583d0dbbd/wd/azureml/6215701e-b1ef-42d0-91d1-864583d0dbbd/azureml-setup/context_manager_injector.py", line 235, in execute_with_context
    process_return_code(signedReturnCode)
  File "/mnt/batch/tasks/shared/LS_root/jobs/opendatasetspmworkspace/azureml/6215701e-b1ef-42d0-91d1-864583d0dbbd/wd/azureml/6215701e-b1ef-42d0-91d1-864583d0dbbd/azureml-setup/context_manager_injector.py", line 355, in process_return_code
    sys.exit(returnCode)
SystemExit: 1

[2021-08-17T22:56:33.924128] Finished context manager injector with SystemExit exception.
```
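For context on the failure above: the `NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1cb00000` line means two worker processes resolved to the same physical GPU before `init_process_group(backend="nccl")`, which NCCL rejects as "invalid usage". This commonly happens when the launcher starts more processes per node than the node has GPUs, or when the training script never pins each process to its local rank's device. A minimal sketch of that mapping check, assuming the usual local-rank convention (`device_for_local_rank` is a hypothetical helper, not part of the example repo):

```python
def device_for_local_rank(local_rank: int, gpus_per_node: int) -> int:
    """Map a process's local rank to a distinct CUDA device index.

    Raises if the launcher started more processes on this node than the
    node has GPUs -- the situation NCCL surfaces as "Duplicate GPU
    detected" when multiple ranks end up on the same device.
    """
    if local_rank >= gpus_per_node:
        raise RuntimeError(
            f"local rank {local_rank} has no dedicated GPU "
            f"(node has only {gpus_per_node}); reduce processes per node"
        )
    return local_rank

# In a DDP training script one would typically pin the device before
# init_process_group, e.g. (assuming a LOCAL_RANK-style env variable):
#   local_rank = int(os.environ["LOCAL_RANK"])
#   torch.cuda.set_device(device_for_local_rank(local_rank, torch.cuda.device_count()))
```

On a single-GPU node, launching two NCCL-backed processes makes both ranks fall onto device 0, which matches the error signature in the log above.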

lostmygithubaccount commented 3 years ago

this example is passing consistently: https://github.com/Azure/azureml-examples/actions?query=workflow%3Acli-scripts-train

can you provide more information about your Workspace and other factors? in particular:

if it's not easy to debug from one of these options, we probably need an ICM to have a team investigate

ManojBableshwar commented 3 years ago

Compute: Standard_NC6 (6 cores, 56 GB RAM, 380 GB disk)

Workspace info: "subscription_id": "21d8f407-c4c4-452e-87a4-e609bfb86248", "resource_group": "OpenDatasetsPMRG", "workspace_name": "OpenDatasetsPMWorkspace"

region: eastus2

preview/experimental features: pipeline preview enabled

datastore: workspace default (blob, I believe)
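Worth noting about the compute reported above: Standard_NC6 is a single-GPU SKU, so a distributed job configured with more than one NCCL process per node would reproduce exactly the "Duplicate GPU detected" failure. A hedged pre-submission sanity check is sketched below; the per-SKU GPU counts are illustrative (verify against the Azure VM size documentation), and `validate_process_count` is a hypothetical helper, not part of the example repo:

```python
# Illustrative GPU counts for a few NC-series SKUs (verify against Azure docs).
GPUS_PER_SKU = {
    "Standard_NC6": 1,
    "Standard_NC12": 2,
    "Standard_NC24": 4,
}

def validate_process_count(vm_size: str, process_count_per_node: int) -> int:
    """Check that each node has one GPU per local process.

    For NCCL-backed DDP, process_count_per_node must not exceed the
    node's GPU count; otherwise ranks share a device and NCCL aborts.
    """
    gpus = GPUS_PER_SKU.get(vm_size)
    if gpus is None:
        raise ValueError(f"unknown VM size: {vm_size}")
    if process_count_per_node > gpus:
        raise ValueError(
            f"{vm_size} has {gpus} GPU(s); cannot run "
            f"{process_count_per_node} NCCL process(es) per node"
        )
    return process_count_per_node
```

Under this check, scaling out on Standard_NC6 would mean one process per node across multiple nodes, or moving to a multi-GPU SKU for multiple processes per node.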