The error log looks like this:
Current pipeline object is no longer valid.
[E ProcessGroupNCCL.cpp:294] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808702 milliseconds before timing out.
Traceback (most recent call last):
File "/home/pai/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/pai/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/pai/lib/python3.6/site-packages/easypai/torch/launch.py", line 428, in
main()
File "/home/pai/lib/python3.6/site-packages/easypai/torch/launch.py", line 414, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/pai/lib/python3.6/site-packages/easypai/torch/launch.py", line 389, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/apsara/TempRoot/Odps/xxxxx/PyTorchWorker@l80g07299.ea120#0/workspace/python_bin', '-u', 'tools/train.py', '--local_rank=1', 'configs/metric_learning/resnet50_jpg_nopk_tfrecord.py', '--work_dir', 'oss://xxxx/mtl_tf/', '--load_from', 'oss://xxxx/r50_imagenet_epoch_100.pth', '--launcher', 'pytorch', '--fp16']' died with <Signals.SIGABRT: 6>.
Then I set
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"
os.environ["NCCL_DEBUG"] = "INFO"
and reran; the error log now looks like this:
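The same debug settings can also be exported in the launching shell, so every spawned rank inherits them; `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables:

```shell
# Enable verbose NCCL logging for all subsystems before launching training.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
```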
[Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808702 milliseconds before timing out.
Traceback (most recent call last):
File "tools/train.py", line 293, in
main()
File "tools/train.py", line 274, in main
data_loaders = [dataset.get_dataloader()]
File "/apsara/TempRoot/Odps/xxxx/PyTorchWorker@l80g07299.ea120#0/workspace/easycv/datasets/shared/dali_tfrecord_imagenet.py", line 166, in get_dataloader
self.dali_pipe.build()
File "/home/pai/lib/python3.6/site-packages/nvidia/dali/pipeline.py", line 478, in build
self._pipe.Build(self._names_and_devices)
RuntimeError: Critical error when building pipeline:
Error when constructing operator: _TFRecordReader encountered:
[/opt/dali/dali/operators/reader/loader/indexed_file_loader.h:105] Assert on "indexuris.size() == uris.size()" failed: Number of index files needs to match the number of data files
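The DALI assert at the bottom appears to be the actual root cause: the _TFRecordReader was handed a different number of index (.idx) files than TFRecord data files, and the NCCL allreduce timeout is likely just the other ranks waiting on the rank whose pipeline failed to build. A pre-flight check along these lines (a hypothetical helper, not part of easycv or DALI) surfaces the mismatch with a readable message before the pipeline is built:

```python
def pair_tfrecords_with_indexes(tfrecord_paths, index_paths):
    """Fail fast before DALI's C++ assert fires.

    DALI's TFRecord reader requires exactly one .idx file per TFRecord
    shard; missing indexes can be generated with DALI's bundled
    `tfrecord2idx` script (one data file in, one index file out).
    """
    if len(index_paths) != len(tfrecord_paths):
        raise ValueError(
            "Number of index files (%d) must match the number of data "
            "files (%d)" % (len(index_paths), len(tfrecord_paths)))
    # Sort both lists so shard N pairs with its own index file.
    return list(zip(sorted(tfrecord_paths), sorted(index_paths)))
```

Checking this right after listing the dataset files (and regenerating any missing .idx files with `tfrecord2idx`) avoids the opaque pipeline-build failure and the downstream NCCL watchdog abort.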