alibaba / EasyCV

An all-in-one toolkit for computer vision
Apache License 2.0

nccl timeout error when reading tfrecord format data from oss #107

Closed Cyanyanyan closed 2 years ago

Cyanyanyan commented 2 years ago

The error log looks like this:

Current pipeline object is no longer valid.
[E ProcessGroupNCCL.cpp:294] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808702 milliseconds before timing out.
Traceback (most recent call last):
  File "/home/pai/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/pai/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/pai/lib/python3.6/site-packages/easypai/torch/launch.py", line 428, in <module>
    main()
  File "/home/pai/lib/python3.6/site-packages/easypai/torch/launch.py", line 414, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/pai/lib/python3.6/site-packages/easypai/torch/launch.py", line 389, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/apsara/TempRoot/Odps/xxxxx/PyTorchWorker@l80g07299.ea120#0/workspace/python_bin', '-u', 'tools/train.py', '--local_rank=1', 'configs/metric_learning/resnet50_jpg_nopk_tfrecord.py', '--work_dir', 'oss://xxxx/mtl_tf/', '--load_from', 'oss://xxxx/r50_imagenet_epoch_100.pth', '--launcher', 'pytorch', '--fp16']' died with <Signals.SIGABRT: 6>.

Then I set os.environ["NCCL_DEBUG_SUBSYS"] = "ALL" and os.environ["NCCL_DEBUG"] = "INFO" and reran; the error log looks like this:

[Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808702 milliseconds before timing out.
Traceback (most recent call last):
  File "tools/train.py", line 293, in <module>
    main()
  File "tools/train.py", line 274, in main
    data_loaders = [dataset.get_dataloader()]
  File "/apsara/TempRoot/Odps/xxxx/PyTorchWorker@l80g07299.ea120#0/workspace/easycv/datasets/shared/dali_tfrecord_imagenet.py", line 166, in get_dataloader
    self.dali_pipe.build()
  File "/home/pai/lib/python3.6/site-packages/nvidia/dali/pipeline.py", line 478, in build
    self._pipe.Build(self._names_and_devices)
RuntimeError: Critical error when building pipeline: Error when constructing operator: _TFRecordReader encountered:
[/opt/dali/dali/operators/reader/loader/indexed_file_loader.h:105] Assert on "index_uris.size() == uris.size()" failed: Number of index files needs to match the number of data files
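For context, the assert comes from DALI's TFRecord reader, which takes a list of record files (path) and a matching list of index files (index_path) and requires the two lists to be the same length. A minimal sketch of such a reader; the feature keys and shard names below are assumptions, not taken from the EasyCV config:

```python
import nvidia.dali.ops as ops
import nvidia.dali.tfrecord as tfrec
from nvidia.dali.pipeline import Pipeline


class TFRecordPipe(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, records, indexes):
        super(TFRecordPipe, self).__init__(batch_size, num_threads, device_id)
        # DALI requires exactly one .idx file per .tfrecord shard,
        # i.e. len(records) == len(indexes), or build() fails with the assert above.
        self.input = ops.TFRecordReader(
            path=records,        # e.g. ["train-00000.tfrecord", "train-00001.tfrecord"]
            index_path=indexes,  # e.g. ["train-00000.idx", "train-00001.idx"]
            features={
                # hypothetical feature keys for an ImageNet-style tfrecord
                "image/encoded": tfrec.FixedLenFeature((), tfrec.string, ""),
                "image/class/label": tfrec.FixedLenFeature([1], tfrec.int64, -1),
            })

    def define_graph(self):
        inputs = self.input()
        return inputs["image/encoded"], inputs["image/class/label"]
```

With two .tfrecord shards but only one .idx file (or none), pipeline build() fails with exactly this assert.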

wenmengzhou commented 2 years ago

assert on "index_uris.size() == uris.size()" failed: Number of index files needs to match the number of data files

It seems the number of input tfrecord files does not match the number of tfrecord .idx files.
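If so, one way to repair the mismatch is to generate one index file per shard with the tfrecord2idx script that ships with DALI. A sketch, assuming the shards are available in a local directory (the data/ path is hypothetical; OSS-hosted shards would need to be listed or downloaded first):

```python
import os
import subprocess

data_dir = "data"  # hypothetical local copy of the tfrecord shards
records = sorted(os.path.join(data_dir, f)
                 for f in os.listdir(data_dir) if f.endswith(".tfrecord"))

indexes = []
for rec in records:
    idx = rec + ".idx"
    if not os.path.exists(idx):
        # tfrecord2idx <tfrecord> <index> is a small script installed with DALI
        subprocess.check_call(["tfrecord2idx", rec, idx])
    indexes.append(idx)

# This is exactly the condition DALI asserts on when building the pipeline.
assert len(records) == len(indexes)
```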