Closed: CharlesGaydon closed this issue 2 months ago.
Preparing test set...: 36%|███▌ | 24/67 [23:16<30:00, 41.87s/it]
Preparing test set...: 37%|███▋ | 25/67 [24:03<30:15, 43.23s/it]
Preparing test set...: 39%|███▉ | 26/67 [24:51<30:30, 44.65s/it]
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800008 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:541] [Rank 1] Work WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) timed out in blocking wait (NCCL_BLOCKING_WAIT=1).
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800008 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:541] [Rank 2] Work WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) timed out in blocking wait (NCCL_BLOCKING_WAIT=1).
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
Error executing job with overrides: ['task.task_name=test', 'datamodule.data_dir=/mnt/store-lidarhd/projet-LHD/JEUX_DE_DONNEES_DE_TRAVAIL/Jeu_evaluation/1-Jeu_reference_classe-colorized/', 'datamodule.split_csv_path=/mnt/store-lidarhd/projet-LHD/JEUX_DE_DONNEES_DE_TRAVAIL/Jeu_evaluation/1-Jeu_reference_classe-colorized/split.csv', 'datamodule.hdf5_file_path=/var/tmp/lidar/eval67.hdf5', 'dataset_description=20230601_lidarhd_pacasam_dataset', 'datamodule.tile_width=1000', 'datamodule.epsg=2154', 'experiment=RandLaNet_base_run_FR', 'logger.comet.experiment_name=DATAPAPER-LidarHD-20240416_100k_fractal-6GPUs-Eval67', '++trainer.num_nodes=1', '++trainer.accelerator=gpu', '++trainer.devices=3', 'model.ckpt_path=/mnt/common/hdd/home/CGaydon/experiments/DataPaper-LidarHD/runs/20240418-20240416_100k_fractal-Train-6GPUs/20240418-20240416_100k_fractal-epoch21-Myria3DV3.3.8.ckpt']
Traceback (most recent call last):
File "/mnt/common/hdd/home/CGaydon/repositories/myria3d/run.py", line 121, in <module>
launch_train()
File "/var/data/shared_envs/anaconda3/envs/myria3d/lib/python3.9/site-packages/hydra/main.py", line 48, in decorated_main
_run_hydra(
File "/var/data/shared_envs/anaconda3/envs/myria3d/lib/python3.9/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
run_and_report(
File "/var/data/shared_envs/anaconda3/envs/myria3d/lib/python3.9/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
raise ex
File "/var/data/shared_envs/anaconda3/envs/myria3d/lib/python3.9/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
return func()
File "/var/data/shared_envs/anaconda3/envs/myria3d/lib/python3.9/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
lambda: hydra.run(
File "/var/data/shared_envs/anaconda3/envs/myria3d/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 111, in run
_ = ret.return_value
File "/var/data/shared_envs/anaconda3/envs/myria3d/lib/python3.9/site-packages/hydra/core/utils.py", line 233, in return_value
raise self._return_value
File "/var/data/shared_envs/anaconda3/envs/myria3d/lib/python3.9/site-packages/hydra/core/utils.py", line 160, in run_job
ret.return_value = task_function(task_cfg)
File "/mnt/common/hdd/home/CGaydon/repositories/myria3d/run.py", line 57, in launch_train
return train(config)
File "/mnt/common/hdd/home/CGaydon/repositories/myria3d/myria3d/train.py", line 156, in train
trainer.test(model=model, datamodule=datamodule, ckpt_path=config.model.ckpt_path)
File "/var/data/shared_envs/anaconda3/envs/myria3d/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 754, in test
return call._call_and_handle_interrupt(
File "/var/data/shared_envs/anaconda3/envs/myria3d/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/var/data/shared_envs/anaconda3/envs/myria3d/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
File "/var/data/shared_envs/anaconda3/envs/myria3d/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 794, in _test_impl
results = self._run(model, ckpt_path=ckpt_path)
File "/var/data/shared_envs/anaconda3/envs/myria3d/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 949, in _run
call._call_setup_hook(self) # allow user to set up LightningModule in accelerator environment
File "/var/data/shared_envs/anaconda3/envs/myria3d/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 89, in _call_setup_hook
trainer.strategy.barrier("pre_setup")
File "/var/data/shared_envs/anaconda3/envs/myria3d/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 297, in barrier
torch.distributed.barrier(device_ids=self.determine_ddp_device_ids())
File "/var/data/shared_envs/anaconda3/envs/myria3d/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/var/data/shared_envs/anaconda3/envs/myria3d/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3703, in barrier
work.wait()
RuntimeError: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800008 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[The same "Error executing job with overrides" line and traceback are repeated verbatim for the other failing rank, ending with:]
RuntimeError: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800008 milliseconds before timing out.
srun: launch/slurm: _task_finish: Received task exit notification for 2 tasks of StepId=6053.0 (status=0x0100).
srun: error: DEL2212S020: tasks 1-2: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=6053.0
srun: NET: slurm_set_addr: called with port='6817' host='10.128.41.107'
srun: NET: slurm_set_addr: update addr. addr='10.128.41.107:6817'
srun: debug: task 1 done
Preparing test set...: 40%|████ | 27/67 [25:36<29:50, 44.77s/it]
Preparing test set...: 42%|████▏ | 28/67 [26:17<28:30, 43.86s/it]
Preparing test set...: 43%|████▎ | 29/67 [26:29<21:38, 34.16s/it]
Preparing test set...: 45%|████▍ | 30/67 [27:15<23:13, 37.65s/it]
Preparing test set...: 46%|████▋ | 31/67 [29:02<35:12, 58.67s/it]
slurmstepd: error: *** STEP 6053.0 ON DEL2212S020 CANCELLED AT 2024-04-23T12:18:08 ***
srun: debug: task 2 done
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=6053.0 (status=0x000f).
srun: error: DEL2212S020: task 0: Terminated
srun: debug: task 0 done
srun: Force Terminated StepId=6053.0
srun: debug: IO thread exiting
This may be related to the following flags, which were set in the Slurm script:

# Distributed error handlers (optional)
export NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1

Removing these flags makes the issue disappear.
cf. https://stackoverflow.com/a/73368091/8086033
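An alternative to dropping the flags might be to raise the collective timeout so it covers the dataset preparation time. A minimal sketch of the arithmetic; the two-hour value is a hypothetical choice, and the commented-out `DDPStrategy`/`init_process_group` lines are assumptions about where such a timeout could be plugged in, not a verified fix:

```python
from datetime import timedelta

# NCCL's default collective timeout is 30 minutes, matching the
# Timeout(ms)=1800000 reported in the log above.
default_timeout = timedelta(milliseconds=1_800_000)
assert default_timeout == timedelta(minutes=30)

# Hypothetical larger timeout covering the 30+ min dataset preparation.
prep_timeout = timedelta(hours=2)

# Sketch of where this could be passed (untested assumption):
#   from pytorch_lightning.strategies import DDPStrategy
#   trainer = Trainer(strategy=DDPStrategy(timeout=prep_timeout), ...)
# or, with raw torch.distributed:
#   torch.distributed.init_process_group("nccl", timeout=prep_timeout)
print(int(prep_timeout.total_seconds() * 1000))  # timeout in ms
```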
This happens when using task.task_name=fit: GPUs are provisioned but then sit idle for more than 30 minutes while the process is busy preparing the dataset. Not sure how to fix this...

Workaround 1: launch run.py again until data preparation completes -> seems to generate errors, since the writing of the HDF5 file was brutally interrupted.
Workaround 2: first call run.py on every node with task.task_name=create_hdf5, then call it with task.task_name=fit -> may encounter the same issue.
Workaround 3: first call run.py on a single node with task.task_name=create_hdf5, then run on multiple nodes with task.task_name=fit to train the model.

NB: dataset preparation has to happen on each node in DDP, in the prepare_data method of the datamodule.
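The failure mode of Workaround 1 (a half-written HDF5 file after a brutal interruption) could be mitigated by making the preparation step idempotent with a completion marker. A minimal stdlib sketch; the function name, the `.done` marker convention, and the placeholder write are all hypothetical illustrations, not part of Myria3D:

```python
import os

def prepare_hdf5(path: str) -> str:
    """Idempotently prepare the dataset file at `path`.

    A `.done` marker is written only after the file is complete, so a
    run interrupted mid-write leaves no marker and is redone instead of
    being mistaken for a finished dataset.
    """
    done_marker = path + ".done"
    if os.path.exists(path) and os.path.exists(done_marker):
        return "skipped"  # previous run finished cleanly
    # ... long-running dataset preparation writing `path` goes here ...
    with open(path, "wb") as f:
        f.write(b"")  # stand-in for the real HDF5 writing
    with open(done_marker, "w") as f:
        f.write("ok")
    return "prepared"
```

A first call returns "prepared"; re-running after a clean finish returns "skipped", while a run interrupted before the marker was written is simply redone on the next launch.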