[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[2024-07-24 07:51:02,930][__main__][CRITICAL] - Training failed due to NCCL communicator was aborted on rank 3. :
Traceback (most recent call last):
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
    self.train_loop.run_training_epoch()
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/training_loop.py", line 577, in run_training_epoch
    self.trainer.run_evaluation(on_epoch=True)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 752, in run_evaluation
    self.evaluation_loop.on_evaluation_end()
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 100, in on_evaluation_end
    self.trainer.call_hook('on_validation_end', *args, **kwargs)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1095, in call_hook
    trainer_hook(*args, **kwargs)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/callback_hook.py", line 185, in on_validation_end
    callback.on_validation_end(self, self.lightning_module)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 212, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 259, in save_checkpoint
    self._save_top_k_checkpoints(trainer, pl_module, monitor_candidates)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 563, in _save_top_k_checkpoints
    self._update_best_and_save(current, epoch, step, trainer, pl_module, metrics)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 584, in _update_best_and_save
    filepath = self._get_metric_interpolated_filepath_name(ckpt_name_metrics, epoch, step, trainer, del_filepath)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 508, in _get_metric_interpolated_filepath_name
    while self.file_exists(filepath, trainer) and filepath != del_filepath:
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 627, in file_exists
    return trainer.training_type_plugin.broadcast(exists)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 294, in broadcast
    return self.dist.broadcast(obj)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/distributed/dist.py", line 33, in broadcast
    broadcast_object_list(obj, 0, group=group or _group.WORLD)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
    work.wait()
torch.distributed.DistBackendError: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4013, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/jovyan/zsp01/workplace/lama_3/bin/train.py", line 64, in main
    trainer.fit(training_model)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 670, in run_train
    self.train_loop.on_train_end()
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/training_loop.py", line 134, in on_train_end
    self.check_checkpoint_callback(should_update=True, is_last=True)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/training_loop.py", line 164, in check_checkpoint_callback
    cb.on_validation_end(self.trainer, model)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 212, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 259, in save_checkpoint
    self._save_top_k_checkpoints(trainer, pl_module, monitor_candidates)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 563, in _save_top_k_checkpoints
    self._update_best_and_save(current, epoch, step, trainer, pl_module, metrics)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 584, in _update_best_and_save
    filepath = self._get_metric_interpolated_filepath_name(ckpt_name_metrics, epoch, step, trainer, del_filepath)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 508, in _get_metric_interpolated_filepath_name
    while self.file_exists(filepath, trainer) and filepath != del_filepath:
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 627, in file_exists
    return trainer.training_type_plugin.broadcast(exists)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 294, in broadcast
    return self.dist.broadcast(obj)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/distributed/dist.py", line 33, in broadcast
    broadcast_object_list(obj, 0, group=group or _group.WORLD)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2136, in broadcast
    work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL communicator was aborted on rank 3.
Epoch 0: 95%|█████████████████████████████████████████████████████████████████████████▊ | 8692/9188 [30:57<01:45, 4.68it/s, loss=11.5, v_num=0]
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4013, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:626] [Rank 1] Work WorkNCCL(SeqNum=4013, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) timed out in blocking wait (TORCH_NCCL_BLOCKING_WAIT=1).
Epoch 0: 95%|█████████████████████████████████████████████████████████████████████████▊ | 8695/9188 [30:58<01:45, 4.68it/s, loss=11.5, v_num=0]
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 4013, last enqueued NCCL work: 4013, last completed NCCL work: 4012.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[2024-07-24 07:51:22,460][__main__][CRITICAL] - Training failed due to NCCL communicator was aborted on rank 1. :
Traceback (most recent call last):
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
    self.train_loop.run_training_epoch()
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/training_loop.py", line 577, in run_training_epoch
    self.trainer.run_evaluation(on_epoch=True)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 752, in run_evaluation
    self.evaluation_loop.on_evaluation_end()
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 100, in on_evaluation_end
    self.trainer.call_hook('on_validation_end', *args, **kwargs)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1095, in call_hook
    trainer_hook(*args, **kwargs)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/callback_hook.py", line 185, in on_validation_end
    callback.on_validation_end(self, self.lightning_module)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 212, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 259, in save_checkpoint
    self._save_top_k_checkpoints(trainer, pl_module, monitor_candidates)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 563, in _save_top_k_checkpoints
    self._update_best_and_save(current, epoch, step, trainer, pl_module, metrics)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 584, in _update_best_and_save
    filepath = self._get_metric_interpolated_filepath_name(ckpt_name_metrics, epoch, step, trainer, del_filepath)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 508, in _get_metric_interpolated_filepath_name
    while self.file_exists(filepath, trainer) and filepath != del_filepath:
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 627, in file_exists
    return trainer.training_type_plugin.broadcast(exists)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 294, in broadcast
    return self.dist.broadcast(obj)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/distributed/dist.py", line 33, in broadcast
    broadcast_object_list(obj, 0, group=group or _group.WORLD)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
    work.wait()
torch.distributed.DistBackendError: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4013, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/jovyan/zsp01/workplace/lama_3/bin/train.py", line 64, in main
    trainer.fit(training_model)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 670, in run_train
    self.train_loop.on_train_end()
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/training_loop.py", line 134, in on_train_end
    self.check_checkpoint_callback(should_update=True, is_last=True)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/training_loop.py", line 164, in check_checkpoint_callback
    cb.on_validation_end(self.trainer, model)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 212, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 259, in save_checkpoint
    self._save_top_k_checkpoints(trainer, pl_module, monitor_candidates)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 563, in _save_top_k_checkpoints
    self._update_best_and_save(current, epoch, step, trainer, pl_module, metrics)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 584, in _update_best_and_save
    filepath = self._get_metric_interpolated_filepath_name(ckpt_name_metrics, epoch, step, trainer, del_filepath)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 508, in _get_metric_interpolated_filepath_name
    while self.file_exists(filepath, trainer) and filepath != del_filepath:
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 627, in file_exists
    return trainer.training_type_plugin.broadcast(exists)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 294, in broadcast
    return self.dist.broadcast(obj)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/pytorch_lightning/distributed/dist.py", line 33, in broadcast
    broadcast_object_list(obj, 0, group=group or _group.WORLD)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2136, in broadcast
    work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.
Hey, I am getting an error when running the train.py file. I want to train LaMa on my own custom dataset.
(EnvPython3_8) PS E:\Image_Inpainting\ImageWebApp\ImageApp> python3.8 train.py -cn ablv2_work_md location=./my_dataset data.batch_size=10 run_title="Image Inpainting"
Detectron v2 is not installed
C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\core\default_element.py:122: UserWarning: In 'hydra/overrides': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
warnings.warn(
C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\core\default_element.py:122: UserWarning: In 'trainer/any_gpu_large_ssim_ddp_final_benchmark': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
warnings.warn(
C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\core\default_element.py:122: UserWarning: In 'evaluator/default_inpainted': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
warnings.warn(
C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\core\default_element.py:122: UserWarning: In 'visualizer/directory': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
warnings.warn(
C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\core\default_element.py:122: UserWarning: In 'optimizers/default_optimizers': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
warnings.warn(
C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\core\default_element.py:122: UserWarning: In 'discriminator/pix2pixhd_nlayer': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
warnings.warn(
C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\core\default_element.py:122: UserWarning: In 'generator/pix2pixhd_multidilated_catin_4dil_9b': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
warnings.warn(
C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\core\default_element.py:122: UserWarning: In 'data/abl-04-256-mh-dist': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
warnings.warn(
Traceback (most recent call last):
  File "train.py", line 77, in
    raise ex
  File "C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\_internal\utils.py", line 211, in run_and_report
    return func()
  File "C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\_internal\utils.py", line 368, in
    cfg = self.config_loader.load_configuration(
  File "C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\_internal\config_loader_impl.py", line 146, in load_configuration
    return self._load_configuration_impl(
  File "C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\_internal\config_loader_impl.py", line 239, in _load_configuration_impl
    defaults_list = create_defaults_list(
  File "C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\_internal\defaults_list.py", line 719, in create_defaults_list
    defaults, tree = _create_defaults_list(
  File "C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\_internal\defaults_list.py", line 689, in _create_defaults_list
    defaults_tree = _create_defaults_tree(
  File "C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\_internal\defaults_list.py", line 337, in _create_defaults_tree
    ret = _create_defaults_tree_impl(
  File "C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\_internal\defaults_list.py", line 420, in _create_defaults_tree_impl
    return _expand_virtual_root(repo, root, overrides, skip_missing)
  File "C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\_internal\defaults_list.py", line 262, in _expand_virtual_root
    subtree = _create_defaults_tree_impl(
  File "C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\_internal\defaults_list.py", line 539, in _create_defaults_tree_impl
    add_child(children, new_root)
  File "C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\_internal\defaults_list.py", line 482, in add_child
    subtree = _create_defaults_tree_impl(
  File "C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\_internal\defaults_list.py", line 429, in _create_defaults_tree_impl
    update_package_header(repo=repo, node=parent)
  File "C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\_internal\defaults_list.py", line 244, in update_package_header
    loaded = repo.load_config(config_path=node.get_config_path())
  File "C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\_internal\config_repository.py", line 337, in load_config
    ret = self.delegate.load_config(config_path=config_path)
  File "C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\_internal\config_repository.py", line 91, in load_config
    ret = source.load_config(config_path=config_path)
  File "C:\Users\spx016\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\hydra\_internal\core_plugins\file_config_source.py", line 28, in load_config
    header_text = f.read(512)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\codecs.py", line 322, in decode
Could you please help me? Thanks in advance.
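The Windows traceback above stops inside the codecs decode step while Hydra reads a config header (`file_config_source.py`, `header_text = f.read(512)`), which usually means one of the YAML configs is not valid UTF-8. The final exception line is cut off here, so this is only a guess. A small standalone check like the sketch below (the `configs` path is an assumption; point it at the config tree in your checkout) will show whether any file fails to decode:

```python
from pathlib import Path

# Assumed location of the Hydra config tree in the checkout; adjust as needed.
CONFIG_ROOT = Path("configs")

for path in sorted(CONFIG_ROOT.rglob("*.yaml")):
    try:
        # Hydra reads each config's package header as UTF-8 text.
        path.read_text(encoding="utf-8")
    except UnicodeDecodeError as err:
        print(f"Not valid UTF-8: {path} -> {err}")
```

Re-saving any file it flags as UTF-8 (without BOM) in your editor would be the obvious next step, assuming the decode error is indeed what the truncated traceback ends with.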
Hi, I encounter an NCCL timeout error at the end of each epoch during training. Here is part of the error message:
Epoch 6: 100%|██████████████████████████████████████████████████████████████████████████| 13988/13988 [2:18:37<00:00, 1.68it/s, loss=4.75, v_num=0]
Epoch 6, global step 49999: val_ssim_fid100_f1_total_mean reached 0.91746 (best 0.91746), saving model to "/home/jovyan/zsp01/workplace/lama/experiments/root_2024-07-24_00-30-46_trainlama-fourier/models/epoch=6-step=49999.ckpt" as top 5
Epoch 7: 0%| | 0/13988 [00:00<?, ?it/s, loss=4.75, v_num=0]
[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=100015, OpType=ALLREDUCE, NumelIn=12673, NumelOut=12673, Timeout(ms)=600000) ran for 600094 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 0] Timeout at NCCL work: 100015, last enqueued NCCL work: 100022, last completed NCCL work: 100014.
[rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=100015, OpType=ALLREDUCE, NumelIn=12673, NumelOut=12673, Timeout(ms)=600000) ran for 600094 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f87cacd6897 in /home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f87cbfb11b2 in /home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f87cbfb5fd0 in /home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f87cbfb731c in /home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f8817a68bf4 in /home/jovyan/zsp01/miniconda3/envs/lama-py3.10/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f88199c4609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f881978f133 in /lib/x86_64-linux-gnu/libc.so.6)
scripts_zsp/train_gaoping.sh: line 6: 5139 Aborted (core dumped) CUDA_VISIBLE_DEVICES=0,1 python bin/train.py -cn lama-fourier location=gaoping data.batch_size=40 +trainer.kwargs.resume_from_checkpoint=/home/jovyan/zsp01/workplace/lama/experiments/root_2024-07-23_14-44-09_trainlama-fourier/models/last.ckpt
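In both NCCL logs above the failure is the watchdog's default 600000 ms (10 minute) collective timeout: rank 0 is still finishing end-of-epoch validation/checkpointing while the other ranks sit in the ModelCheckpoint broadcast or the next all-reduce. If the checkpoint write on your storage legitimately takes that long, the usual knobs are the blocking-wait/async error handling environment variables and the process-group timeout itself. Below is a minimal sketch of those knobs, assuming a torchrun-style environment; note that in this repo PyTorch Lightning creates the process group itself, so the `init_process_group` call is only illustrative of where the timeout goes:

```python
import datetime
import os

import torch.distributed as dist

# Turn NCCL hangs into visible errors instead of silent stalls; older PyTorch
# releases spell these NCCL_BLOCKING_WAIT / NCCL_ASYNC_ERROR_HANDLING.
os.environ.setdefault("TORCH_NCCL_BLOCKING_WAIT", "1")
os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")

# Raise the collective timeout from the default 10 minutes so a slow
# end-of-epoch validation/checkpoint broadcast does not trip the watchdog.
# (Illustrative: Lightning normally makes this call for you.)
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(hours=2))
```

Newer Lightning versions also expose a timeout argument on their DDP strategy, which is the cleaner place for this if you upgrade; making the checkpoint path faster (local disk instead of slow shared storage) avoids the long broadcast wait in the first place.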