AAAtourist opened this issue 1 month ago
I am running this project on RTX A6000 (48 GB) GPUs and hit some errors. My command is:
torchrun --nproc_per_node=4 main.py configs/training/train_resnet18_w2to6_a2to6.yaml
nvidia-smi

NVIDIA-SMI 535.183.01    Driver Version: 535.183.01    CUDA Version: 12.2
(Persistence-M: Off, Disp.A: Off, Volatile Uncorr. ECC: Off, Compute M.: Default, MIG M.: N/A for all GPUs)

GPU  Name              Bus-Id            Fan  Temp  Perf  Pwr:Usage/Cap  Memory-Usage         GPU-Util
  0  NVIDIA RTX A6000  00000000:01:00.0  37%   64C   P2    123W / 300W   2374MiB / 49140MiB   32%
  1  NVIDIA RTX A6000  00000000:25:00.0  30%   57C   P2    107W / 300W   1984MiB / 49140MiB   35%
  2  NVIDIA RTX A6000  00000000:41:00.0  35%   63C   P2    129W / 300W   1980MiB / 49140MiB   34%
  3  NVIDIA RTX A6000  00000000:61:00.0  30%   58C   P2    130W / 300W   1958MiB / 49140MiB   34%
  4  NVIDIA RTX A6000  00000000:81:00.0  36%   63C   P2    135W / 300W   1994MiB / 49140MiB   31%
  5  NVIDIA RTX A6000  00000000:A1:00.0  30%   58C   P2    136W / 300W   1980MiB / 49140MiB   31%
  6  NVIDIA RTX A6000  00000000:C1:00.0  36%   64C   P2    129W / 300W   1960MiB / 49140MiB   23%
  7  NVIDIA RTX A6000  00000000:E1:00.0  35%   63C   P2    129W / 300W   1906MiB / 49140MiB   31%

Processes:
GPU  PID      Type  Process name  GPU Memory Usage
  0  3135477  C     python        2364MiB
  1  3135477  C     python        1974MiB
  2  3135477  C     python        1972MiB
  3  3135477  C     python        1952MiB
  4  3135477  C     python        1972MiB
  5  3135477  C     python        1972MiB
  6  3135477  C     python        1938MiB
  7  3135477  C     python        1898MiB
I have attached the error below:
W0731 12:48:42.765966 140248971256960 torch/distributed/run.py:779] W0731 12:48:42.765966 140248971256960 torch/distributed/run.py:779] ***************************************** W0731 12:48:42.765966 140248971256960 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0731 12:48:42.765966 140248971256960 torch/distributed/run.py:779] ***************************************** Munch({'name': 'resnet18', 'output_dir': 'training', 'training_device': 'gpu', 'num_random_path': 3, 'target_bits': [6, 5, 4, 3, 2], 'post_training_batchnorm_calibration': True, 'information_distortion_mitigation': True, 'enable_dynamic_bit_training': True, 'kd': False, 'dataloader': Munch({'dataset': 'imagenet', 'num_classes': 1000, 'path': '/data/dataset/imagenet', 'batch_size': 128, 'workers': 8, 'deterministic': True}), 'resume': Munch({'path': None, 'lean': False}), 'log': Munch({'num_best_scores': 3, 'print_freq': 20}), 'arch': 'resnet18', 'pre_trained': True, 'quan': Munch({'act': Munch({'mode': 'lsq', 'bit': 2, 'per_channel': False, 'symmetric': False, 'all_positive': True}), 'weight': Munch({'mode': 'lsq', 'bit': 2, 'per_channel': False, 'symmetric': False, 'all_positive': False}), 'excepts': Munch({'excepts_bits_width': 8, 'conv1': Munch({'act': Munch({'bit': None, 'all_positive': False}), 'weight': Munch({'bit': None})}), 'bn1': Munch({'act': Munch({'bit': None}), 'weight': Munch({'bit': None})}), 'fc': Munch({'act': Munch({'bit': None}), 'weight': Munch({'bit': None})})})}), 'eval': False, 'epochs': 150, 'smoothing': 0.0, 'scale_gradient': False, 'opt': 'sgd', 'lr': 0.04, 'momentum': 0.9, 'weight_decay': 2.5e-05, 'adaptive_region_weight_decay': 0, 'sched': 'cosine', 'min_lr': 1e-05, 'decay_rate': 0.1, 'warmup_epochs': 5, 'warmup_lr': 1e-05, 'decay_epochs': 30, 'cooldown_epochs': 0, 'ema_decay': 0.9997, 'local_rank': 0, 'split_aw_cands': False}) Munch({'name': 'resnet18', 'output_dir': 'training', 'training_device': 'gpu', 'num_random_path': 3, 'target_bits': [6, 5, 4, 3, 2], 'post_training_batchnorm_calibration': True, 'information_distortion_mitigation': True, 'enable_dynamic_bit_training': True, 'kd': False, 'dataloader': Munch({'dataset': 'imagenet', 'num_classes': 1000, 'path': '/data/dataset/imagenet', 'batch_size': 128, 'workers': 8, 'deterministic': True}), 'resume': Munch({'path': None, 'lean': False}), 'log': Munch({'num_best_scores': 3, 'print_freq': 20}), 'arch': 'resnet18', 'pre_trained': True, 'quan': Munch({'act': Munch({'mode': 'lsq', 'bit': 2, 'per_channel': False, 'symmetric': False, 'all_positive': True}), 'weight': Munch({'mode': 'lsq', 'bit': 2, 'per_channel': False, 'symmetric': False, 'all_positive': False}), 'excepts': Munch({'excepts_bits_width': 8, 'conv1': Munch({'act': Munch({'bit': None, 'all_positive': False}), 'weight': Munch({'bit': None})}), 'bn1': Munch({'act': Munch({'bit': None}), 'weight': Munch({'bit': None})}), 'fc': Munch({'act': Munch({'bit': None}), 'weight': Munch({'bit': None})})})}), 'eval': False, 'epochs': 150, 'smoothing': 0.0, 'scale_gradient': False, 'opt': 'sgd', 'lr': 0.04, 'momentum': 0.9, 'weight_decay': 2.5e-05, 'adaptive_region_weight_decay': 0, 'sched': 'cosine', 'min_lr': 1e-05, 'decay_rate': 0.1, 'warmup_epochs': 5, 'warmup_lr': 1e-05, 'decay_epochs': 30, 'cooldown_epochs': 0, 'ema_decay': 0.9997, 'local_rank': 0, 'split_aw_cands': False}) Munch({'name': 'resnet18', 
'output_dir': 'training', 'training_device': 'gpu', 'num_random_path': 3, 'target_bits': [6, 5, 4, 3, 2], 'post_training_batchnorm_calibration': True, 'information_distortion_mitigation': True, 'enable_dynamic_bit_training': True, 'kd': False, 'dataloader': Munch({'dataset': 'imagenet', 'num_classes': 1000, 'path': '/data/dataset/imagenet', 'batch_size': 128, 'workers': 8, 'deterministic': True}), 'resume': Munch({'path': None, 'lean': False}), 'log': Munch({'num_best_scores': 3, 'print_freq': 20}), 'arch': 'resnet18', 'pre_trained': True, 'quan': Munch({'act': Munch({'mode': 'lsq', 'bit': 2, 'per_channel': False, 'symmetric': False, 'all_positive': True}), 'weight': Munch({'mode': 'lsq', 'bit': 2, 'per_channel': False, 'symmetric': False, 'all_positive': False}), 'excepts': Munch({'excepts_bits_width': 8, 'conv1': Munch({'act': Munch({'bit': None, 'all_positive': False}), 'weight': Munch({'bit': None})}), 'bn1': Munch({'act': Munch({'bit': None}), 'weight': Munch({'bit': None})}), 'fc': Munch({'act': Munch({'bit': None}), 'weight': Munch({'bit': None})})})}), 'eval': False, 'epochs': 150, 'smoothing': 0.0, 'scale_gradient': False, 'opt': 'sgd', 'lr': 0.04, 'momentum': 0.9, 'weight_decay': 2.5e-05, 'adaptive_region_weight_decay': 0, 'sched': 'cosine', 'min_lr': 1e-05, 'decay_rate': 0.1, 'warmup_epochs': 5, 'warmup_lr': 1e-05, 'decay_epochs': 30, 'cooldown_epochs': 0, 'ema_decay': 0.9997, 'local_rank': 0, 'split_aw_cands': False}) Munch({'name': 'resnet18', 'output_dir': 'training', 'training_device': 'gpu', 'num_random_path': 3, 'target_bits': [6, 5, 4, 3, 2], 'post_training_batchnorm_calibration': True, 'information_distortion_mitigation': True, 'enable_dynamic_bit_training': True, 'kd': False, 'dataloader': Munch({'dataset': 'imagenet', 'num_classes': 1000, 'path': '/data/dataset/imagenet', 'batch_size': 128, 'workers': 8, 'deterministic': True}), 'resume': Munch({'path': None, 'lean': False}), 'log': Munch({'num_best_scores': 3, 'print_freq': 20}), 'arch': 'resnet18', 'pre_trained': True, 'quan': Munch({'act': Munch({'mode': 'lsq', 'bit': 2, 'per_channel': False, 'symmetric': False, 'all_positive': True}), 'weight': Munch({'mode': 'lsq', 'bit': 2, 'per_channel': False, 'symmetric': False, 'all_positive': False}), 'excepts': Munch({'excepts_bits_width': 8, 'conv1': Munch({'act': Munch({'bit': None, 'all_positive': False}), 'weight': Munch({'bit': None})}), 'bn1': Munch({'act': Munch({'bit': None}), 'weight': Munch({'bit': None})}), 'fc': Munch({'act': Munch({'bit': None}), 'weight': Munch({'bit': None})})})}), 'eval': False, 'epochs': 150, 'smoothing': 0.0, 'scale_gradient': False, 'opt': 'sgd', 'lr': 0.04, 'momentum': 0.9, 'weight_decay': 2.5e-05, 'adaptive_region_weight_decay': 0, 'sched': 'cosine', 'min_lr': 1e-05, 'decay_rate': 0.1, 'warmup_epochs': 5, 'warmup_lr': 1e-05, 'decay_epochs': 30, 'cooldown_epochs': 0, 'ema_decay': 0.9997, 'local_rank': 0, 'split_aw_cands': False}) INFO - Log file for this run: /data/user/tourist/retraining-free-quantization/training/resnet18/resnet18.log INFO - TensorBoard data directory: /data/user/tourist/retraining-free-quantization/training/resnet18/tb_runs /home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/timm/models/_factory.py:117: UserWarning: Mapping deprecated model name gluon_resnet18_v1b to current resnet18.gluon_in1k. model = create_fn( /home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/timm/models/_factory.py:117: UserWarning: Mapping deprecated model name gluon_resnet18_v1b to current resnet18.gluon_in1k. 
model = create_fn( /home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/timm/models/_factory.py:117: UserWarning: Mapping deprecated model name gluon_resnet18_v1b to current resnet18.gluon_in1k. model = create_fn( /home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/timm/models/_factory.py:117: UserWarning: Mapping deprecated model name gluon_resnet18_v1b to current resnet18.gluon_in1k. model = create_fn( layer QuanConv2d not using splited a_w cands!layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! INFO - Created `resnet18` model for `imagenet` dataset Use pre-trained model = True INFO - Inserted quantizers into the original model layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! 
layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands!
[rank1]: Traceback (most recent call last):
[rank1]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 232, in <module>
[rank1]:     main()
[rank1]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 79, in main
[rank1]:     model = wrap_the_model_with_ddp(model)
[rank1]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 77, in <lambda>
[rank1]:     wrap_the_model_with_ddp = lambda x: DistributedDataParallel(x.cuda(), device_ids=[configs.local_rank], find_unused_parameters=True)
[rank1]:   File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__
[rank1]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank1]:   File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes
[rank1]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.20.5
[rank1]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank1]: Last error:
[rank1]: Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1000
[rank2]: Traceback (most recent call last):
[rank2]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 232, in <module>
[rank2]:     main()
[rank2]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 79, in main
[rank2]:     model = wrap_the_model_with_ddp(model)
[rank2]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 77, in <lambda>
[rank2]:     wrap_the_model_with_ddp = lambda x: DistributedDataParallel(x.cuda(), device_ids=[configs.local_rank], find_unused_parameters=True)
[rank2]:   File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__
[rank2]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank2]:   File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes
[rank2]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank2]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.20.5
[rank2]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank2]: Last error:
[rank2]: Duplicate GPU detected : rank 2 and rank 0 both on CUDA device 1000
[rank3]: Traceback (most recent call last):
[rank3]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 232, in <module>
[rank3]:     main()
[rank3]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 79, in main
[rank3]:     model = wrap_the_model_with_ddp(model)
[rank3]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 77, in <lambda>
[rank3]:     wrap_the_model_with_ddp = lambda x: DistributedDataParallel(x.cuda(), device_ids=[configs.local_rank], find_unused_parameters=True)
[rank3]:   File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__
[rank3]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank3]:   File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes
[rank3]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.20.5
[rank3]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank3]: Last error:
[rank3]: Duplicate GPU detected : rank 3 and rank 0 both on CUDA device 1000
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 232, in <module>
[rank0]:     main()
[rank0]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 79, in main
[rank0]:     model = wrap_the_model_with_ddp(model)
[rank0]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 77, in <lambda>
[rank0]:     wrap_the_model_with_ddp = lambda x: DistributedDataParallel(x.cuda(), device_ids=[configs.local_rank], find_unused_parameters=True)
[rank0]:   File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__
[rank0]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank0]:   File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes
[rank0]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.20.5
[rank0]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank0]: Last error:
[rank0]: Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1000
W0731 12:48:48.435105 140248971256960 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1049238 closing signal SIGTERM
W0731 12:48:48.435611 140248971256960 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1049240 closing signal SIGTERM
E0731 12:48:48.550357 140248971256960 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 1049239) of binary: /home/tourist/.conda/envs/RFQuant/bin/python
Traceback (most recent call last):
  File "/home/tourist/.conda/envs/RFQuant/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-31_12:48:48
  host      : leo
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1049241)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-31_12:48:48
  host      : leo
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1049239)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
This error of yours is very simple. I have already gotten the code running; feel free to contact me on QQ: 2475684972.
INFO - >>>>>>>> Epoch 20
Bit-width candidates: [6, 5, 4, 3, 2]
INFO - Training: 1076 samples (8 per mini-batch)
INFO - Max Training [20][ 20/ 135] Loss 0.392159 QE Loss 0.000000 Distribution Loss 0.000000 IDM Loss 0.000000 Top1 84.375000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 0 Training [20][ 20/ 135] Loss 0.214411 QE Loss 0.000000 Distribution Loss -0.011015 IDM Loss 0.268419 Top1 84.375000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 1 Training [20][ 20/ 135] Loss 0.211393 QE Loss 0.000000 Distribution Loss -0.013542 IDM Loss 0.315946 Top1 83.750000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 2 Training [20][ 20/ 135] Loss 0.198101 QE Loss 0.000000 Distribution Loss -0.009167 IDM Loss 0.256765 Top1 85.000000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - ===================================================================================================================
INFO - Max Training [20][ 40/ 135] Loss 0.435212 QE Loss 0.000000 Distribution Loss 0.000000 IDM Loss 0.000000 Top1 85.625000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 0 Training [20][ 40/ 135] Loss 0.238588 QE Loss 0.000000 Distribution Loss -0.010820 IDM Loss 0.248704 Top1 84.062500 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 1 Training [20][ 40/ 135] Loss 0.229547 QE Loss 0.000000 Distribution Loss -0.012565 IDM Loss 0.280446 Top1 83.437500 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 2 Training [20][ 40/ 135] Loss 0.228397 QE Loss 0.000000 Distribution Loss -0.009888 IDM Loss 0.241818 Top1 84.687500 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - ===================================================================================================================
INFO - Max Training [20][ 60/ 135] Loss 0.443414 QE Loss 0.000000 Distribution Loss 0.000000 IDM Loss 0.000000 Top1 83.541667 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 0 Training [20][ 60/ 135] Loss 0.236105 QE Loss 0.000000 Distribution Loss -0.011834 IDM Loss 0.267503 Top1 81.875000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 1 Training [20][ 60/ 135] Loss 0.237189 QE Loss 0.000000 Distribution Loss -0.012289 IDM Loss 0.264025 Top1 81.875000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 2 Training [20][ 60/ 135] Loss 0.233082 QE Loss 0.000000 Distribution Loss -0.011429 IDM Loss 0.260655 Top1 83.333333 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - ===================================================================================================================
INFO - Max Training [20][ 80/ 135] Loss 0.435657 QE Loss 0.000000 Distribution Loss 0.000000 IDM Loss 0.000000 Top1 83.125000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 0 Training [20][ 80/ 135] Loss 0.235606 QE Loss 0.000000 Distribution Loss -0.011172 IDM Loss 0.258127 Top1 81.718750 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 1 Training [20][ 80/ 135] Loss 0.239393 QE Loss 0.000000 Distribution Loss -0.012690 IDM Loss 0.269106 Top1 81.562500 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 2 Training [20][ 80/ 135] Loss 0.236801 QE Loss 0.000000 Distribution Loss -0.012568 IDM Loss 0.282858 Top1 82.343750 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - ===================================================================================================================
INFO - Max Training [20][ 100/ 135] Loss 0.449233 QE Loss 0.000000 Distribution Loss 0.000000 IDM Loss 0.000000 Top1 83.250000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 0 Training [20][ 100/ 135] Loss 0.245851 QE Loss 0.000000 Distribution Loss -0.011134 IDM Loss 0.262823 Top1 80.750000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 1 Training [20][ 100/ 135] Loss 0.252130 QE Loss 0.000000 Distribution Loss -0.012596 IDM Loss 0.262081 Top1 81.250000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 2 Training [20][ 100/ 135] Loss 0.244423 QE Loss 0.000000 Distribution Loss -0.012493 IDM Loss 0.288368 Top1 81.750000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - ===================================================================================================================
INFO - Max Training [20][ 120/ 135] Loss 0.455972 QE Loss 0.000000 Distribution Loss 0.000000 IDM Loss 0.000000 Top1 82.812500 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 0 Training [20][ 120/ 135] Loss 0.249014 QE Loss 0.000000 Distribution Loss -0.010803 IDM Loss 0.255996 Top1 80.312500 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 1 Training [20][ 120/ 135] Loss 0.256232 QE Loss 0.000000 Distribution Loss -0.012910 IDM Loss 0.267811 Top1 80.729167 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 2 Training [20][ 120/ 135] Loss 0.249085 QE Loss 0.000000 Distribution Loss -0.013087 IDM Loss 0.290680 Top1 81.250000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - ===================================================================================================================
INFO - ==> Max Top1: 83.022 Top5: 100.000 Loss: 0.450
INFO - ==> Mixed 0 Top1: 80.970 Top5: 100.000 Loss: 0.245
INFO - ==> Mixed 1 Top1: 81.437 Top5: 100.000 Loss: 0.253
INFO - ==> Mixed 2 Top1: 81.623 Top5: 100.000 Loss: 0.247
INFO - Scoreboard best 1 ==> Epoch [20][Top1: 0.000 Top5: 0.000]
INFO - Scoreboard best 2 ==> Epoch [19][Top1: 0.000 Top5: 0.000]
INFO - Scoreboard best 3 ==> Epoch [18][Top1: 0.000 Top5: 0.000]
Just noticed you have quite a lot of GPUs. I only have 2, and I got it running using just 1.
Hi, which version of PyTorch are you using? We have tested our code with 1.13.0. Meanwhile, please try modifying L79 of main.py to:
model.cuda()
model = DistributedDataParallel(model, device_ids=[args.local_rank], find_unused_parameters=True)
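For context, here is a minimal, self-contained sketch of what that suggested change could look like (illustrative only, not the repository's exact code; it assumes the process group has already been initialized, that `local_rank` holds this process's local rank, and the added `torch.cuda.set_device` call is only there so that `.cuda()` targets that rank's GPU):

import torch
from torch.nn.parallel import DistributedDataParallel


def wrap_with_ddp(model: torch.nn.Module, local_rank: int) -> DistributedDataParallel:
    # Make this process's default CUDA device its own GPU, so that
    # .cuda() below places the parameters where DDP expects them.
    torch.cuda.set_device(local_rank)
    model.cuda()
    return DistributedDataParallel(
        model,
        device_ids=[local_rank],
        find_unused_parameters=True,
    )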
My PyTorch version is 2.4.0. I have modified my code as suggested, but it still does not work.
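One thing worth checking with PyTorch 2.x: the dumped config above shows 'local_rank': 0 in every process, and the NCCL error says that several ranks landed on the same CUDA device. torchrun exposes each worker's local rank through the LOCAL_RANK environment variable rather than a --local_rank command-line argument, so a common pattern is to read it from the environment and pin the device before building the model and wrapping it in DDP. A minimal sketch, assuming an NCCL backend and a torchrun launch (the names here are illustrative, not taken from this repository):

import os

import torch
import torch.distributed as dist

# torchrun sets LOCAL_RANK (along with RANK and WORLD_SIZE) for every worker process.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)        # one GPU per local rank
dist.init_process_group(backend="nccl")  # rank/world size are read from the environment

print(f"global rank {dist.get_rank()} is using cuda:{torch.cuda.current_device()}")

If the config's local_rank stays 0 for every rank, DistributedDataParallel(x.cuda(), device_ids=[configs.local_rank]) would put all four processes on GPU 0, which would match the "Duplicate GPU detected" message.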
I am honestly done with some people. I helped him in good faith and told him how I got it working, and he messes with me like this and blocks me.