AAAtourist opened this issue 1 month ago
I am running this project on RTX A6000 (48 GB) GPUs and hit some errors. My command is:
torchrun --nproc_per_node=4 main.py configs/training/train_resnet18_w2to6_a2to6.yaml
nvidia-smi

NVIDIA-SMI 535.183.01    Driver Version: 535.183.01    CUDA Version: 12.2
(Persistence-M: Off, Disp.A: Off, Volatile Uncorr. ECC: Off, Compute M.: Default, MIG M.: N/A for all GPUs)

GPU  Name              Bus-Id            Fan  Temp  Perf  Pwr:Usage/Cap  Memory-Usage         GPU-Util
  0  NVIDIA RTX A6000  00000000:01:00.0  37%   64C   P2    123W / 300W   2374MiB / 49140MiB   32%
  1  NVIDIA RTX A6000  00000000:25:00.0  30%   57C   P2    107W / 300W   1984MiB / 49140MiB   35%
  2  NVIDIA RTX A6000  00000000:41:00.0  35%   63C   P2    129W / 300W   1980MiB / 49140MiB   34%
  3  NVIDIA RTX A6000  00000000:61:00.0  30%   58C   P2    130W / 300W   1958MiB / 49140MiB   34%
  4  NVIDIA RTX A6000  00000000:81:00.0  36%   63C   P2    135W / 300W   1994MiB / 49140MiB   31%
  5  NVIDIA RTX A6000  00000000:A1:00.0  30%   58C   P2    136W / 300W   1980MiB / 49140MiB   31%
  6  NVIDIA RTX A6000  00000000:C1:00.0  36%   64C   P2    129W / 300W   1960MiB / 49140MiB   23%
  7  NVIDIA RTX A6000  00000000:E1:00.0  35%   63C   P2    129W / 300W   1906MiB / 49140MiB   31%

Processes:
GPU  PID      Type  Process name  GPU Memory Usage
  0  3135477  C     python        2364MiB
  1  3135477  C     python        1974MiB
  2  3135477  C     python        1972MiB
  3  3135477  C     python        1952MiB
  4  3135477  C     python        1972MiB
  5  3135477  C     python        1972MiB
  6  3135477  C     python        1938MiB
  7  3135477  C     python        1898MiB
I have attached the error below:
W0731 12:48:42.765966 140248971256960 torch/distributed/run.py:779] W0731 12:48:42.765966 140248971256960 torch/distributed/run.py:779] ***************************************** W0731 12:48:42.765966 140248971256960 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0731 12:48:42.765966 140248971256960 torch/distributed/run.py:779] ***************************************** Munch({'name': 'resnet18', 'output_dir': 'training', 'training_device': 'gpu', 'num_random_path': 3, 'target_bits': [6, 5, 4, 3, 2], 'post_training_batchnorm_calibration': True, 'information_distortion_mitigation': True, 'enable_dynamic_bit_training': True, 'kd': False, 'dataloader': Munch({'dataset': 'imagenet', 'num_classes': 1000, 'path': '/data/dataset/imagenet', 'batch_size': 128, 'workers': 8, 'deterministic': True}), 'resume': Munch({'path': None, 'lean': False}), 'log': Munch({'num_best_scores': 3, 'print_freq': 20}), 'arch': 'resnet18', 'pre_trained': True, 'quan': Munch({'act': Munch({'mode': 'lsq', 'bit': 2, 'per_channel': False, 'symmetric': False, 'all_positive': True}), 'weight': Munch({'mode': 'lsq', 'bit': 2, 'per_channel': False, 'symmetric': False, 'all_positive': False}), 'excepts': Munch({'excepts_bits_width': 8, 'conv1': Munch({'act': Munch({'bit': None, 'all_positive': False}), 'weight': Munch({'bit': None})}), 'bn1': Munch({'act': Munch({'bit': None}), 'weight': Munch({'bit': None})}), 'fc': Munch({'act': Munch({'bit': None}), 'weight': Munch({'bit': None})})})}), 'eval': False, 'epochs': 150, 'smoothing': 0.0, 'scale_gradient': False, 'opt': 'sgd', 'lr': 0.04, 'momentum': 0.9, 'weight_decay': 2.5e-05, 'adaptive_region_weight_decay': 0, 'sched': 'cosine', 'min_lr': 1e-05, 'decay_rate': 0.1, 'warmup_epochs': 5, 'warmup_lr': 1e-05, 'decay_epochs': 30, 'cooldown_epochs': 0, 'ema_decay': 0.9997, 'local_rank': 0, 'split_aw_cands': False}) Munch({'name': 'resnet18', 'output_dir': 'training', 'training_device': 'gpu', 'num_random_path': 3, 'target_bits': [6, 5, 4, 3, 2], 'post_training_batchnorm_calibration': True, 'information_distortion_mitigation': True, 'enable_dynamic_bit_training': True, 'kd': False, 'dataloader': Munch({'dataset': 'imagenet', 'num_classes': 1000, 'path': '/data/dataset/imagenet', 'batch_size': 128, 'workers': 8, 'deterministic': True}), 'resume': Munch({'path': None, 'lean': False}), 'log': Munch({'num_best_scores': 3, 'print_freq': 20}), 'arch': 'resnet18', 'pre_trained': True, 'quan': Munch({'act': Munch({'mode': 'lsq', 'bit': 2, 'per_channel': False, 'symmetric': False, 'all_positive': True}), 'weight': Munch({'mode': 'lsq', 'bit': 2, 'per_channel': False, 'symmetric': False, 'all_positive': False}), 'excepts': Munch({'excepts_bits_width': 8, 'conv1': Munch({'act': Munch({'bit': None, 'all_positive': False}), 'weight': Munch({'bit': None})}), 'bn1': Munch({'act': Munch({'bit': None}), 'weight': Munch({'bit': None})}), 'fc': Munch({'act': Munch({'bit': None}), 'weight': Munch({'bit': None})})})}), 'eval': False, 'epochs': 150, 'smoothing': 0.0, 'scale_gradient': False, 'opt': 'sgd', 'lr': 0.04, 'momentum': 0.9, 'weight_decay': 2.5e-05, 'adaptive_region_weight_decay': 0, 'sched': 'cosine', 'min_lr': 1e-05, 'decay_rate': 0.1, 'warmup_epochs': 5, 'warmup_lr': 1e-05, 'decay_epochs': 30, 'cooldown_epochs': 0, 'ema_decay': 0.9997, 'local_rank': 0, 'split_aw_cands': False}) Munch({'name': 'resnet18', 
'output_dir': 'training', 'training_device': 'gpu', 'num_random_path': 3, 'target_bits': [6, 5, 4, 3, 2], 'post_training_batchnorm_calibration': True, 'information_distortion_mitigation': True, 'enable_dynamic_bit_training': True, 'kd': False, 'dataloader': Munch({'dataset': 'imagenet', 'num_classes': 1000, 'path': '/data/dataset/imagenet', 'batch_size': 128, 'workers': 8, 'deterministic': True}), 'resume': Munch({'path': None, 'lean': False}), 'log': Munch({'num_best_scores': 3, 'print_freq': 20}), 'arch': 'resnet18', 'pre_trained': True, 'quan': Munch({'act': Munch({'mode': 'lsq', 'bit': 2, 'per_channel': False, 'symmetric': False, 'all_positive': True}), 'weight': Munch({'mode': 'lsq', 'bit': 2, 'per_channel': False, 'symmetric': False, 'all_positive': False}), 'excepts': Munch({'excepts_bits_width': 8, 'conv1': Munch({'act': Munch({'bit': None, 'all_positive': False}), 'weight': Munch({'bit': None})}), 'bn1': Munch({'act': Munch({'bit': None}), 'weight': Munch({'bit': None})}), 'fc': Munch({'act': Munch({'bit': None}), 'weight': Munch({'bit': None})})})}), 'eval': False, 'epochs': 150, 'smoothing': 0.0, 'scale_gradient': False, 'opt': 'sgd', 'lr': 0.04, 'momentum': 0.9, 'weight_decay': 2.5e-05, 'adaptive_region_weight_decay': 0, 'sched': 'cosine', 'min_lr': 1e-05, 'decay_rate': 0.1, 'warmup_epochs': 5, 'warmup_lr': 1e-05, 'decay_epochs': 30, 'cooldown_epochs': 0, 'ema_decay': 0.9997, 'local_rank': 0, 'split_aw_cands': False}) Munch({'name': 'resnet18', 'output_dir': 'training', 'training_device': 'gpu', 'num_random_path': 3, 'target_bits': [6, 5, 4, 3, 2], 'post_training_batchnorm_calibration': True, 'information_distortion_mitigation': True, 'enable_dynamic_bit_training': True, 'kd': False, 'dataloader': Munch({'dataset': 'imagenet', 'num_classes': 1000, 'path': '/data/dataset/imagenet', 'batch_size': 128, 'workers': 8, 'deterministic': True}), 'resume': Munch({'path': None, 'lean': False}), 'log': Munch({'num_best_scores': 3, 'print_freq': 20}), 'arch': 'resnet18', 'pre_trained': True, 'quan': Munch({'act': Munch({'mode': 'lsq', 'bit': 2, 'per_channel': False, 'symmetric': False, 'all_positive': True}), 'weight': Munch({'mode': 'lsq', 'bit': 2, 'per_channel': False, 'symmetric': False, 'all_positive': False}), 'excepts': Munch({'excepts_bits_width': 8, 'conv1': Munch({'act': Munch({'bit': None, 'all_positive': False}), 'weight': Munch({'bit': None})}), 'bn1': Munch({'act': Munch({'bit': None}), 'weight': Munch({'bit': None})}), 'fc': Munch({'act': Munch({'bit': None}), 'weight': Munch({'bit': None})})})}), 'eval': False, 'epochs': 150, 'smoothing': 0.0, 'scale_gradient': False, 'opt': 'sgd', 'lr': 0.04, 'momentum': 0.9, 'weight_decay': 2.5e-05, 'adaptive_region_weight_decay': 0, 'sched': 'cosine', 'min_lr': 1e-05, 'decay_rate': 0.1, 'warmup_epochs': 5, 'warmup_lr': 1e-05, 'decay_epochs': 30, 'cooldown_epochs': 0, 'ema_decay': 0.9997, 'local_rank': 0, 'split_aw_cands': False}) INFO - Log file for this run: /data/user/tourist/retraining-free-quantization/training/resnet18/resnet18.log INFO - TensorBoard data directory: /data/user/tourist/retraining-free-quantization/training/resnet18/tb_runs /home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/timm/models/_factory.py:117: UserWarning: Mapping deprecated model name gluon_resnet18_v1b to current resnet18.gluon_in1k. model = create_fn( /home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/timm/models/_factory.py:117: UserWarning: Mapping deprecated model name gluon_resnet18_v1b to current resnet18.gluon_in1k. 
model = create_fn( /home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/timm/models/_factory.py:117: UserWarning: Mapping deprecated model name gluon_resnet18_v1b to current resnet18.gluon_in1k. model = create_fn( /home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/timm/models/_factory.py:117: UserWarning: Mapping deprecated model name gluon_resnet18_v1b to current resnet18.gluon_in1k. model = create_fn( layer QuanConv2d not using splited a_w cands!layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! INFO - Created `resnet18` model for `imagenet` dataset Use pre-trained model = True INFO - Inserted quantizers into the original model layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! 
layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands! layer QuanConv2d not using splited a_w cands!
[rank1]: Traceback (most recent call last):
[rank1]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 232, in <module>
[rank1]:     main()
[rank1]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 79, in main
[rank1]:     model = wrap_the_model_with_ddp(model)
[rank1]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 77, in <lambda>
[rank1]:     wrap_the_model_with_ddp = lambda x: DistributedDataParallel(x.cuda(), device_ids=[configs.local_rank], find_unused_parameters=True)
[rank1]:   File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__
[rank1]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank1]:   File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes
[rank1]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.20.5
[rank1]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank1]: Last error:
[rank1]: Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1000
[rank2]: Traceback (most recent call last):
[rank2]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 232, in <module>
[rank2]:     main()
[rank2]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 79, in main
[rank2]:     model = wrap_the_model_with_ddp(model)
[rank2]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 77, in <lambda>
[rank2]:     wrap_the_model_with_ddp = lambda x: DistributedDataParallel(x.cuda(), device_ids=[configs.local_rank], find_unused_parameters=True)
[rank2]:   File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__
[rank2]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank2]:   File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes
[rank2]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank2]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.20.5
[rank2]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank2]: Last error:
[rank2]: Duplicate GPU detected : rank 2 and rank 0 both on CUDA device 1000
[rank3]: Traceback (most recent call last):
[rank3]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 232, in <module>
[rank3]:     main()
[rank3]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 79, in main
[rank3]:     model = wrap_the_model_with_ddp(model)
[rank3]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 77, in <lambda>
[rank3]:     wrap_the_model_with_ddp = lambda x: DistributedDataParallel(x.cuda(), device_ids=[configs.local_rank], find_unused_parameters=True)
[rank3]:   File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__
[rank3]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank3]:   File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes
[rank3]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.20.5
[rank3]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank3]: Last error:
[rank3]: Duplicate GPU detected : rank 3 and rank 0 both on CUDA device 1000
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 232, in <module>
[rank0]:     main()
[rank0]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 79, in main
[rank0]:     model = wrap_the_model_with_ddp(model)
[rank0]:   File "/data/user/tourist/retraining-free-quantization/main.py", line 77, in <lambda>
[rank0]:     wrap_the_model_with_ddp = lambda x: DistributedDataParallel(x.cuda(), device_ids=[configs.local_rank], find_unused_parameters=True)
[rank0]:   File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__
[rank0]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank0]:   File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes
[rank0]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.20.5
[rank0]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank0]: Last error:
[rank0]: Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1000
W0731 12:48:48.435105 140248971256960 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1049238 closing signal SIGTERM
W0731 12:48:48.435611 140248971256960 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1049240 closing signal SIGTERM
E0731 12:48:48.550357 140248971256960 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 1049239) of binary: /home/tourist/.conda/envs/RFQuant/bin/python
Traceback (most recent call last):
  File "/home/tourist/.conda/envs/RFQuant/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/tourist/.conda/envs/RFQuant/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-31_12:48:48
  host      : leo
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1049241)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-31_12:48:48
  host      : leo
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1049239)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
This error of yours is very simple. I have already gotten the code running; feel free to contact me on QQ: 2475684972.
INFO - >>>>>>>> Epoch 20
Bit-width candidates: [6, 5, 4, 3, 2]
INFO - Training: 1076 samples (8 per mini-batch)
INFO - Max Training [20][ 20/ 135] Loss 0.392159 QE Loss 0.000000 Distribution Loss 0.000000 IDM Loss 0.000000 Top1 84.375000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 0 Training [20][ 20/ 135] Loss 0.214411 QE Loss 0.000000 Distribution Loss -0.011015 IDM Loss 0.268419 Top1 84.375000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 1 Training [20][ 20/ 135] Loss 0.211393 QE Loss 0.000000 Distribution Loss -0.013542 IDM Loss 0.315946 Top1 83.750000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 2 Training [20][ 20/ 135] Loss 0.198101 QE Loss 0.000000 Distribution Loss -0.009167 IDM Loss 0.256765 Top1 85.000000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - ===================================================================================================================
INFO - Max Training [20][ 40/ 135] Loss 0.435212 QE Loss 0.000000 Distribution Loss 0.000000 IDM Loss 0.000000 Top1 85.625000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 0 Training [20][ 40/ 135] Loss 0.238588 QE Loss 0.000000 Distribution Loss -0.010820 IDM Loss 0.248704 Top1 84.062500 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 1 Training [20][ 40/ 135] Loss 0.229547 QE Loss 0.000000 Distribution Loss -0.012565 IDM Loss 0.280446 Top1 83.437500 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 2 Training [20][ 40/ 135] Loss 0.228397 QE Loss 0.000000 Distribution Loss -0.009888 IDM Loss 0.241818 Top1 84.687500 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - ===================================================================================================================
INFO - Max Training [20][ 60/ 135] Loss 0.443414 QE Loss 0.000000 Distribution Loss 0.000000 IDM Loss 0.000000 Top1 83.541667 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 0 Training [20][ 60/ 135] Loss 0.236105 QE Loss 0.000000 Distribution Loss -0.011834 IDM Loss 0.267503 Top1 81.875000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 1 Training [20][ 60/ 135] Loss 0.237189 QE Loss 0.000000 Distribution Loss -0.012289 IDM Loss 0.264025 Top1 81.875000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 2 Training [20][ 60/ 135] Loss 0.233082 QE Loss 0.000000 Distribution Loss -0.011429 IDM Loss 0.260655 Top1 83.333333 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - ===================================================================================================================
INFO - Max Training [20][ 80/ 135] Loss 0.435657 QE Loss 0.000000 Distribution Loss 0.000000 IDM Loss 0.000000 Top1 83.125000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 0 Training [20][ 80/ 135] Loss 0.235606 QE Loss 0.000000 Distribution Loss -0.011172 IDM Loss 0.258127 Top1 81.718750 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 1 Training [20][ 80/ 135] Loss 0.239393 QE Loss 0.000000 Distribution Loss -0.012690 IDM Loss 0.269106 Top1 81.562500 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 2 Training [20][ 80/ 135] Loss 0.236801 QE Loss 0.000000 Distribution Loss -0.012568 IDM Loss 0.282858 Top1 82.343750 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - ===================================================================================================================
INFO - Max Training [20][ 100/ 135] Loss 0.449233 QE Loss 0.000000 Distribution Loss 0.000000 IDM Loss 0.000000 Top1 83.250000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 0 Training [20][ 100/ 135] Loss 0.245851 QE Loss 0.000000 Distribution Loss -0.011134 IDM Loss 0.262823 Top1 80.750000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 1 Training [20][ 100/ 135] Loss 0.252130 QE Loss 0.000000 Distribution Loss -0.012596 IDM Loss 0.262081 Top1 81.250000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 2 Training [20][ 100/ 135] Loss 0.244423 QE Loss 0.000000 Distribution Loss -0.012493 IDM Loss 0.288368 Top1 81.750000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - ===================================================================================================================
INFO - Max Training [20][ 120/ 135] Loss 0.455972 QE Loss 0.000000 Distribution Loss 0.000000 IDM Loss 0.000000 Top1 82.812500 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 0 Training [20][ 120/ 135] Loss 0.249014 QE Loss 0.000000 Distribution Loss -0.010803 IDM Loss 0.255996 Top1 80.312500 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 1 Training [20][ 120/ 135] Loss 0.256232 QE Loss 0.000000 Distribution Loss -0.012910 IDM Loss 0.267811 Top1 80.729167 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - Mixed 2 Training [20][ 120/ 135] Loss 0.249085 QE Loss 0.000000 Distribution Loss -0.013087 IDM Loss 0.290680 Top1 81.250000 Top5 100.000000 LR 0.038271 QLR 0.000010
INFO - ===================================================================================================================
INFO - ==> Max Top1: 83.022 Top5: 100.000 Loss: 0.450
INFO - ==> Mixed 0 Top1: 80.970 Top5: 100.000 Loss: 0.245
INFO - ==> Mixed 1 Top1: 81.437 Top5: 100.000 Loss: 0.253
INFO - ==> Mixed 2 Top1: 81.623 Top5: 100.000 Loss: 0.247
INFO - Scoreboard best 1 ==> Epoch [20][Top1: 0.000 Top5: 0.000]
INFO - Scoreboard best 2 ==> Epoch [19][Top1: 0.000 Top5: 0.000]
INFO - Scoreboard best 3 ==> Epoch [18][Top1: 0.000 Top5: 0.000]
Just noticed you have quite a lot of GPUs. I only have 2, and I got it running using just 1.
Hi, which version of PyTorch are you using? We have tested our code with 1.13.0. Meanwhile, please try modifying L79 of main.py to:
model.cuda()
model = DistributedDataParallel(model, device_ids=[args.local_rank], find_unused_parameters=True)
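For context, here is a minimal, self-contained sketch of what that suggested change could look like (illustrative only, not the repository's exact code; it assumes the process group has already been initialized, that `local_rank` holds this process's local rank, and the added `torch.cuda.set_device` call is only there so that `.cuda()` targets that rank's GPU):

import torch
from torch.nn.parallel import DistributedDataParallel


def wrap_with_ddp(model: torch.nn.Module, local_rank: int) -> DistributedDataParallel:
    # Make this process's default CUDA device its own GPU, so that
    # .cuda() below places the parameters where DDP expects them.
    torch.cuda.set_device(local_rank)
    model.cuda()
    return DistributedDataParallel(
        model,
        device_ids=[local_rank],
        find_unused_parameters=True,
    )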
My PyTorch version is 2.4.0. I have modified my code as suggested, but it still does not work.
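One thing worth checking with PyTorch 2.x: the dumped config above shows 'local_rank': 0 in every process, and the NCCL error says that several ranks landed on the same CUDA device. torchrun exposes each worker's local rank through the LOCAL_RANK environment variable rather than a --local_rank command-line argument, so a common pattern is to read it from the environment and pin the device before building the model and wrapping it in DDP. A minimal sketch, assuming an NCCL backend and a torchrun launch (the names here are illustrative, not taken from this repository):

import os

import torch
import torch.distributed as dist

# torchrun sets LOCAL_RANK (along with RANK and WORLD_SIZE) for every worker process.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)        # one GPU per local rank
dist.init_process_group(backend="nccl")  # rank/world size are read from the environment

print(f"global rank {dist.get_rank()} is using cuda:{torch.cuda.current_device()}")

If the config's local_rank stays 0 for every rank, DistributedDataParallel(x.cuda(), device_ids=[configs.local_rank]) would put all four processes on GPU 0, which would match the "Duplicate GPU detected" message.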
I am honestly done with some people. I helped him in good faith and told him how I got it working, and he messes with me like this and blocks me.