Luoyadan / CRB-active-3Ddet

The official PyTorch implementation of "Exploring Active 3D Object Detection from a Generalization Perspective" (ICLR Spotlight 2023).

Different errors occurred in crb_sampling.py with both single GPU and multi-GPU #3

Closed lcc815 closed 1 year ago

lcc815 commented 1 year ago

Hi, could you please help me solve an error? Here is the ACTIVE_TRAIN section of my pv_rcnn_active_crb.yaml config:

ACTIVE_TRAIN:
    METHOD: crb
    AGGREGATION: mean

    PRE_TRAIN_SAMPLE_NUMS: 100  
    PRE_TRAIN_EPOCH_NUMS: 1
    TRAIN_RESUME: False

    SELECT_NUMS: 100
    SELECT_LABEL_EPOCH_INTERVAL: 1

    TOTAL_BUDGET_NUMS: 200

    ACTIVE_CONFIG:
        K1: 5
        K2: 3
        BANDWIDTH: 5
        CLUSTERING: kmeans++

I ran python3 ./train.py --cfg_file cfgs/active-kitti_models/pv_rcnn_active_crb.yaml --batch_size 4 --fix_random_seed --max_ckpt_save_num 200 --ckpt_save_interval 1 and got the following error:

Traceback (most recent call last):
  File "./train.py", line 314, in <module>
    main()
  File "./train.py", line 233, in main
    train_func(
  File "/*/CRB-active-3Ddet/tools/train_utils/train_active_utils.py", line 305, in train_model_active
    = active_training_utils.select_active_labels(
  File "/*/CRB-active-3Ddet/tools/../pcdet/utils/active_training_utils.py", line 271, in select_active_labels
    selected_frames = strategy.query(leave_pbar, cur_epoch)
  File "/*/CRB-active-3Ddet/tools/../pcdet/query_strategies/crb_sampling.py", line 264, in query
    x_axis = [np.linspace(-50, int(global_density_max[i])+50, 400) for i in range(num_class)]
  File "/*/CRB-active-3Ddet/tools/../pcdet/query_strategies/crb_sampling.py", line 264, in <listcomp>
    x_axis = [np.linspace(-50, int(global_density_max[i])+50, 400) for i in range(num_class)]
IndexError: list index out of range

I have checked that the variable global_density_max is an empty tensor.
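For reference, the failure mode can be reproduced in isolation (a standalone sketch, not code from the repo; num_class = 3 assumes the usual KITTI Car/Pedestrian/Cyclist setup):

import numpy as np

# When the detector returns no predictions, the per-class density statistics
# stay empty, and indexing them in the list comprehension at crb_sampling.py
# line 264 raises the IndexError shown above.
global_density_max = []   # empty, as observed in my run
num_class = 3             # Car / Pedestrian / Cyclist (assumed KITTI setup)

x_axis = [np.linspace(-50, int(global_density_max[i]) + 50, 400)
          for i in range(num_class)]   # IndexError: list index out of range at i = 0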

The full log is:

...(skip model info)...
2023-02-08 03:57:51,634   INFO  **********************Start training active-kitti_models/pv_rcnn_active_crb(select-100)**********************
2023-02-08 03:57:51,635   INFO  ***** Start Active Pre-train *****
train: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:47<00:00,  1.92s/it, total_it=25]
epochs: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:49<00:00, 49.74s/it, loss=5.73, lr=0.000432]
2023-02-08 03:58:41,381   INFO  ***** Complete Active Pre-train *****
2023-02-08 03:58:41,381   INFO  ***** Start Active Train Loop *****
epochs:   0%|                                                                                                                                      | 0/2 [00:00<?, ?it/s**found and enabled 3 Dropout layers for random sampling**                                                                                        | 0/903 [00:00<?, ?it/s]
evaluating_unlabelled_set_epoch_1: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 903/903 [05:16<00:00,  2.85it/s]
inf_grads_unlabelled_set_epoch_1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [03:43<00:00,  2.24it/s]
--- {kmeans++} running time: 31.196898698806763 seconds for fc grads---████████████████████████████████████████████████████████████████| 500/500 [03:43<00:00,  2.21it/s]
epochs:   0%|                                                                                                                                      | 0/2 [09:32<?, ?it/s]
Traceback (most recent call last):
  File "./train.py", line 314, in <module>
    main()
  File "./train.py", line 233, in main
    train_func(
  File "/*/CRB-active-3Ddet/tools/train_utils/train_active_utils.py", line 305, in train_model_active
    = active_training_utils.select_active_labels(
  File "/*/CRB-active-3Ddet/tools/../pcdet/utils/active_training_utils.py", line 271, in select_active_labels
    selected_frames = strategy.query(leave_pbar, cur_epoch)
  File "*/CRB-active-3Ddet/tools/../pcdet/query_strategies/crb_sampling.py", line 264, in query
    x_axis = [np.linspace(-50, int(global_density_max[i])+50, 400) for i in range(num_class)]
  File "/*/CRB-active-3Ddet/tools/../pcdet/query_strategies/crb_sampling.py", line 264, in <listcomp>
    x_axis = [np.linspace(-50, int(global_density_max[i])+50, 400) for i in range(num_class)]
IndexError: list index out of range

Actually, I got a different error when I ran with multiple GPUs and the same config:

2023-02-08 08:38:48,753   INFO  **********************Start training active-kitti_models/pv_rcnn_active_crb(select-100)**********************
2023-02-08 08:38:48,753   INFO  ***** Start Active Pre-train *****
train: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:11<00:00,  2.93s/it, total_it=4]
epochs: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:46<00:00, 46.46s/it, loss=118, lr=0.0075]
2023-02-08 08:39:35,218   INFO  ***** Complete Active Pre-train *****
2023-02-08 08:39:35,218   INFO  ***** Start Active Train Loop *****
epochs:   0%|                                                                                                                                      | 0/2 [00:00<?, ?it/s**found and enabled 3 Dropout layers for random sampling**                                                                                        | 0/113 [00:00<?, ?it/s]
**found and enabled 3 Dropout layers for random sampling**
**found and enabled 3 Dropout layers for random sampling**
**found and enabled 3 Dropout layers for random sampling**
**found and enabled 3 Dropout layers for random sampling**
**found and enabled 3 Dropout layers for random sampling**
**found and enabled 3 Dropout layers for random sampling**
**found and enabled 3 Dropout layers for random sampling**
evaluating_unlabelled_set_epoch_1: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 113/113 [00:41<00:00,  2.74it/s]
Traceback (most recent call last):                                                                                                                                       
  File "train.py", line 314, in <module>                                                                                                 | 1/452 [00:02<15:24,  2.05s/it]
    main()
  File "train.py", line 233, in main
    train_func(
  File "/`/CRB-active-3Ddet/tools/train_utils/train_active_utils.py", line 305, in train_model_active
Traceback (most recent call last):                                                                                                                                       
  File "train.py", line 314, in <module>
    main()
  File "train.py", line 233, in main
    = active_training_utils.select_active_labels(
  File "/`/CRB-active-3Ddet/tools/../pcdet/utils/active_training_utils.py", line 271, in select_active_labels
    train_func(
  File "/`/CRB-active-3Ddet/tools/train_utils/train_active_utils.py", line 305, in train_model_active
epochs:   0%|                                                                                                                                      | 0/2 [01:19<?, ?it/s]
    selected_frames = strategy.query(leave_pbar, cur_epoch)
Traceback (most recent call last):
      File "train.py", line 314, in <module>
  File "`CRB-active-3Ddet/tools/../pcdet/query_strategies/crb_sampling.py", line 184, in query
= active_training_utils.select_active_labels(
  File "/`/CRB-active-3Ddet/tools/../pcdet/utils/active_training_utils.py", line 271, in select_active_labels
    selected_frames = strategy.query(leave_pbar, cur_epoch)
  File "/`/CRB-active-3Ddet/tools/../pcdet/query_strategies/crb_sampling.py", line 184, in query
    pred_dicts, _, _= self.model(unlabelled_batch)
  File "/opt/conda/envs/mining/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    main()
  File "train.py", line 233, in main
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/mining/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 692, in forward
    pred_dicts, _, _= self.model(unlabelled_batch)
  File "/opt/conda/envs/mining/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
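If only the DDP gradient bookkeeping trips here, the message itself points at a possible workaround: wrap the model with find_unused_parameters=True. A minimal sketch of what such wrapping could look like (not the repo's actual training code; it assumes torch.distributed.init_process_group has already been called and that the launcher exports LOCAL_RANK):

import os
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model_for_ddp(model: nn.Module) -> nn.Module:
    # Assumes the process group is already initialised by the training script
    # and that the launcher (torchrun / torch.distributed.launch --use_env)
    # sets LOCAL_RANK for each process.
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    # find_unused_parameters=True lets DDP tolerate parameters that do not
    # contribute to the loss in a given forward pass, e.g. the inference-style
    # forward used during active selection.
    return DDP(model, device_ids=[local_rank], find_unused_parameters=True)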
Luoyadan commented 1 year ago

Hi lcc815,

May I double-check with you whether density_list is empty? https://github.com/Luoyadan/CRB-active-3Ddet/blob/7c2207c218e735e6362f958fe1db34ff9a401d59/pcdet/query_strategies/crb_sampling.py#L102
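For example, a quick check like the following right after that line would show whether anything was collected (a rough debugging sketch, not verbatim code from the repo; it only assumes density_list is a per-class container):

# Rough debugging sketch: report how many density values were gathered per
# class before the KDE / linspace step.
print('density_list size:', len(density_list))
items = density_list.items() if isinstance(density_list, dict) else enumerate(density_list)
for key, values in items:
    print(f'  class {key}: {len(values)} entries')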

We tested our code on a single GPU only and haven't encountered such an error.

Cheers, Yadan

Luoyadan commented 1 year ago

We have now uploaded our checkpoints (PV-RCNN and SECOND) on the KITTI dataset via https://drive.google.com/drive/folders/1PMb6tu84AIw66vCRrMBCHpnBeL5WMkuv?usp=sharing.

You can put them under the folder 'output/active-kitti_models'; the test and visualization scripts can then be run directly.

lcc815 commented 1 year ago

Thanks for your help. It turns out that the above-mentioned error occurs because I only trained for a few epochs on very little data, in which case the model is not trained well and outputs no predictions.
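In case it helps others who run with a tiny budget, a simple guard before the linspace step would make this failure explicit instead of an IndexError (my own sketch, not code from the repo):

# Sketch of a guard in crb_sampling.py before building x_axis: fail with a
# clear message when the detector produced no predictions on the unlabelled pool.
if len(global_density_max) == 0:
    raise RuntimeError(
        'No predictions collected from the unlabelled pool; the detector is '
        'likely under-trained. Increase PRE_TRAIN_EPOCH_NUMS / '
        'PRE_TRAIN_SAMPLE_NUMS before running CRB selection.'
    )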