feifeiobama / OrthogonalDet

[CVPR 2024] Exploring Orthogonality in Open World Object Detection

ValueError: loaded state dict has a different number of parameter groups #2

Closed. WangPingA closed this issue 16 hours ago.

WangPingA commented 3 weeks ago

Hello, after I finished training t1, I wanted to train t2 and ran into the following error:

[06/20 09:19:56 fvcore.common.checkpoint]: Loading trainer from output/M-OWODB/model_final.pth ...
[06/20 09:19:56 d2.engine.hooks]: Loading scheduler from state_dict ...
Traceback (most recent call last):
  File "train_net.py", line 282, in <module>
    launch(
  File "/root/Desktop/OrthogonalDet/detectron2/detectron2/engine/launch.py", line 84, in launch
    main_func(*args)
  File "train_net.py", line 273, in main
    trainer.resume_or_load(resume=args.resume)
  File "/root/Desktop/OrthogonalDet/detectron2/detectron2/engine/defaults.py", line 414, in resume_or_load
    self.checkpointer.resume_or_load(self.cfg.MODEL.WEIGHTS, resume=resume)
  File "/root/anaconda3/envs/Ortho/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 225, in resume_or_load
    return self.load(path)
  File "/root/Desktop/OrthogonalDet/detectron2/detectron2/checkpoint/detection_checkpoint.py", line 62, in load
    ret = super().load(path, *args, **kwargs)
  File "/root/anaconda3/envs/Ortho/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 166, in load
    obj.load_state_dict(checkpoint.pop(key))
  File "/root/Desktop/OrthogonalDet/detectron2/detectron2/engine/defaults.py", line 507, in load_state_dict
    self._trainer.load_state_dict(state_dict["_trainer"])
  File "/root/Desktop/OrthogonalDet/detectron2/detectron2/engine/train_loop.py", line 430, in load_state_dict
    self.optimizer.load_state_dict(state_dict["optimizer"])
  File "/root/anaconda3/envs/Ortho/lib/python3.8/site-packages/torch/optim/optimizer.py", line 141, in load_state_dict
    raise ValueError("loaded state dict has a different number of "
ValueError: loaded state dict has a different number of parameter groups

So I used the following script to continue training, and still hit the same error.

#!/bin/bash

BENCHMARK=${BENCHMARK:-"M-OWODB"}  # M-OWODB or S-OWODB
PORT=${PORT:-"50210"}

if [ $BENCHMARK == "M-OWODB" ]; then
  # python train_net.py --num-gpus 1 --dist-url tcp://127.0.0.1:${PORT} --task ${BENCHMARK}/t1 --config-file configs/${BENCHMARK}/t1.yaml

  python train_net.py --num-gpus 1 --dist-url tcp://127.0.0.1:${PORT} --task ${BENCHMARK}/t2 --config-file configs/${BENCHMARK}/t2.yaml --resume MODEL.WEIGHTS output/${BENCHMARK}/model_0019999.pth

  python train_net.py --num-gpus 1 --dist-url tcp://127.0.0.1:${PORT} --task ${BENCHMARK}/t2_ft --config-file configs/${BENCHMARK}/t2_ft.yaml --resume MODEL.WEIGHTS output/${BENCHMARK}/model_0034999.pth

  python train_net.py --num-gpus 1 --dist-url tcp://127.0.0.1:${PORT} --task ${BENCHMARK}/t3 --config-file configs/${BENCHMARK}/t3.yaml --resume MODEL.WEIGHTS output/${BENCHMARK}/model_0049999.pth

  python train_net.py --num-gpus 1 --dist-url tcp://127.0.0.1:${PORT} --task ${BENCHMARK}/t3_ft --config-file configs/${BENCHMARK}/t3_ft.yaml --resume MODEL.WEIGHTS output/${BENCHMARK}/model_0064999.pth

  python train_net.py --num-gpus 1 --dist-url tcp://127.0.0.1:${PORT} --task ${BENCHMARK}/t4 --config-file configs/${BENCHMARK}/t4.yaml --resume MODEL.WEIGHTS output/${BENCHMARK}/model_0079999.pth

  python train_net.py --num-gpus 1 --dist-url tcp://127.0.0.1:${PORT} --task ${BENCHMARK}/t4_ft --config-file configs/${BENCHMARK}/t4_ft.yaml --resume MODEL.WEIGHTS output/${BENCHMARK}/model_0094999.pth
else
  python train_net.py --num-gpus 1 --dist-url tcp://127.0.0.1:${PORT} --task ${BENCHMARK}/t1 --config-file configs/${BENCHMARK}/t1.yaml

  python train_net.py --num-gpus 1 --dist-url tcp://127.0.0.1:${PORT} --task ${BENCHMARK}/t2 --config-file configs/${BENCHMARK}/t2.yaml --resume MODEL.WEIGHTS output/${BENCHMARK}/model_0039999.pth

  python train_net.py --num-gpus 1 --dist-url tcp://127.0.0.1:${PORT} --task ${BENCHMARK}/t2_ft --config-file configs/${BENCHMARK}/t2_ft.yaml --resume MODEL.WEIGHTS output/${BENCHMARK}/model_0054999.pth

  python train_net.py --num-gpus 1 --dist-url tcp://127.0.0.1:${PORT} --task ${BENCHMARK}/t3 --config-file configs/${BENCHMARK}/t3.yaml --resume MODEL.WEIGHTS output/${BENCHMARK}/model_0069999.pth

  python train_net.py --num-gpus 1 --dist-url tcp://127.0.0.1:${PORT} --task ${BENCHMARK}/t3_ft --config-file configs/${BENCHMARK}/t3_ft.yaml --resume MODEL.WEIGHTS output/${BENCHMARK}/model_0084999.pth

  python train_net.py --num-gpus 1 --dist-url tcp://127.0.0.1:${PORT} --task ${BENCHMARK}/t4 --config-file configs/${BENCHMARK}/t4.yaml --resume MODEL.WEIGHTS output/${BENCHMARK}/model_0099999.pth

  python train_net.py --num-gpus 1 --dist-url tcp://127.0.0.1:${PORT} --task ${BENCHMARK}/t4_ft --config-file configs/${BENCHMARK}/t4_ft.yaml --resume MODEL.WEIGHTS output/${BENCHMARK}/model_00114999.pth
fi

The environment information in which I run the code is

[06/20 09:19:33 detectron2]: Environment info:
-------------------------------  --------------------------------------------------------------------------------
sys.platform                     linux
Python                           3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
numpy                            1.24.4
detectron2                       0.6 @/root/Desktop/OrthogonalDet/detectron2/detectron2
Compiler                         GCC 9.4
CUDA compiler                    CUDA 11.3
detectron2 arch flags            7.5
DETECTRON2_ENV_MODULE            <not set>
PyTorch                          1.10.1+cu111 @/root/anaconda3/envs/Ortho/lib/python3.8/site-packages/torch
PyTorch debug build              False
torch._C._GLIBCXX_USE_CXX11_ABI  False
GPU available                    Yes
GPU 0                            NVIDIA GeForce RTX 2080 Ti (arch=7.5)
Driver version                   525.116.04
CUDA_HOME                        /usr/local/cuda
Pillow                           10.3.0
torchvision                      0.11.2+cu111 @/root/anaconda3/envs/Ortho/lib/python3.8/site-packages/torchvision
torchvision arch flags           3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                           0.1.5.post20221221
iopath                           0.1.9
cv2                              4.10.0
WangPingA commented 3 weeks ago

Sorry, I didn't notice this sentence:

Note that we are using an ImageNet pre-trained backbone. To switch to a DINO pre-trained backbone, please download the [model weights](https://dl.fbaipublicfiles.com/dino/dino_resnet50_pretrain/dino_resnet50_pretrain.pth) and then follow [these instructions](https://github.com/facebookresearch/detectron2/blob/main/tools/convert-torchvision-to-d2.py).

So I was probably using the ImageNet pre-trained backbone; I'll switch to the DINO pre-trained backbone and try again.

feifeiobama commented 3 weeks ago

I am not sure what is causing this problem; our code should support both ImageNet and DINO pre-trained backbones.

Hope to hear more from you soon.

WangPingA commented 3 weeks ago

Okay, thank you for your reply. I will try training again with the DINO pre-trained backbone. If the error persists, I will try to debug it and eliminate it.

WangPingA commented 2 weeks ago

Hello, I have been trying to deal with this error since your reply. My previous torch version was 2.3.0; I suspected a torch version problem, so I downgraded torch to 1.12.0, but the error remained. When I started debugging, I found that in the load_state_dict method of optimizer.py the length of groups is indeed inconsistent with the length of saved_groups, as shown in the code below.

    def load_state_dict(self, state_dict):
        r"""Loads the optimizer state.

        Args:
            state_dict (dict): optimizer state. Should be an object returned
                from a call to :meth:`state_dict`.
        """
        # deepcopy, to be consistent with module API
        state_dict = deepcopy(state_dict)
        # Validate the state_dict
        groups = self.param_groups
        saved_groups = state_dict['param_groups']

        if len(groups) != len(saved_groups):
            raise ValueError("loaded state dict has a different number of "
                             "parameter groups")
        param_lens = (len(g['params']) for g in groups)
        saved_lens = (len(g['params']) for g in saved_groups)
        if any(p_len != s_len for p_len, s_len in zip(param_lens, saved_lens)):
            raise ValueError("loaded state dict contains a parameter group "
                             "that doesn't match the size of optimizer's group")

Since I didn't know how to make the two lengths consistent, I searched online and asked kimi.ai for help, and ended up modifying the arguments of the resume_or_load method in defaults.py. So far the code runs normally, but I don't know whether this will have any side effects.

    def resume_or_load(self, resume=True):
        """
        If `resume==True` and `cfg.OUTPUT_DIR` contains the last checkpoint (defined by
        a `last_checkpoint` file), resume from the file. Resuming means loading all
        available states (eg. optimizer and scheduler) and update iteration counter
        from the checkpoint. ``cfg.MODEL.WEIGHTS`` will not be used.

        Otherwise, this is considered as an independent training. The method will load model
        weights from the file `cfg.MODEL.WEIGHTS` (but will not load other states) and start
        from iteration 0.

        Args:
            resume (bool): whether to do resume or not
        """
        # self.checkpointer.resume_or_load(self.cfg.MODEL.WEIGHTS, resume=resume)
        self.checkpointer.resume_or_load(self.cfg.MODEL.WEIGHTS, resume=False)
        if resume and self.checkpointer.has_checkpoint():
            # The checkpoint stores the training iteration that just finished, thus we start
            # at the next iteration
            self.start_iter = self.iter + 1
feifeiobama commented 2 weeks ago

Here is a potential solution: modify the last_checkpoint file in the output folder by changing the text model_final.pth to the checkpoint for a specific iteration, e.g. model_0019999.pth.
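
For example, something along these lines (the path is illustrative and assumes the M-OWODB output directory):

    # Point detectron2's last_checkpoint file at the t1 checkpoint instead of model_final.pth.
    with open("output/M-OWODB/last_checkpoint", "w") as f:
        f.write("model_0019999.pth")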

WangPingA commented 2 weeks ago

OK, I will try it. Thank you very much for your reply.

WangPingA commented 2 weeks ago

Dear author, hello! The pre-trained weights I currently use are dino_resnet50_pretrain.pkl. Because I only have one 2080 Ti, when reproducing the results I only changed IMS_PER_BATCH in the base.yaml file from 12 to 3; I did not change BASE_LR, STEPS, or MAX_ITER. I have now completed all M-OWODB tasks, which took 106 hours.

I found that the Unknown Recall50 value in my results is very low, even though every known class is detected, and I don't understand the reason for this. Looking forward to your reply, and thank you.

The result for t1 is as follows (t2_ft and t3_ft are similar to t1, also with a high A-OSE value):


    [06/26 10:09:41] d2.evaluation.evaluator INFO: Total inference time: 1:44:41.608796 (0.613378 s / iter per device, on 1 devices)
    [06/26 10:09:41] d2.evaluation.evaluator INFO: Total inference pure compute time: 1:44:16 (0.610878 s / iter per device, on 1 devices)
    [06/26 10:09:41] core.pascal_voc_evaluation INFO: Evaluating my_val using 2012 metric. Note that results do not use the official Matlab API.
    [06/26 10:09:41] core.pascal_voc_evaluation INFO: aeroplane has 545 predictions.
    [06/26 10:09:43] core.pascal_voc_evaluation INFO: bicycle has 1441 predictions.
    [06/26 10:09:43] core.pascal_voc_evaluation INFO: bird has 459 predictions.
    [06/26 10:09:43] core.pascal_voc_evaluation INFO: boat has 1055 predictions.
    [06/26 10:09:43] core.pascal_voc_evaluation INFO: bottle has 8385 predictions.
    [06/26 10:09:44] core.pascal_voc_evaluation INFO: bus has 630 predictions.
    [06/26 10:09:44] core.pascal_voc_evaluation INFO: car has 11052 predictions.
    [06/26 10:09:45] core.pascal_voc_evaluation INFO: cat has 626 predictions.
    [06/26 10:09:45] core.pascal_voc_evaluation INFO: chair has 10542 predictions.
    [06/26 10:09:46] core.pascal_voc_evaluation INFO: cow has 986 predictions.
    [06/26 10:09:46] core.pascal_voc_evaluation INFO: diningtable has 4666 predictions.
    [06/26 10:09:46] core.pascal_voc_evaluation INFO: dog has 1000 predictions.
    [06/26 10:09:47] core.pascal_voc_evaluation INFO: horse has 647 predictions.
    [06/26 10:09:47] core.pascal_voc_evaluation INFO: motorbike has 1045 predictions.
    [06/26 10:09:47] core.pascal_voc_evaluation INFO: person has 56622 predictions.
    [06/26 10:09:50] core.pascal_voc_evaluation INFO: pottedplant has 3307 predictions.
    [06/26 10:09:51] core.pascal_voc_evaluation INFO: sheep has 430 predictions.
    [06/26 10:09:51] core.pascal_voc_evaluation INFO: sofa has 2437 predictions.
    [06/26 10:09:51] core.pascal_voc_evaluation INFO: train has 637 predictions.
    [06/26 10:09:51] core.pascal_voc_evaluation INFO: tvmonitor has 843 predictions.
    [06/26 10:09:51] core.pascal_voc_evaluation INFO: truck has 2822 predictions.
    [06/26 10:09:52] core.pascal_voc_evaluation INFO: traffic light has 5683 predictions.
    [06/26 10:09:52] core.pascal_voc_evaluation INFO: fire hydrant has 311 predictions.
    [06/26 10:09:52] core.pascal_voc_evaluation INFO: stop sign has 439 predictions.
    [06/26 10:09:52] core.pascal_voc_evaluation INFO: parking meter has 633 predictions.
    [06/26 10:09:53] core.pascal_voc_evaluation INFO: bench has 1976 predictions.
    [06/26 10:09:53] core.pascal_voc_evaluation INFO: elephant has 423 predictions.
    [06/26 10:09:53] core.pascal_voc_evaluation INFO: bear has 171 predictions.
    [06/26 10:09:53] core.pascal_voc_evaluation INFO: zebra has 322 predictions.
    [06/26 10:09:53] core.pascal_voc_evaluation INFO: giraffe has 261 predictions.
    [06/26 10:09:53] core.pascal_voc_evaluation INFO: backpack has 1916 predictions.
    [06/26 10:09:54] core.pascal_voc_evaluation INFO: umbrella has 1736 predictions.
    [06/26 10:09:54] core.pascal_voc_evaluation INFO: handbag has 3367 predictions.
    [06/26 10:09:54] core.pascal_voc_evaluation INFO: tie has 939 predictions.
    [06/26 10:09:54] core.pascal_voc_evaluation INFO: suitcase has 797 predictions.
    [06/26 10:09:55] core.pascal_voc_evaluation INFO: microwave has 339 predictions.
    [06/26 10:09:55] core.pascal_voc_evaluation INFO: oven has 904 predictions.
    [06/26 10:09:55] core.pascal_voc_evaluation INFO: toaster has 291 predictions.
    [06/26 10:09:55] core.pascal_voc_evaluation INFO: sink has 1876 predictions.
    [06/26 10:09:55] core.pascal_voc_evaluation INFO: refrigerator has 778 predictions.
    [06/26 10:09:56] core.pascal_voc_evaluation INFO: frisbee has 639 predictions.
    [06/26 10:09:56] core.pascal_voc_evaluation INFO: skis has 876 predictions.
    [06/26 10:09:56] core.pascal_voc_evaluation INFO: snowboard has 610 predictions.
    [06/26 10:09:56] core.pascal_voc_evaluation INFO: sports ball has 1066 predictions.
    [06/26 10:09:56] core.pascal_voc_evaluation INFO: kite has 2356 predictions.
    [06/26 10:09:56] core.pascal_voc_evaluation INFO: baseball bat has 633 predictions.
    [06/26 10:09:57] core.pascal_voc_evaluation INFO: baseball glove has 966 predictions.
    [06/26 10:09:57] core.pascal_voc_evaluation INFO: skateboard has 733 predictions.
    [06/26 10:09:57] core.pascal_voc_evaluation INFO: surfboard has 1530 predictions.
    [06/26 10:09:57] core.pascal_voc_evaluation INFO: tennis racket has 516 predictions.
    [06/26 10:09:57] core.pascal_voc_evaluation INFO: banana has 1859 predictions.
    [06/26 10:09:58] core.pascal_voc_evaluation INFO: apple has 3382 predictions.
    [06/26 10:09:58] core.pascal_voc_evaluation INFO: sandwich has 368 predictions.
    [06/26 10:09:58] core.pascal_voc_evaluation INFO: orange has 1299 predictions.
    [06/26 10:09:58] core.pascal_voc_evaluation INFO: broccoli has 1476 predictions.
    [06/26 10:09:58] core.pascal_voc_evaluation INFO: carrot has 1679 predictions.
    [06/26 10:09:59] core.pascal_voc_evaluation INFO: hot dog has 115 predictions.
    [06/26 10:09:59] core.pascal_voc_evaluation INFO: pizza has 1245 predictions.
    [06/26 10:09:59] core.pascal_voc_evaluation INFO: donut has 1063 predictions.
    [06/26 10:09:59] core.pascal_voc_evaluation INFO: cake has 5049 predictions.
    [06/26 10:10:00] core.pascal_voc_evaluation INFO: bed has 764 predictions.
    [06/26 10:10:00] core.pascal_voc_evaluation INFO: toilet has 602 predictions.
    [06/26 10:10:00] core.pascal_voc_evaluation INFO: laptop has 1005 predictions.
    [06/26 10:10:00] core.pascal_voc_evaluation INFO: mouse has 995 predictions.
    [06/26 10:10:00] core.pascal_voc_evaluation INFO: remote has 2313 predictions.
    [06/26 10:10:01] core.pascal_voc_evaluation INFO: keyboard has 1351 predictions.
    [06/26 10:10:01] core.pascal_voc_evaluation INFO: cell phone has 1879 predictions.
    [06/26 10:10:01] core.pascal_voc_evaluation INFO: book has 27021 predictions.
    [06/26 10:10:02] core.pascal_voc_evaluation INFO: clock has 887 predictions.
    [06/26 10:10:02] core.pascal_voc_evaluation INFO: vase has 3325 predictions.
    [06/26 10:10:02] core.pascal_voc_evaluation INFO: scissors has 497 predictions.
    [06/26 10:10:02] core.pascal_voc_evaluation INFO: teddy bear has 887 predictions.
    [06/26 10:10:03] core.pascal_voc_evaluation INFO: hair drier has 855 predictions.
    [06/26 10:10:03] core.pascal_voc_evaluation INFO: toothbrush has 534 predictions.
    [06/26 10:10:03] core.pascal_voc_evaluation INFO: wine glass has 3996 predictions.
    [06/26 10:10:03] core.pascal_voc_evaluation INFO: cup has 8249 predictions.
    [06/26 10:10:04] core.pascal_voc_evaluation INFO: fork has 1932 predictions.
    [06/26 10:10:04] core.pascal_voc_evaluation INFO: knife has 3931 predictions.
    [06/26 10:10:04] core.pascal_voc_evaluation INFO: spoon has 3467 predictions.
    [06/26 10:10:04] core.pascal_voc_evaluation INFO: bowl has 4885 predictions.
    [06/26 10:10:05] core.pascal_voc_evaluation INFO: unknown has 381 predictions.
    [06/26 10:10:05] core.pascal_voc_evaluation INFO: Wilderness Impact: {0.1: {50: 0.005239933811362382}, 0.2: {50: 0.006680919294494923}, 0.3: {50: 0.007796127085254422}, 0.4: {50: 0.010296517188994544}, 0.5: {50: 0.012150084939803529}, 0.6: {50: 0.014342787662224868}, 0.7: {50: 0.01633074540100203}, 0.8: {50: 0.015584239130434782}, 0.9: {50: 0.012673846919059551}}
    [06/26 10:10:05] core.pascal_voc_evaluation INFO: avg_precision: {0.1: {50: 0.09776536312849161}, 0.2: {50: 0.09776536312849161}, 0.3: {50: 0.09776536312849161}, 0.4: {50: 0.09776536312849161}, 0.5: {50: 0.09776536312849161}, 0.6: {50: 0.09776536312849161}, 0.7: {50: 0.09776536312849161}, 0.8: {50: 0.09776536312849161}, 0.9: {50: 0.09776536312849161}}
    [06/26 10:10:05] core.pascal_voc_evaluation INFO: Absolute OSE (total_num_unk_det_as_known): {50: 22414.0}
    [06/26 10:10:05] core.pascal_voc_evaluation INFO: total_num_unk 15606
    [06/26 10:10:05] core.pascal_voc_evaluation INFO: ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'truck', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'bed', 'toilet', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'unknown']
    [06/26 10:10:05] core.pascal_voc_evaluation INFO: AP50: ['66.8', '57.8', '46.4', '32.0', '38.9', '65.8', '60.5', '77.9', '26.2', '63.9', '27.7', '74.7', '73.3', '59.6', '61.0', '26.5', '62.4', '56.7', '73.1', '61.9', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', '0.0']
    [06/26 10:10:05] core.pascal_voc_evaluation INFO: Precisions50: ['44.9', '29.1', '75.8', '24.0', '12.7', '45.8', '20.7', '67.0', '13.3', '27.0', '11.6', '55.2', '53.4', '35.4', '22.1', '14.9', '54.6', '16.9', '47.8', '50.7', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '9.2']
    [06/26 10:10:05] core.pascal_voc_evaluation INFO: Recall50: ['71.6', '71.3', '49.5', '52.2', '72.7', '75.3', '79.0', '85.5', '54.7', '81.3', '60.6', '85.4', '82.3', '69.6', '86.8', '60.6', '68.9', '82.8', '83.9', '73.6', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', '0.2']
    [06/26 10:10:05] core.pascal_voc_evaluation INFO: Current class AP50: 55.65732329527972
    [06/26 10:10:05] core.pascal_voc_evaluation INFO: Current class Precisions50: 36.15633967132126
    [06/26 10:10:05] core.pascal_voc_evaluation INFO: Current class Recall50: 72.37434007771567
    [06/26 10:10:05] core.pascal_voc_evaluation INFO: Known AP50: 55.65732329527972
    [06/26 10:10:05] core.pascal_voc_evaluation INFO: Known Precisions50: 36.15633967132126
    [06/26 10:10:05] core.pascal_voc_evaluation INFO: Known Recall50: 72.37434007771567
    [06/26 10:10:05] core.pascal_voc_evaluation INFO: Unknown AP50: 0.026885526487556526
    [06/26 10:10:05] core.pascal_voc_evaluation INFO: Unknown Precisions50: 9.186351706036746
    [06/26 10:10:05] core.pascal_voc_evaluation INFO: Unknown Recall50: 0.22427271562219658
    [06/26 10:10:05] d2.engine.defaults INFO: Evaluation results for my_val in csv format:
    [06/26 10:10:05] d2.evaluation.testing INFO: copypaste: Task: bbox
    [06/26 10:10:05] d2.evaluation.testing INFO: copypaste: AP,AP50
    [06/26 10:10:05] d2.evaluation.testing INFO: copypaste: nan,nan
feifeiobama commented 2 weeks ago

At first glance, the reason is that the "prediction orthogonalization" design in Section 3.4 depends on the batch size. Estimating the correlation coefficient with a small batch size of 3 and using it as a training objective may lead to unwanted effects.

While I will further investigate this problem, you can temporarily remove this design and see if the issue is solved on t1.
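
As a rough illustration of the batch-size sensitivity (a standalone numpy sketch, not the loss code in this repository): correlation coefficients estimated from batches of 3 samples fluctuate far more than those estimated from batches of 12.

    import numpy as np

    rng = np.random.default_rng(0)
    # Two truly uncorrelated variables.
    x, y = rng.standard_normal(100_000), rng.standard_normal(100_000)

    for batch_size in (3, 12):
        corrs = [np.corrcoef(x[i:i + batch_size], y[i:i + batch_size])[0, 1]
                 for i in range(0, 6000, batch_size)]
        # The spread of the per-batch correlation estimates shrinks as the batch grows.
        print(batch_size, float(np.std(corrs)))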

WangPingA commented 1 week ago

After receiving your suggestion, I checked the relevant code. I think this is controlled by a few config parameters in config.py:

    # Disentanglement
    cfg.MODEL.DISENTANGLED = 2  # 0: RandBox, 1: separate head, 2: feature orthogonality
    cfg.MODEL.DECORR_WEIGHT = 1.  # weight for prediction decorrelation loss

Among them, if cfg.MODEL.DISENTANGLED is set to 0, both the "feature orthogonalization" and "prediction orthogonalization" designs are removed; if it is set to 1, only the feature orthogonalization design is removed.

So according to your suggestion, I should set cfg.MODEL.DISENTANGLED to 1 to remove the "feature orthogonalization" design (Section 3.3 in the paper is feature orthogonalization).

feifeiobama commented 1 week ago

Sorry, I made a typo; this should be Section 3.4. So you should set cfg.MODEL.DISENTANGLED to 2 and cfg.MODEL.DECORR_WEIGHT to 0.
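
For instance, following the config.py snippet quoted above (the same override could also go into the task yaml or be passed as command-line opts):

    # Keep the feature orthogonality design, but zero out the batch-size-sensitive
    # prediction decorrelation loss.
    cfg.MODEL.DISENTANGLED = 2
    cfg.MODEL.DECORR_WEIGHT = 0.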

WangPingA commented 1 week ago

OK, thanks for the reply. I'll try it soon.

WangPingA commented 1 week ago

Following your suggestion, I have now obtained normal detection results on task t1, but the mAP value is only 52.2. I guess this may be related to the fact that I only used one 2080 Ti with the batch size set to 3. When I set cfg.MODEL.DISENTANGLED to 2 and cfg.MODEL.DECORR_WEIGHT to 0, the result was as follows:

[06/28 01:06:52 core.pascal_voc_evaluation]: bowl has 1 predictions.
[06/28 01:06:52 core.pascal_voc_evaluation]: unknown has 313494 predictions.
[06/28 01:07:01 core.pascal_voc_evaluation]: Wilderness Impact: {0.1: {50: 0.018734727124626663}, 0.2: {50: 0.02594608277433236}, 0.3: {50: 0.03171696454731907}, 0.4: {50: 0.03628515257714335}, 0.5: {50: 0.03778690504952887}, 0.6: {50: 0.03990561005964022}, 0.7: {50: 0.0357066700784715}, 0.8: {50: 0.02417518663660048}, 0.9: {50: 0.025808214094264155}}
[06/28 01:07:02 core.pascal_voc_evaluation]: avg_precision: {0.1: {50: 0.04821323779225994}, 0.2: {50: 0.030652733308452338}, 0.3: {50: 0.014749827599417668}, 0.4: {50: 0.014749827599417668}, 0.5: {50: 0.014749827599417668}, 0.6: {50: 0.014749827599417668}, 0.7: {50: 0.014749827599417668}, 0.8: {50: 0.014749827599417668}, 0.9: {50: 0.014749827599417668}}
[06/28 01:07:02 core.pascal_voc_evaluation]: Absolute OSE (total_num_unk_det_as_known): {50: 5097.0}
[06/28 01:07:02 core.pascal_voc_evaluation]: total_num_unk 15606
[06/28 01:07:02 core.pascal_voc_evaluation]: ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'truck', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'bed', 'toilet', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'unknown']
[06/28 01:07:02 core.pascal_voc_evaluation]: AP50: ['72.9', '54.3', '58.0', '34.5', '27.6', '67.0', '54.0', '81.1', '18.6', '48.9', '19.5', '76.3', '74.2', '60.2', '51.5', '22.2', '57.3', '42.6', '74.9', '49.8', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '1.1']
[06/28 01:07:02 core.pascal_voc_evaluation]: Precisions50: ['24.4', '15.4', '22.2', '9.9', '7.4', '25.0', '11.7', '44.4', '8.2', '11.3', '12.9', '31.8', '21.5', '21.0', '10.0', '7.7', '12.6', '16.2', '34.2', '18.6', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '1.5']
[06/28 01:07:02 core.pascal_voc_evaluation]: Recall50: ['84.5', '73.7', '73.1', '63.6', '58.6', '81.7', '73.4', '91.4', '50.3', '85.2', '51.0', '90.9', '89.2', '77.8', '80.5', '66.8', '84.8', '74.6', '86.8', '79.3', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '29.6']
[06/28 01:07:02 core.pascal_voc_evaluation]: Current class AP50: 52.268991883938746
[06/28 01:07:02 core.pascal_voc_evaluation]: Current class Precisions50: 18.32555749336644
[06/28 01:07:02 core.pascal_voc_evaluation]: Current class Recall50: 75.86719975619374
[06/28 01:07:02 core.pascal_voc_evaluation]: Known AP50: 52.268991883938746
[06/28 01:07:02 core.pascal_voc_evaluation]: Known Precisions50: 18.32555749336644
[06/28 01:07:02 core.pascal_voc_evaluation]: Known Recall50: 75.86719975619374
[06/28 01:07:02 core.pascal_voc_evaluation]: Unknown AP50: 1.1333977429290294
[06/28 01:07:02 core.pascal_voc_evaluation]: Unknown Precisions50: 1.473712415548623
[06/28 01:07:02 core.pascal_voc_evaluation]: Unknown Recall50: 29.603998462129947
[06/28 01:07:02 d2.engine.defaults]: Evaluation results for my_val in csv format:
[06/28 01:07:02 d2.evaluation.testing]: copypaste: Task: bbox
[06/28 01:07:02 d2.evaluation.testing]: copypaste: AP,AP50
[06/28 01:07:02 d2.evaluation.testing]: copypaste: 12.9199,12.9199
feifeiobama commented 1 week ago

Your results using a batch size of 3 appear to be mixed, as the K-mAP is lower (52.3 compared to 61.3) while the U-Recall is higher (29.6 compared to 18.2). A possible remedy is to adjust the threshold (https://github.com/feifeiobama/OrthogonalDet/blob/main/core/detector.py#L407) by decreasing the seen object threshold and increasing the unseen object threshold. However, I expect that the final performance will still be slightly degraded :(

WangPingA commented 1 week ago

Thank you for your reply. This afternoon I applied for four 2080 Ti GPUs from my advisor. I will run a new experiment with the original settings, hoping to achieve results similar to the paper.

WangPingA commented 1 week ago

Excuse me, I have an update on training time. When I changed the batch size to 12, set cfg.MODEL.DECORR_WEIGHT to 1, and trained with 4 GPUs, I found that the expected training time on task t1 was longer than with my previous settings (batch size 3, 1 GPU), and the same holds for task t2. This means I cannot complete the entire training within the 36 hours mentioned in the paper. Is this normal?

# batch size 3, 1 GPU, cfg.MODEL.DISENTANGLED = 2, cfg.MODEL.DECORR_WEIGHT = 1
# t1
[06/21 18:34:01] d2.engine.hooks INFO: Total training time: 4:19:04 (0:01:07 on hooks)
# t2
[06/22 03:50:12] d2.engine.hooks INFO: Total training time: 7:27:30 (0:01:49 on hooks)

# batch size 12, 4 GPUs, cfg.MODEL.DISENTANGLED = 2, cfg.MODEL.DECORR_WEIGHT = 1
# t1
[06/29 04:56:24] d2.engine.hooks INFO: Total training time: 10:40:59 (0:06:18 on hooks)
# t2: still running, but it's expected to take a long time.
[06/29 06:05:51 d2.utils.events]:  eta: 18:29:41  iter: 299  total_loss: 9.183  loss_ce: 0.5848  loss_bbox: 0.2772  loss_giou: 0.445  loss_nc_ce: 0.06199  loss_decorr: 0.01321  loss_ce_0: 0.7616  loss_bbox_0: 0.4829  loss_giou_0: 0.7449  loss_nc_ce_0: 0.06516  loss_decorr_0: 0.01361  loss_ce_1: 0.6824  loss_bbox_1: 0.3459  loss_giou_1: 0.5079  loss_nc_ce_1: 0.07539  loss_decorr_1: 0.02874  loss_ce_2: 0.6102  loss_bbox_2: 0.3032  loss_giou_2: 0.4782  loss_nc_ce_2: 0.05929  loss_decorr_2: 0.01702  loss_ce_3: 0.5889  loss_bbox_3: 0.284  loss_giou_3: 0.4625  loss_nc_ce_3: 0.05794  loss_decorr_3: 0.007577  loss_ce_4: 0.577  loss_bbox_4: 0.2739  loss_giou_4: 0.455  loss_nc_ce_4: 0.08406  loss_decorr_4: 0.01896    time: 1.9392  last_time: 1.9620  data_time: 0.1743  last_data_time: 0.0164   lr: 7.6255e-06  max_mem: 7621M

Fortunately, my training results on task t1 using 4 GPUs were close to the accuracy of the paper, although it took longer.

        [06/29 05:54:48 core.pascal_voc_evaluation]: unknown has 432569 predictions.
        [06/29 05:55:02 core.pascal_voc_evaluation]: Wilderness Impact: {0.1: {50: 0.018606024808033077}, 0.2: {50: 0.028391608391608394}, 0.3: {50: 0.034031866989954966}, 0.4: {50: 0.03831709893205023}, 0.5: {50: 0.0418150786583556}, 0.6: {50: 0.04116118829775217}, 0.7: {50: 0.04084221207508879}, 0.8: {50: 0.033769953006525265}, 0.9: {50: 0.031485133009204226}}
        [06/29 05:55:03 core.pascal_voc_evaluation]: avg_precision: {0.1: {50: 0.0495697183322219}, 0.2: {50: 0.01976930531889961}, 0.3: {50: 0.01016352736360967}, 0.4: {50: 0.01016352736360967}, 0.5: {50: 0.01016352736360967}, 0.6: {50: 0.01016352736360967}, 0.7: {50: 0.01016352736360967}, 0.8: {50: 0.01016352736360967}, 0.9: {50: 0.01016352736360967}}
        [06/29 05:55:03 core.pascal_voc_evaluation]: Absolute OSE (total_num_unk_det_as_known): {50: 4677.0}
        [06/29 05:55:03 core.pascal_voc_evaluation]: total_num_unk 15606
        [06/29 05:55:03 core.pascal_voc_evaluation]: ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor', 'truck', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'bed', 'toilet', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'unknown']
        [06/29 05:55:03 core.pascal_voc_evaluation]: AP50: ['82.8', '61.4', '66.5', '50.8', '35.8', '73.9', '61.1', '86.9', '27.3', '71.7', '26.5', '83.7', '76.0', '69.6', '59.9', '31.6', '73.5', '56.9', '84.6', '63.8', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '1.3']
        [06/29 05:55:03 core.pascal_voc_evaluation]: Precisions50: ['24.0', '17.0', '17.2', '11.0', '9.0', '24.6', '19.8', '47.8', '10.9', '17.7', '16.7', '38.2', '29.8', '32.9', '15.6', '10.2', '18.3', '20.1', '40.9', '18.5', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '1.0']
        [06/29 05:55:03 core.pascal_voc_evaluation]: Recall50: ['91.9', '76.9', '78.2', '73.7', '66.1', '87.9', '76.2', '93.7', '57.5', '88.9', '56.0', '92.1', '90.9', '81.3', '83.8', '72.6', '92.2', '81.5', '91.3', '86.8', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '28.2']
        [06/29 05:55:03 core.pascal_voc_evaluation]: Current class AP50: 62.205092589307796
        [06/29 05:55:03 core.pascal_voc_evaluation]: Current class Precisions50: 22.01767097874012
        [06/29 05:55:03 core.pascal_voc_evaluation]: Current class Recall50: 80.9655384168531
        [06/29 05:55:03 core.pascal_voc_evaluation]: Known AP50: 62.205092589307796
        [06/29 05:55:03 core.pascal_voc_evaluation]: Known Precisions50: 22.01767097874012
        [06/29 05:55:03 core.pascal_voc_evaluation]: Known Recall50: 80.9655384168531
        [06/29 05:55:03 core.pascal_voc_evaluation]: Unknown AP50: 1.2767929329234504
        [06/29 05:55:03 core.pascal_voc_evaluation]: Unknown Precisions50: 1.0162540542664869
        [06/29 05:55:03 core.pascal_voc_evaluation]: Unknown Recall50: 28.168653082147895


feifeiobama commented 1 week ago

The first thing I notice is that iter: 299 in your t2 log is not normal. The training iteration should continue from t1. You can try to modify the last_checkpoint file as mentioned above, or start with an empty output folder. This should reduce the overall runtime significantly.

Another thing is that your training time on t1 is longer than mine (10:40:59 compared to 7:11:59). I do not know how to fix this, as it may be related to server specs or other issues.

WangPingA commented 1 week ago

So you mean that the training iteration count of t2 should continue from t1, i.e., start from 20,000?

feifeiobama commented 1 week ago

Yes, you are correct.
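
As a quick sanity check (my own snippet; it assumes the checkpoint stores an iteration field, which detectron2 trainers normally save):

    import torch

    ckpt = torch.load("output/M-OWODB/model_0019999.pth", map_location="cpu")
    print(ckpt.get("iteration"))  # expected to print 19999 at the end of t1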

WangPingA commented 1 week ago

Okay, I'll try it.