How to optimize CoMFormer on the base classes?

DongSky commented 1 year ago

Hi Fabio, thank you for publishing the training code. I have a question about it. With the training scripts I was able to run the training steps for the newly added classes, but I'm not sure which script trains the base classes. Could you provide a more detailed README describing the whole training procedure?

Best regards,
Bowen

Also, I noticed some comments about "offline training", but I think that script performs standard training on all classes.

DongSky commented 1 year ago

Solved.

YananGu commented 1 year ago

Hi @DongSky, I have the same problem. Could you tell me how you solved it? Thanks.

fcdl94 commented 1 year ago

Hello, and sorry for not answering sooner. To train the base classes, it is enough to set CONT.TASK 0 in the config file (or in the scripts).

Please let me know whether you are able to replicate my results with it. Thank you.

YananGu commented 1 year ago

Hi @fcdl94, could you give me an example of how to train the base classes? I trained them with

python train_inc.py --num-gpus 4 --config-file configs/ade20k/semantic-segmentation/maskformer2_R101_bs16_90k.yaml CONT.TASK 0

but I got the following errors:

WARNING [06/14 10:10:57 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
  stem.fc.{bias, weight}
[06/14 10:10:57 d2.engine.train_loop]: Starting training from iteration 0
/mnt/project-ext/yy/anaconda3/envs/mask2former/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
/mnt/project-ext/yy/anaconda3/envs/mask2former/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
ERROR [06/14 10:11:18 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/mnt/project-ext/yy/detectron2/detectron2/engine/train_loop.py", line 155, in train
    self.run_step()
  File "/mnt/project-ext/yy/CoMFormer/train_inc.py", line 230, in run_step
    self._trainer.run_step()
  File "/mnt/project-ext/yy/detectron2/detectron2/engine/train_loop.py", line 322, in run_step
    losses.backward()
AttributeError: 'int' object has no attribute 'backward'
[06/14 10:11:18 d2.engine.hooks]: Total training time: 0:00:21 (0:00:00 on hooks)
[06/14 10:11:18 d2.utils.events]: iter: 0 lr: N/A max_mem: 6743M
Traceback (most recent call last):
  File "train_inc.py", line 730, in <module>
    launch(
  File "/mnt/project-ext/yy/detectron2/detectron2/engine/launch.py", line 69, in launch
    mp.start_processes(
  File "/mnt/project-ext/yy/anaconda3/envs/mask2former/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/mnt/project-ext/yy/anaconda3/envs/mask2former/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/mnt/project-ext/yy/anaconda3/envs/mask2former/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/mnt/project-ext/yy/detectron2/detectron2/engine/launch.py", line 123, in _distributed_worker
    main_func(*args)
  File "/mnt/project-ext/yy/CoMFormer/train_inc.py", line 721, in main
    ret = trainer.train()
  File "/mnt/project-ext/yy/CoMFormer/train_inc.py", line 219, in train
    super().train(self.start_iter, self.max_iter)
  File "/mnt/project-ext/yy/detectron2/detectron2/engine/train_loop.py", line 155, in train
    self.run_step()
  File "/mnt/project-ext/yy/CoMFormer/train_inc.py", line 230, in run_step
    self._trainer.run_step()
  File "/mnt/project-ext/yy/detectron2/detectron2/engine/train_loop.py", line 322, in run_step
    losses.backward()
AttributeError: 'int' object has no attribute 'backward'

fcdl94 commented 1 year ago

Unfortunately, you need to configure all the parameters, not only CONT.TASK. Here is an edited version of scripts/ade.sh:

#!/bin/bash

cfg_file=configs/ade20k/semantic-segmentation/maskformer2_R101_bs16_90k.yaml
base=ade_ss

cont_args="CONT.BASE_CLS 100 CONT.INC_CLS 50 CONT.MODE overlap SEED 42"
task=mya_100-50-ov

name=MxF
meth_args="MODEL.MASK_FORMER.TEST.MASK_BG False MODEL.MASK_FORMER.PER_PIXEL False MODEL.MASK_FORMER.SOFTMASK True MODEL.MASK_FORMER.FOCAL True"

comm_args="OUTPUT_DIR ${base} ${meth_args} ${cont_args}"
inc_args="CONT.TASK 0"

## Train base classes
python train_inc.py --num-gpus 4 --config-file ${cfg_file} ${comm_args} ${inc_args} NAME ${name}

## Train step 1
inc_args="CONT.TASK 1 CONT.WEIGHTS ${base}/${task}/${name}/step0/model_final.pth SOLVER.MAX_ITER 20000 SOLVER.BASE_LR 0.00005"
python train_inc.py --num-gpus 4 --config-file ${cfg_file} ${comm_args} ${inc_args} NAME ${name}_PSEUDO_T2_UKD1Rew CONT.DIST.PSEUDO True CONT.DIST.KD_WEIGHT 0.5 CONT.DIST.UKD True CONT.DIST.KD_REW True
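The failing command earlier in the thread passed only CONT.TASK 0, while the base-step call in the script above also receives CONT.BASE_CLS, CONT.INC_CLS and CONT.MODE, and step 1 reads the step-0 checkpoint. A minimal pre-flight sketch in bash can catch both mistakes before a 4-GPU job is launched. The helper names `require_opts` and `require_file` are hypothetical (not part of CoMFormer), the list of keys to check is an assumption drawn from comparing the two commands rather than a documented requirement, and the checkpoint path is copied from the script and may not match where train_inc.py actually writes its output.

```shell
# Hypothetical pre-flight helpers; NOT part of the CoMFormer repository.

# require_opts OPTS KEY...
# Succeeds only if every KEY appears as a whole space-delimited token in OPTS,
# a detectron2-style override string such as "CONT.TASK 0 CONT.BASE_CLS 100".
# It checks presence only; it does not validate the values.
require_opts() {
  local opts=" $1 " key missing=0
  shift
  for key in "$@"; do
    case "$opts" in
      *" $key "*) ;;                                  # key present as a token
      *) echo "missing override: $key" >&2; missing=1 ;;
    esac
  done
  return "$missing"
}

# require_file PATH
# Succeeds only if PATH is an existing regular file (e.g. the step-0 checkpoint).
require_file() {
  if [ ! -f "$1" ]; then
    echo "checkpoint not found: $1" >&2
    return 1
  fi
}
```

For example, `require_opts "${comm_args} ${inc_args}" CONT.TASK CONT.BASE_CLS CONT.INC_CLS CONT.MODE || exit 1` could go right before the base-step `python train_inc.py` line, and `require_file "${base}/${task}/${name}/step0/model_final.pth" || exit 1` right before step 1. Matching whole tokens rather than substrings keeps a key such as CONT.TASK from being satisfied by a longer key that merely contains it.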

fcdl94 / CoMFormer, issue #1