Bunch of Issues - Githubissues

satpalsr commented 1 year ago

Hey @Hi-FT @JacobYuan7, I found a bunch of issues. Do you plan on fixing them anytime soon?

I installed with

conda create -n erd python=3.8
conda activate erd
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.2 -c pytorch
pip install mmcv-full==1.2.7 -f https://download.openmmlab.com/mmcv/dist/cu102/torch1.7/index.html
pip install -e .

1. fcos_head_tune and fcos_head_incre were missing; I found them in build/lib/mmdet.
2. Received "ResNet: __init__() got an unexpected keyword argument 'init_cfg'". I could not figure out how to resolve it, so I switched to build/lib/mmdet.

3. Copied some missing files, but then got:


Traceback (most recent call last):
  File "/home/anaconda3/envs/erd/lib/python3.8/site-packages/mmcv/utils/registry.py", line 179, in build_from_cfg
    return obj_cls(**args)
  File "/home/Projects/ERD/mmdet/models/dense_heads/gfl_head_incre.py", line 41, in __init__
    self.loss_ld = build_loss(loss_ld)
  File "/home/Projects/ERD/mmdet/models/builder.py", line 64, in build_loss
    return build(cfg, LOSSES)
  File "/home/Projects/ERD/mmdet/models/builder.py", line 34, in build
    return build_from_cfg(cfg, registry, default_args)
  File "/home/anaconda3/envs/erd/lib/python3.8/site-packages/mmcv/utils/registry.py", line 171, in build_from_cfg
    raise KeyError(
KeyError: 'KnowledgeDistillationKLDivLoss is not in the loss registry'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/anaconda3/envs/erd/lib/python3.8/site-packages/mmcv/utils/registry.py", line 179, in build_from_cfg
    return obj_cls(**args)
  File "/home/Projects/ERD/mmdet/models/detectors/gfl_incre.py", line 34, in __init__
    super().__init__(backbone, neck, bbox_head, train_cfg,
  File "/home/Projects/ERD/mmdet/models/detectors/single_stage.py", line 30, in __init__
    self.bbox_head = build_head(bbox_head)
  File "/home/Projects/ERD/mmdet/models/builder.py", line 59, in build_head
    return build(cfg, HEADS)
  File "/home/Projects/ERD/mmdet/models/builder.py", line 34, in build
    return build_from_cfg(cfg, registry, default_args)
  File "/home/anaconda3/envs/erd/lib/python3.8/site-packages/mmcv/utils/registry.py", line 182, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
KeyError: "GFLHeadIncre: 'KnowledgeDistillationKLDivLoss is not in the loss registry'"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train.py", line 187, in <module>
    main()
  File "tools/train.py", line 158, in main
    model = build_detector(
  File "/home/Projects/ERD/mmdet/models/builder.py", line 77, in build_detector
    return build(cfg, DETECTORS, dict(train_cfg=train_cfg, test_cfg=test_cfg))
  File "/home/Projects/ERD/mmdet/models/builder.py", line 34, in build
    return build_from_cfg(cfg, registry, default_args)
  File "/home/anaconda3/envs/erd/lib/python3.8/site-packages/mmcv/utils/registry.py", line 182, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
KeyError: 'GFLIncre: "GFLHeadIncre: \'KnowledgeDistillationKLDivLoss is not in the loss registry\'"'


4. Decided to use latest versions and copied some required dense head and detector files

conda create -n mmdet python=3.8
conda activate mmdet
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge

pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10/index.html

git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
pip install -e .


But then it compared the [original number of classes](https://github.com/Hi-FT/ERD/blob/af04fee67b49716856578dc440dc814b7122217c/configs/gfl_incre/gfl_r50_fpn_1x_coco_first_40_incre_last_40_cats.py#L10) with the number of classes in the new dataset (i.e. 40 versus 80) and raised an error. I just changed the original number of classes to 80 for the time being.

5. [Can't add tuples so converted them to list](https://github.com/Hi-FT/ERD/blob/af04fee67b49716856578dc440dc814b7122217c/mmdet/models/detectors/gfl_incre.py#L225)

6. The values at the location linked in point 5 are more than required, so I used outs[:2] only.

7. [Assertion error](https://github.com/Hi-FT/ERD/blob/af04fee67b49716856578dc440dc814b7122217c/mmdet/models/dense_heads/gfl_head_incre.py#L190)
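
A KeyError like the one in item 3 is what a name-to-class registry raises when the module defining the class was never imported, so its registration decorator never ran. The sketch below imitates that mechanism with a toy registry. `ToyRegistry` and the loss class body are hypothetical stand-ins written for illustration, not mmcv's or ERD's actual code.

```python
class ToyRegistry:
    """Minimal stand-in for a name -> class registry (hypothetical, not mmcv's)."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register_module(self):
        def decorator(cls):
            self._modules[cls.__name__] = cls
            return cls
        return decorator

    def build(self, cfg):
        args = dict(cfg)  # copy so the caller's config is left untouched
        obj_type = args.pop('type')
        if obj_type not in self._modules:
            raise KeyError(f'{obj_type} is not in the {self.name} registry')
        return self._modules[obj_type](**args)


LOSSES = ToyRegistry('loss')
cfg = dict(type='KnowledgeDistillationKLDivLoss', loss_weight=0.25)

# Before the defining module is imported, the name is unknown to the registry.
try:
    LOSSES.build(cfg)
except KeyError as err:
    first_error = str(err)

# Importing the module runs the decorator, which registers the class.
@LOSSES.register_module()
class KnowledgeDistillationKLDivLoss:
    def __init__(self, loss_weight=1.0):
        self.loss_weight = loss_weight


loss = LOSSES.build(cfg)
```

In a real checkout this would probably mean checking that the file defining the loss exists under mmdet/models/losses and is imported from that package's __init__.py, and that the installed mmdet is the one being edited.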

Thanks

afeiJ commented 1 year ago

@satpalsr I had the same problem. Have you solved it yet?

satpalsr commented 1 year ago

@afeiJ No. Many changes are required.

jingong commented 1 year ago

I also have the same problem. Have you solved it?

satpalsr commented 1 year ago

No, even if I solve one issue, there's another one waiting. I could not find enough time to solve all of them. Need help from @Hi-FT and @JacobYuan7

Parsifal133 commented 1 year ago

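It runs after installing the latest mmdetection. I am now running the incremental training config gfl_r50_fpn_1x_coco_first_40_incre_last_40_cats.py, but the loss becomes NaN halfway through training.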

xinlong007 commented 1 year ago

@Parsifal133 Does it happen repeatedly? I ran into it when training the earlier model, and training once more fixed it. When training gfl_r50_fpn_1x_coco_first_40_incre_last_40_cats.py, did you hit an error like the following?

  File "/home/ERD/mmdet/models/detectors/gfl_incre.py", line 148, in
    cls_score.permute(0, 2, 3, 1).reshape(
RuntimeError: number of dims don't match in permute

Parsifal133 commented 1 year ago

The dimensions of cls_scores do not match. This code permutes the dimensions of the cls_scores output by the original model, and the later cat_cls_scores = torch.cat(cat_cls_scores, dim=1) concatenates the reshaped class scores. I suggest printing the size of the current cls_score:

cls_score.size()  # should look like (batchsize, ori_cls_num, H, W)
cls_score.permute(0, 2, 3, 1).size()  # should look like (batchsize, H, W, ori_cls_num)
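
The shape check suggested here can be sketched without any framework. `permuted_shape` below is a hypothetical helper, not part of ERD or PyTorch; with PyTorch one would pass `tuple(cls_score.shape)` for each element of cls_scores and look for the entry that is not 4-D.

```python
def permuted_shape(shape, order=(0, 2, 3, 1)):
    """Return the shape permute(*order) would produce, or raise if the ranks differ."""
    if len(shape) != len(order):
        raise ValueError(
            f'tensor has {len(shape)} dims {tuple(shape)}, '
            f'but permute was given {len(order)} indices')
    return tuple(shape[i] for i in order)


# Expected case: (batch, ori_cls_num, H, W) -> (batch, H, W, ori_cls_num)
ok = permuted_shape((2, 40, 100, 152))

# Failing case: a 3-D value where a 4-D score map was expected
try:
    permuted_shape((40, 100, 152))
    mismatch = None
except ValueError as err:
    mismatch = str(err)
```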

Parsifal133 commented 1 year ago

Maybe you should try the latest mmdetection, mmcv and torch.

xinlong007 commented 1 year ago

@Parsifal133 Could you share your CUDA, mmcv, mmdetection and PyTorch versions? Mine are CUDA 11.0, mmcv 1.2.7, mmdetection 2.10.0, PyTorch 1.7.1.

Parsifal133 commented 1 year ago

2023-01-06 10:24:44,896 - mmdet - INFO - Environment info:

sys.platform: linux
Python: 3.8.8 (default, Apr 13 2021, 19:58:26) [GCC 7.3.0]
CUDA available: True
GPU 0: GeForce GTX 1080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 9.0, V9.0.17
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.12.1
PyTorch compiling details: PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) oneAPI Math Kernel Library Version 2021.2-Product Build 20210312 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 10.2
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
CuDNN 7.6.5
Magma 2.5.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.13.1
OpenCV: 4.6.0
MMCV: 1.7.1
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMDetection: 2.26.0+

jingong commented 1 year ago

When training gfl_r50_fpn_1x_coco_first_40_tune_last_40_cats.py, did the loss ever become NaN?

xinlong007 commented 1 year ago

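@Parsifal133 Thanks for explaining your environment. Later I also kept running into the NaN problem. At that time I was training on the original COCO 2017 data. I then processed the data to keep only the last 40 classes and relabeled the categories starting from 1, after which the training loss had valid values. @jingong

Addendum, 2023.02.09: in practice this approach turned out to be incorrect. Please do not use it as a reference for building the last-40-class data. Sorry.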

jingong commented 1 year ago

Thanks~ Do you mean renumbering the first 40 classes as 1-40 and the last 40 classes as 1-40 as well?

xinlong007 commented 1 year ago

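I made two JSON files. The first selects the first 40 classes from the COCO data (original labels 1-44); I simply extracted them into a JSON and did not change the label IDs. The second selects the last 40 classes from the COCO data (original labels 46-90); I just subtracted 45 from each original label so that they start from 1. Overall it is not strictly 1-40; it only needs to start from 1. P.S. Have you trained the first-40-class model (gfl_r50_fpn_1x_coco_first_40_cats)?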

jingong commented 1 year ago

I have trained the first-40-class model, and it showed no NaN loss. But when doing incremental training with the last 40 classes, the loss becomes NaN halfway through the first epoch.

xinlong007 commented 1 year ago

OK, try it this way first.

Parsifal133 commented 1 year ago

Even with correct labels, loss=NaN can still appear partway through training. In my case, lowering the learning rate (e.g. 0.01 -> 0.001) fixed it.
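
As a rough sketch of that change in an MMDetection 2.x style config: the `optimizer` entry below uses the reduced value, and `scaled_lr` (a hypothetical helper) applies the linear scaling rule from the MMDetection documentation, where default learning rates assume 8 GPUs x 2 images per GPU. The momentum and weight_decay values, and the assumption that the ERD configs follow this convention, are not confirmed in this thread.

```python
def scaled_lr(base_lr, num_gpus, samples_per_gpu, base_batch_size=16):
    """Scale a learning rate linearly with the total batch size."""
    return base_lr * (num_gpus * samples_per_gpu) / base_batch_size


# Optimizer entry with the reduced learning rate (other values are assumed defaults).
optimizer = dict(type='SGD', lr=0.001, momentum=0.9, weight_decay=0.0001)

# A config tuned for 8 GPUs x 2 images, run on a single GPU with 2 images:
lr_one_gpu = scaled_lr(0.01, num_gpus=1, samples_per_gpu=2)
```

Under these assumptions the single-GPU value comes out at 0.00125, close to the 0.001 that stopped the NaN here.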

jingong commented 1 year ago

@xinlong007 What accuracy did you get after incrementally training the last 40 classes on top of the first 40? After incremental training I get bbox_mAP: 0.3040, bbox_mAP_50: 0.4550; the mAP_50 is still 9% below the paper.

xinlong007 commented 1 year ago

My results look like this:

mAP@0.50:0.95 of the first 40 cats: 0.000
mAP@0.50:0.95 of the last 40 cats: 0.035
OrderedDict([('bbox_mAP', 0.018), ('bbox_mAP_50', 0.026), ('bbox_mAP_75', 0.02), ('bbox_mAP_s', 0.003), ('bbox_mAP_m', 0.015), ('bbox_mAP_l', 0.031), ('bbox_mAP_copypaste', '0.018 0.026 0.020 0.003 0.015 0.031')])

I wonder whether I misconfigured something. The results are not good, and the first 40 classes were not evaluated at all. This result comes from training the first 40 classes for 12 epochs and then training the last 40 classes for another 12 epochs on top of that. Did you train for the same number of epochs? Did you change the config files much? I set both ori_num_classes and num_classes to 80 in gfl_r50_fpn_1x_coco_first_40_incre_last_40_cats.py, because otherwise it raises an error. Did you configure it the same way, or did you make other changes?

xinlong007 commented 1 year ago

@Parsifal133 OK, understood. Did you make any special changes to the config files?

lonelyqian commented 1 year ago

Has anyone run into this error? I am already using the latest mmcv, mmdet, pytorch and torchvision. Interestingly, the error only happens when I train "first_40_incre_last_40" on multiple GPUs by running the command "./dist_train.sh /configs/gfl_incre/gfl_r50_fpn_1x_coco_first_40_incre_last_40_cats.py 4". When I do the same thing on a single GPU, the error disappears. The error is as follows:

2023-02-05 23:43:40,009 - mmdet - INFO - Epoch [1][50/6825] lr: 9.890e-04, eta: 9:54:33, time: 0.436, data_time: 0.074, memory: 5618, loss_cls: 0.1986, loss_bbox: 1.4780, loss_dfl: 0.6856, loss_dist_cls: 0.0573, loss_dist_bbox: 0.0152, loss: 2.4347
Traceback (most recent call last):
  File "./train.py", line 247, in <module>
    main()
  File "./train.py", line 236, in main
    train_detector(
  File "/root/autodl-tmp/qfs/workspace/IOD/mmdetection/mmdet/apis/train.py", line 246, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/root/miniconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/root/miniconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/root/miniconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 31, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/root/miniconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 63, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/root/autodl-tmp/qfs/workspace/IOD/mmdetection/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/root/miniconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 119, in new_func
    return old_func(*args, **kwargs)
  File "/root/autodl-tmp/qfs/workspace/IOD/mmdetection/mmdet/models/detectors/base.py", line 172, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/root/autodl-tmp/qfs/workspace/IOD/mmdetection/mmdet/models/detectors/gfl_incre.py", line 230, in forward_train
    losses = self.bbox_head.loss(*loss_inputs)
  File "/root/miniconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 208, in new_func
    return old_func(*args, **kwargs)
  File "/root/autodl-tmp/qfs/workspace/IOD/mmdetection/mmdet/models/dense_heads/gfl_head_incre.py", line 332, in loss
    _, keep_1 = batched_nms(thr_bboxes_1, thr_scores_1, thr_id_1, nms_cfg)
  File "/root/miniconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/ops/nms.py", line 339, in batched_nms
    max_coordinate = boxes.max()
RuntimeError: max(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
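
The final frame shows `boxes.max()` being called on an empty tensor, which suggests that no box passed the threshold for that batch before `batched_nms` was called. One possible workaround, sketched here without any framework, is to skip NMS when there are no boxes. `nms_or_empty` and `fake_nms` are hypothetical names; in the real code `nms_fn` would be `mmcv.ops.batched_nms`, and the empty `keep` would need to be an empty integer tensor rather than a list.

```python
def nms_or_empty(nms_fn, boxes, scores, idxs, nms_cfg):
    """Call nms_fn only when there is at least one box; otherwise return empties."""
    if len(boxes) == 0:
        # Nothing survived the score threshold for this batch: skip NMS.
        return boxes, []
    return nms_fn(boxes, scores, idxs, nms_cfg)


def fake_nms(boxes, scores, idxs, nms_cfg):
    """Toy stand-in that keeps every box; used only to exercise the guard."""
    return boxes, list(range(len(boxes)))


cfg = dict(iou_threshold=0.5)
dets, keep = nms_or_empty(fake_nms, [[0, 0, 10, 10]], [0.9], [0], cfg)
empty_dets, empty_keep = nms_or_empty(fake_nms, [], [], [], cfg)
```

Whether an empty set of boxes at this point is expected behaviour or a symptom of another problem is not established in this thread.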

lonelyqian commented 1 year ago

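@jingong @Parsifal133 @xinlong007 I would like to ask you all a question.

My environment is set up like this: I first installed the latest mmdet (2.26.0), then added the new components from ERD to it. Is this setup correct?
I ran into the same situation as @xinlong007: training the first 40 classes with "configs/gfl_incre/gfl_r50_fpn_1x_coco_first_40_cats.py" gives a normal AP, but training configs/gfl_incre/gfl_r50_fpn_1x_coco_first_40_incre_last_40_cats.py on top of it gives a very small AP, close to 0, although the loss never becomes NaN. I used the training-JSON splitting code provided in ERD directly. Did any of you modify these two configs or any of the ERD code?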

xinlong007 commented 1 year ago

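@lonelyqian Hello, what does "the training-JSON splitting code provided in ERD" refer to? I may not have found it, so please point it out. I would like to set up a WeChat group so we can discuss this together. If convenient, send your WeChat ID to fw422@sina.com and I will add everyone to the group.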

shenxiangkei commented 1 year ago

@xinlong007 Hello, when doing incremental learning on the last 40 classes, shouldn't the validation set contain both the first 40 and the last 40 classes? In that case, aren't the training labels (train2017.json) still 1-80, just without samples of the first 40 classes in the training set?

xinlong007 commented 1 year ago

Yes, you are right. What I said was indeed wrong.

Zesheng666 commented 1 year ago

Where can I find the epoch_12.pth file?

xinlong007 commented 1 year ago

@Zesheng666 You need to train it yourself.

shixingy commented 1 year ago

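Could we add each other on WeChat to talk in detail? yzs1721770653

Did you get that epoch file? I saw in another issue that someone provided one, but after training with it the mAP of the first 40 classes is only 0.199, far from the paper. When I try to train it myself I am not clear on the procedure: using that resnet50 .pth file for training raises some errors. What is the procedure for obtaining the first-40-class weights?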

shixingy commented 1 year ago

@lonelyqian Could we discuss this?

jingong commented 1 year ago

@shixingy Link: https://pan.baidu.com/s/1Kd48CZ-rKgPEH5MR-5__0Q Extraction code: 99yk

You can try these weights, but I trained them myself, so they are not the authors' weights. first40, Epoch(val) [12][12149] bbox_mAP: 0.4510, bbox_mAP_50: 0.6490, bbox_mAP_75: 0.4840, bbox_mAP_s: 0.2960, bbox_mAP_m: 0.4950, bbox_mAP_l: 0.5530, bbox_mAP_copypaste: 0.451 0.649 0.484 0.296 0.495 0.553

shixingy commented 1 year ago

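OK, thank you very much. Does incremental training require any changes to the config files? The mAP of my first 40 classes drops too much. Also, to train the first-40-class weights myself, do I just train with gfl_r50_fpn_1x_coco_first_40_cats.py after downloading pretrained ResNet weights?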

Zesheng666 commented 1 year ago

@xinlong007 I ran into the same problem. How did you solve it in the end?

xinlong007 commented 1 year ago

Are you using the mmdet from the ERD authors' source code, or the official mmdetection? I initially used the official mmdetection and hit this problem; using the authors' mmdet seemed to fix it. Also, in the config, ori_num_classes should be set to 40 and num_classes to 80.
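
The relationship between the two settings can be written down as a small sanity check. This is only a sketch: `check_incremental_classes` is a hypothetical helper, and the way ERD itself validates these values is not shown in this thread.

```python
def check_incremental_classes(ori_num_classes, num_classes, dataset_classes):
    """Raise ValueError if the class counts are inconsistent for an incremental step."""
    if num_classes != len(dataset_classes):
        raise ValueError(
            f'num_classes={num_classes} but the dataset has '
            f'{len(dataset_classes)} classes')
    if not 0 < ori_num_classes < num_classes:
        raise ValueError(
            f'ori_num_classes={ori_num_classes} must be between 1 '
            f'and {num_classes - 1}')
    return num_classes - ori_num_classes  # number of newly added classes


all_classes = [f'class_{i}' for i in range(80)]  # placeholder class names
new_count = check_incremental_classes(40, 80, all_classes)

# Setting both values to 80 leaves no new classes to learn, so it is rejected.
try:
    check_incremental_classes(80, 80, all_classes)
    rejected = False
except ValueError:
    rejected = True
```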

shixingy commented 1 year ago

What do you mean by using the mmdet from the authors' source code?

oldfashionyyf commented 1 year ago

How do I obtain instances_val2017_sel_first_40_cats.json and last_40.json?

shixingy commented 1 year ago

There is a select_categories.py under scripts in the source code; use that.
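
For readers who want to see what such a category-selection step could look like, here is a minimal sketch on a toy COCO-style dictionary. It is not the repository's select_categories.py, whose contents are not shown in this thread. `select_categories` is a hypothetical function that leaves category IDs unchanged and drops images with no remaining annotations, both of which are assumptions.

```python
def select_categories(coco, start, stop):
    """Keep categories ranked [start, stop) by ascending id, with their annotations."""
    cats = sorted(coco['categories'], key=lambda c: c['id'])[start:stop]
    keep_ids = {c['id'] for c in cats}
    anns = [a for a in coco['annotations'] if a['category_id'] in keep_ids]
    img_ids = {a['image_id'] for a in anns}
    imgs = [im for im in coco['images'] if im['id'] in img_ids]
    return dict(coco, categories=cats, annotations=anns, images=imgs)


toy = {
    'images': [{'id': 1}, {'id': 2}, {'id': 3}],
    'categories': [{'id': 1, 'name': 'person'}, {'id': 2, 'name': 'bicycle'},
                   {'id': 46, 'name': 'wine glass'}, {'id': 47, 'name': 'cup'}],
    'annotations': [{'id': 10, 'image_id': 1, 'category_id': 1},
                    {'id': 11, 'image_id': 2, 'category_id': 46},
                    {'id': 12, 'image_id': 2, 'category_id': 2}],
}
first_half = select_categories(toy, 0, 2)
last_half = select_categories(toy, 2, 4)
```

On the full COCO 2017 annotations, the equivalent calls would presumably be `select_categories(coco, 0, 40)` and `select_categories(coco, 40, 80)` after loading the file with `json.load`.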

oldfashionyyf commented 1 year ago

Thank you very much! One more question: what is the purpose of the other script under scripts, find_repetitive_images.py?

shixingy commented 1 year ago

Not sure. It looks like it finds images that are duplicated between the two JSON files.

shixingy commented 1 year ago

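After incremental training, the accuracy does not match the paper at all. Why is that? Has anyone reproduced the accuracy without the mAP of the first 40 classes dropping a lot?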

jingong commented 1 year ago

My results seem a bit better than yours, but I have not found where exactly the problem is. Epoch(val) [12][13649] bbox_mAP: 0.3040, bbox_mAP_50: 0.4550, bbox_mAP_75: 0.3250, bbox_mAP_s: 0.1600, bbox_mAP_m: 0.3370, bbox_mAP_l: 0.3970, bbox_mAP_copypaste: 0.304 0.455 0.325 0.160 0.337 0.397

shixingy commented 1 year ago

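But the main issue is that the accuracy of the first 40 classes drops too much. This is supposed to be incremental learning, and yet the first 40 classes drop this much...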

jingong commented 1 year ago

I experimented with incrementally adding classes 11-15 and classes 11-20 on top of the first 10 classes. After incremental training, the mAP@0.5:0.95 of the first 10 classes dropped by roughly 10% in each case. It feels like something in the code has not been modified correctly, but I cannot find the problem...

shixingy commented 1 year ago

What accuracy did you get for the first 40 classes this time? I saw earlier comments mention an mmdet issue; did you download yours from the official source? I would also like to ask what you modified overall. Mine does not drop by 10%; it drops by more than 20%. Thanks a lot.

jingong commented 1 year ago

The accuracy of the first 40 classes is about 0.3. I used the code released by the authors. My environment is: python 3.6.9, torch 1.7.0+cu110, mmcv-full 1.7.1, mmdet 2.10.0. Because of GPU support issues, I could not install the environment versions specified by the authors.

shixingy commented 1 year ago

OK, thank you very much. My environment is about the same: Python 3.7, torch 1.7.0+cu110, mmcv-full 1.2.7, mmdet 2.10.0. Could we add each other on WeChat to talk in detail? sxy15516613690

oldfashionyyf commented 1 year ago

AssertionError: The num_classes (40) in GFLHead of MMDataParallel does not matches the length of CLASSES 80) in CocoDataset

How can I fix this error when training the first 40 classes?
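
This assertion compares the head's num_classes with the dataset's CLASSES. With a stock MMDetection 2.x CocoDataset, one way to make them agree is to pass a `classes` tuple to each dataset entry, as in the config sketch below. Whether the ERD configs are meant to be used this way is an assumption; `FIRST_40` is a name chosen here for the first 40 COCO category names in ID order.

```python
# First 40 COCO category names in ascending category-id order (ids 1-44).
FIRST_40 = (
    'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train',
    'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard',
    'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard',
    'surfboard', 'tennis racket', 'bottle')

# Config fragment: restrict every split to these classes and size the head to match.
data = dict(
    train=dict(classes=FIRST_40),
    val=dict(classes=FIRST_40),
    test=dict(classes=FIRST_40))
model = dict(bbox_head=dict(num_classes=len(FIRST_40)))
```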

yycweng commented 1 year ago

@lonelyqian I ran into the same problem. Have you solved it?

miaomiaojun122 commented 12 hours ago

When I do 40+20+20 multi-step training, loading the weights always reports an error: unexpected key in source state_dict: ori_model.backbone.conv1.weight, ori_model.backbone.bn1.weight, ori_model.backbone.bn1.bias, ori_model.backbone.bn1.running_mean, ori_model.backbone.bn1.running_var, ori_model.backbone.bn1.num_batches_tracked, ori_model.backbone.layer1.0.conv1.weight, ori_model.backbone.layer1.0.bn1.weight, ori_model.backbone.layer1.0.bn1.bias, ori_model.backbone.layer1.0.bn1.running_mean, ori_model.backbone.layer1.0.bn1.running_var, ori_model.backbone.layer1.0.bn1.num_batches_tracked, ori_model.backbone.layer1.0.conv2.weight, ori_model.backbone.layer1.0.bn2.weight, ori_model.backbone.layer1.0.bn2.bias, ori_model.backbone.layer1.0.bn2.running_mean, ori_model.backbone.layer1.0.bn2.running_var, ori_model.backbone.layer1.0.bn2.num_batches_tracked, ori_model.backbone.layer1.0.conv3.weight, ori_model.backbone.layer1.0.bn3.weight, ori_model.backbone.layer1.0.bn3.bias, ori_model.backbone.layer1.0.bn3.running_mean, ori_model.backbone.layer1.0.bn3.running_var, ori_model.backbone.layer1.0.bn3.num_batches_tracked, ori_model.backbone.layer1.0.downsample.0.weight, ......

How should multi-step training be done? Thank you.

miaomiaojun122 commented 12 hours ago

Everyone, how do I do multi-step training? Thanks. I am training 40+20+20, and in the last step, loading the weights always reports an error: unexpected key in source state_dict: ori_model.backbone.conv1.weight, ori_model.backbone.bn1.weight, ori_model.backbone.bn1.bias, ori_model.backbone.bn1.running_mean, ori_model.backbone.bn1.running_var, ori_model.backbone.bn1.num_batches_tracked, ori_model.backbone.layer1.0.conv1.weight, ori_model.backbone.layer1.0.bn1.weight, ori_model.backbone.layer1.0.bn1.bias, ori_model.backbone.layer1.0.bn1.running_mean, ori_model.backbone.layer1.0.bn1.running_var, ori_model.backbone.layer1.0.bn1.num_batches_tracked, ori_model.backbone.layer1.0.conv2.weight, ori_model.backbone.layer1.0.bn2.weight, ori_model.backbone.layer1.0.bn2.bias, ori_model.backbone.layer1.0.bn2.running_mean, ori_model.backbone.layer1.0.bn2.running_var, ori_model.backbone.layer1.0.bn2.num_batches_tracked, ori_model.backbone.layer1.0.conv3.weight, ori_model.backbone.layer1.0.bn3.weight, ori_model.backbone.layer1.0.bn3.bias, ori_model.backbone.layer1.0.bn3.running_mean, ori_model.backbone.layer1.0.bn3.running_var, ori_model.backbone.layer1.0.bn3.num_batches_tracked, ori_model.backbone.layer1.0.downsample.0.weight, ...... The weights do not match.
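
The unexpected keys all start with `ori_model.`, which suggests that the checkpoint saved after the previous incremental step also contains a copy of the older model. One possible workaround, assuming those entries are not needed as the starting point of the next step, is to drop them before loading. `drop_prefixed_keys` is a hypothetical helper and the checkpoint below is a toy dictionary; a real one would come from `torch.load(path, map_location='cpu')`.

```python
def drop_prefixed_keys(state_dict, prefix='ori_model.'):
    """Return a copy of state_dict without entries whose key starts with prefix."""
    return {k: v for k, v in state_dict.items() if not k.startswith(prefix)}


# Toy checkpoint with placeholder values instead of tensors.
checkpoint = {
    'state_dict': {
        'backbone.conv1.weight': 'w_new',
        'bbox_head.gfl_cls.weight': 'w_head',
        'ori_model.backbone.conv1.weight': 'w_old',
        'ori_model.backbone.bn1.bias': 'b_old',
    }
}
cleaned = dict(checkpoint,
               state_dict=drop_prefixed_keys(checkpoint['state_dict']))
```

The cleaned dictionary could then be written back with `torch.save` and used as the checkpoint for the next step. The original checkpoint is left unmodified.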

Hi-FT / ERD

Bunch of Issues #7
