Open X00123 opened 2 years ago
Hi,
Thank you for your interest. Here zs_weight_dim is the dimension of the CLIP embedding, which is always 512. The classification layer zs_weight has shape zs_weight_dim x num_classes. When loading a custom vocabulary, we change the number of classes, but not zs_weight_dim. Hope that helps.
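In matrix terms, a minimal numpy sketch of the shapes (an illustration, not the actual Detic code):

```python
import numpy as np

# RoI features are projected into CLIP space (dim 512) and then scored
# against the vocabulary; only the num_classes axis ever changes.
zs_weight_dim, num_classes = 512, 13          # num_classes = size of your vocabulary
feats = np.random.randn(100, zs_weight_dim)   # 100 projected RoI features
zs_weight = np.random.randn(zs_weight_dim, num_classes)  # CLIP text embeddings
scores = feats @ zs_weight                    # (100, num_classes)
```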
Best, Xingyi
Oh, I got it. Thanks for your answer!
I still have a few more questions. Can you help me with the following?
I'm trying to use Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth as the pretrained model and train on my own dataset, which has 13 labels. It shows the following warnings:
WARNING [08/23 10:09:56 fvcore.common.checkpoint]: Skip loading parameter 'roi_heads.box_predictor.0.cls_score.zs_weight' to the model due to incompatible shapes: (512, 22048) in the checkpoint but (512, 14) in the model! You might want to double check if this is expected
WARNING [08/23 10:09:56 fvcore.common.checkpoint]: Skip loading parameter 'roi_heads.box_predictor.1.cls_score.zs_weight' to the model due to incompatible shapes: (512, 22048) in the checkpoint but (512, 14) in the model! You might want to double check if this is expected
WARNING [08/23 10:09:56 fvcore.common.checkpoint]: Skip loading parameter 'roi_heads.box_predictor.2.cls_score.zs_weight' to the model due to incompatible shapes: (512, 22048) in the checkpoint but (512, 14) in the model! You might want to double check if this is expected
According to my understanding, zs_weight is the CLIP text embedding, so I can actually ignore this warning. Am I correct?
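For reference, a hedged sketch of how one could build a custom zs_weight file with the openai `clip` package. Detic ships `tools/dump_clip_features.py` for this, so the prompt format and on-disk layout below are assumptions; double-check against that script:

```python
import clip
import numpy as np
import torch

labels = ["cat", "dog", "horse"]  # replace with your 13 animal labels
model, _ = clip.load("ViT-B/32", device="cpu")  # CLIP text encoder, 512-dim output
tokens = clip.tokenize([f"a {c}" for c in labels])
with torch.no_grad():
    feats = model.encode_text(tokens).float()
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize, as CLIP does
# layout assumption: (num_classes, 512), transposed by Detic at load time
np.save("datasets/metadata/animal_detection_clip.npy", feats.cpu().numpy())
```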
Also, I want to use Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth as the pretrained model to try some few-shot learning methods, so I collected data for 13 animal detection labels, with about 1000 images per label. I modified your config to look like this:
```yaml
_BASE_: "Base-C2_L_R5021k_640b64_4x.yaml"
MODEL:
  WEIGHTS: "models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth"
  DYNAMIC_CLASSIFIER: False
  ROI_BOX_HEAD:
    USE_ZEROSHOT_CLS: True
    IMAGE_LABEL_LOSS: 'max_size'
    # ZEROSHOT_WEIGHT_PATH: 'datasets/metadata/animal_detection_clip.npy'
    ZEROSHOT_WEIGHT_PATH: 'datasets/metadata/lvis-21k_clip_a+cname.npy'
    USE_FED_LOSS: False  # Federated loss is enabled when DYNAMIC_CLASSIFIER is on
  ROI_HEADS:
    NUM_CLASSES: 22047
    # NUM_CLASSES: 13
  BACKBONE:
    NAME: build_swintransformer_fpn_backbone
  SWIN:
    SIZE: B-22k
  FPN:
    IN_FEATURES: ["swin1", "swin2", "swin3"]
SOLVER:
  MAX_ITER: 3000
  IMS_PER_BATCH: 16
  BASE_LR: 0.0001
  WARMUP_ITERS: 1000
  WARMUP_FACTOR: 0.001
DATASETS:
  TRAIN: ("动物检测part02",)
WITH_IMAGE_LABELS: False
FP16: False
```
In detail, I changed MAX_ITER from 180000 to 3000 because I have much less data than ImageNet, and IMS_PER_BATCH to 16 because a batch size of 32 runs out of memory. I also set DYNAMIC_CLASSIFIER and WITH_IMAGE_LABELS to False.
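For anyone reproducing this: the custom dataset name in DATASETS.TRAIN has to be registered with detectron2 before training. A sketch using the standard COCO-format registration (the annotation and image paths below are placeholders):

```python
from detectron2.data.datasets import register_coco_instances

# Register the custom dataset under the name used in DATASETS.TRAIN.
register_coco_instances(
    "动物检测part02",
    {},
    "datasets/animal_part02/annotations.json",  # COCO-format annotation file
    "datasets/animal_part02/images",            # image root directory
)
```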
I tried two training methods: one is to create a new CLIP text embedding weight that contains only my own 13 labels and use it as the new zs_weight; the other is to convert my own data to the lvis_22k label format and use lvis-21k_clip_a+cname.npy as zs_weight. I then trained the model with the hyperparameters above, but both give worse results than the baseline. In detail, the first method gets higher recall but much lower precision on the test set, while the second drops both recall and precision.
Do you have any advice on how to fine-tune your pretrained model on my own small dataset?
I'm also confused by DYNAMIC_CLASSIFIER. Is it some kind of sampling strategy? I found where the code uses it but can't figure out what it does.
And in zero_shot_classifier.py, you force the bias to be less than 0 here. Is this a trick to make training more stable? I didn't find a similar strategy in the original CLIP code.
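For context, a negative classification bias is a well-known stabilization trick from the RetinaNet/focal-loss line of work: every class starts with a low predicted probability, so the loss from the huge number of negatives doesn't blow up early in training. A sketch of that style of initialization (this may or may not be Detic's exact motivation):

```python
import math
import torch
import torch.nn as nn

prior_prob = 0.01  # assumed prior probability that any one class fires
# solve sigmoid(bias) = prior_prob for the bias value
bias_value = -math.log((1 - prior_prob) / prior_prob)  # ≈ -4.6
cls_bias = nn.Parameter(torch.ones(1) * bias_value)    # shared negative bias
```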
Hope you can clear up my doubts when you have some free time~
Thanks~
Hi,
Sorry for my delayed reply.
Fine-tuning on a small dataset can make the model overfit and degrade the (class-agnostic) proposals; you can inspect the proposals alone by setting MODEL.META_ARCHITECTURE to ProposalNetwork. If this is the case, I would suggest co-training with LVIS. You can refer to this to find out how to set up co-training.
DYNAMIC_CLASSIFIER samples a subset of the classifier's classes at each iteration to save memory; the number of sampled classes K follows FedLoss (search use_fed_loss in the code for details).
Best, Xingyi
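To illustrate what FedLoss-style class sampling does, here is a rough, simplified sketch (the real implementation weights the negatives by class frequency; see use_fed_loss in the Detic code):

```python
import torch

def sample_classifier_subset(gt_classes, num_classes, num_sample):
    """Pick which class weights are used this iteration (simplified)."""
    appeared = torch.unique(gt_classes)    # classes present in this batch
    if appeared.numel() < num_sample:
        prob = torch.ones(num_classes)
        prob[appeared] = 0                 # don't resample the positives
        # pad with random negative classes (Detic weights by frequency)
        extra = torch.multinomial(prob, num_sample - appeared.numel())
        appeared = torch.cat([appeared, extra])
    return appeared
```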
How did you do the zero-shot training? Could you share the concrete process, or any tutorial you followed? How should I prepare my own dataset, and how should I adjust the training code and parameters? Thanks.
Zero-shot refers to zero-shot learning, i.e., running inference directly without any training at all. I'm guessing you actually want to train with a small number of images, i.e., few-shot learning?
The key point is that you need to mix your own dataset with the LVIS dataset and train on them together for it to work well.
As for the concrete process: if your categories are the same as the LVIS categories, you can mix your dataset with LVIS and train directly, without any modification. If your label categories differ from LVIS, you need to generate a new CLIP feature file covering the LVIS categories plus your own categories and train with that; for how to generate the features, see the authors' documentation: https://github.com/facebookresearch/Detic/blob/main/datasets/README.md
There aren't really many tutorials; I worked it out step by step from the authors' documentation and the detectron2 usage guide, stumbling through the pitfalls.
As for dataset preparation: the authors implemented the code on top of the detectron2 framework, so you can organize your dataset following the detectron2 documentation. For the training code and parameter configuration, you can refer to the config I posted in my question above and adjust it to your own situation.
Hope this helps~
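Extending the classifier file with new categories could look roughly like this. The output file name is hypothetical, the custom .npy comes from a dump like the sketch earlier in the thread, and the (num_classes, 512) layout is an assumption to verify against the files the Detic docs produce:

```python
import numpy as np

lvis = np.load("datasets/metadata/lvis-21k_clip_a+cname.npy")
mine = np.load("datasets/metadata/animal_detection_clip.npy")  # custom labels
# assuming both files are stored as (num_classes, 512)
assert lvis.shape[1] == mine.shape[1] == 512
combined = np.concatenate([lvis, mine], axis=0)  # LVIS classes first, then yours
np.save("datasets/metadata/lvis-21k_plus_animals_clip.npy", combined)
```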
May I ask: if I just want to use the pretrained Detic to extract object features, how should I do that?
Hi,
I read your whole codebase and think it's excellent work.
But I still have some doubts. Can you help me with a few questions?
From what I understand, the detection network generates proposal regions and then converts them to CLIP image embeddings with this code. But when we use a custom vocabulary, CLIP creates a new text embedding whose shape is inconsistent with the pretrained model, which should make zs_weight_dim here differ from the pretrained model, so there should be an error in nn.Linear. Yet nothing happens when I use a custom vocabulary.
Did I misunderstand something?
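For reference, my current mental model of the head is the sketch below (my own simplification, not the repo's code). If it's right, the nn.Linear output dimension is pinned to zs_weight_dim = 512 and only the text-embedding matrix changes width, which would explain why no shape error occurs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroShotHead(nn.Module):
    """Simplified zero-shot classification head (illustrative only)."""
    def __init__(self, input_size=1024, zs_weight_dim=512):
        super().__init__()
        # projection into CLIP space; its shape depends only on
        # zs_weight_dim (always 512), never on the vocabulary size
        self.linear = nn.Linear(input_size, zs_weight_dim)

    def forward(self, x, zs_weight):
        # zs_weight: (zs_weight_dim, num_classes) CLIP text embeddings.
        # Swapping the vocabulary changes num_classes only, so nn.Linear
        # never sees a shape mismatch.
        x = F.normalize(self.linear(x), dim=-1)
        return x @ zs_weight

# the same head scores against vocabularies of different sizes
head = ZeroShotHead()
feats = torch.randn(8, 1024)
print(head(feats, torch.randn(512, 22047)).shape)  # torch.Size([8, 22047])
print(head(feats, torch.randn(512, 13)).shape)     # torch.Size([8, 13])
```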