How zero shot work? - Githubissues

X00123 commented 2 years ago

Hi，

I read your whole code and think it’s a excellect work

But I still have some doubts, can you help me with some questions?

From what I understand, detection network generate proposal region and then convert it to clip image embedding with this code, but when we use custom vocabulary, clip will create a new text embedding which shape is inconsistent with pre-trained model, that will lead zs_weight_dim in here is different from pre-trained model so there should be an error when using nn.Linear, but nothing happens when I use custom vocabulary

Did I mistake something？

xingyizhou commented 2 years ago

Hi,

Thank you for your interests. Here zs_weight_dim is the dimension of the CLIP embedding, which is always 512. The classification layer zs_weight is in the shape of zs_weight_dim x num_classes. When loading a custom vocabulary, we changed the number of classes, but not zs_weight_dim. Hope that helps.

Best, Xingyi

X00123 commented 2 years ago

Oh I got it. Thanks for your answer！

And I still have some more question. Can you help me with the following?

I try to use Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth as pretrained model, and training my own dataset which has 13 labels, it shows the following warning

WARNING [08/23 10:09:56 fvcore.common.checkpoint]: Skip loading parameter 'roi_heads.box_predictor.0.cls_score.zs_weight' to the model due to incompatible shapes: (512, 22048) in the checkpoint but (512, 14) in the model! You might want to double check if this is expected

WARNING [08/23 10:09:56 fvcore.common.checkpoint]: Skip loading parameter 'roi_heads.box_predictor.1.cls_score.zs_weight' to the model due to incompatible shapes: (512, 22048) in the checkpoint but (512, 14) in the model! You might want to double check if this is expected

WARNING [08/23 10:09:56 fvcore.common.checkpoint]: Skip loading parameter 'roi_heads.box_predictor.2.cls_score.zs_weight' to the model due to incompatible shapes: (512, 22048) in the checkpoint but (512, 14) in the model! You might want to double check if this is expected

According to my understanding, zs_weight is CLIP text embedding so actually I can ignore this warning, did I correct?

Also, I want use Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth as pretrained model to try some few-shot learning method, so I collect data for 13 animal detection labels, each label has about 1000 images, I modify your config and make it like this
```
_BASE_: "Base-C2_L_R5021k_640b64_4x.yaml"
MODEL:
WEIGHTS: "models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth"
DYNAMIC_CLASSIFIER: False
ROI_BOX_HEAD:
  USE_ZEROSHOT_CLS: True
  IMAGE_LABEL_LOSS: 'max_size'
  # ZEROSHOT_WEIGHT_PATH: 'datasets/metadata/animal_detection_clip.npy'
  ZEROSHOT_WEIGHT_PATH: 'datasets/metadata/lvis-21k_clip_a+cname.npy'
  USE_FED_LOSS: False # Federated loss is enabled when DYNAMIC_CLASSIFIER is on
ROI_HEADS:
  NUM_CLASSES: 22047
  # NUM_CLASSES: 13
BACKBONE:
  NAME: build_swintransformer_fpn_backbone
SWIN:
  SIZE: B-22k
FPN:
  IN_FEATURES: ["swin1", "swin2", "swin3"]
SOLVER:
MAX_ITER: 3000
IMS_PER_BATCH: 16
BASE_LR: 0.0001
WARMUP_ITERS: 1000
WARMUP_FACTOR: 0.001
DATASETS:
TRAIN: ("动物检测part02",)
WITH_IMAGE_LABELS: False
FP16: False
```
In detail, I modify MAX_ITER from 180000 to 3000 because I have much less data than ImageNet, and IMS_PER_BATCH to 16 because 32 batch-size with get OOM, I also setDYNAMIC_CLASSIFIER and WITH_IMAGE_LABELS to False

I tried two methods for training, one is create new CLIP text embedding weight which only has my own 13 labels and use it as new zs_weight, and the other is convert my own data to lvis_22k labels format and use lvis-21k_clip_a+cname.npy as zs_weight, then I training the model using above hyperparameters, but both give worse result than basline. In detail, first method get higher recall but much lower precision in test dataset, second method get both recall and precision drop.

Do you have any advice about how to finetune my own small dataset on your pretrained model?
I also confuse with the DYNAMIC_CLASSIFIER function, Is it some kind of sampling stategy? i found using code but can't figure out what it did.
And in zero_shot_classifier.py, you lead the bias must be less than 0 in here, is this a trick to make training more stable? I did't find similar stategy in raw CLIP code.

Hope you can solve my doubts when you have free time~

Thanks~

xingyizhou commented 2 years ago

Hi,

Sorry for my delayed reply.

Yes you can ignore the warning.
If you own training data is small, one possible reason might be training on your data alone can destroy the region proposal network. You can check this by checking the proposal recall (setting MODEL.META_ARCHITECTURE ProposalNetwork). If this is the case, I would suggest co-training with LVIS. You can refer to this to find out how to setup co-training.
What it is doing is making the classifier (z_weight) not C x D (C is the number of classes of the dataset; for LVIS C=1204, for IN-21K C>21000) but K x D (K=50) during training. This makes training on IN-21K efficient. How to sample the K follows FedLoss (search use_fed_loss in code for details).
Yes this is to make the initial output to be small. E.g., focal loss sets bias = -4.6 so that the initial prediction is sigmoid(-4.6)=0.01. I believe I didn't find it useful in Detic.

Best, Xingyi

Yummy-Lee commented 1 year ago

Hope you can solve my doubts when you have free time~

Thanks~

请问你是怎么进行zore-shot的训练的呢？具体过程可以分享一下吗？或者参考了什么教程呢？自己的数据集怎么准备还有训练代码和参数怎么调整？谢谢

X00123 commented 1 year ago

请问你是怎么进行zore-shot的训练的呢？具体过程可以分享一下吗？或者参考了什么教程呢？自己的数据集怎么准备还有训练代码和参数怎么调整？谢谢

zero-shot指的是零样本学习，意思就是完全不训练直接用来做推理，我猜你是想用少量图片进行训练做few-shot learning？

最关键点是需要把自己准备的数据集和lvis数据集混在一起训才能有效果

具体过程上，如果你的类别和lvis里的类别相同，那么你可以不修改直接将你的数据集和lvis数据集混在一起训练就可以；如果你的标签类别和lvis里的类别不相同，那么你需要生成一个lvis标签类别+你自己标签类别的新clip特征并进行训练，特征生成的方法可以参考作者的文档https://github.com/facebookresearch/Detic/blob/main/datasets/README.md

也没有什么太多的教程，都是对着作者给的文档以及detectron2的使用说明一点点踩坑踩出来

关于数据集准备，作者是在detectron2框架上实现的代码，所以数据集的组织可以参考detectron2对应的文档；训练代码和参数配置可以参考我上面提问那一条里贴的配置文档，你可以根据自己的情况进行调整

希望能够帮助到你~

FUIGUIMURONG commented 1 year ago

请问你是怎么进行zore-shot的训练的呢？具体过程可以分享一下吗？或者参考了什么教程呢？自己的数据集怎么准备还有训练代码和参数怎么调整？谢谢

zero-shot指的是零样本学习，意思就是完全不训练直接用来做推理，我猜你是想用少量图片进行训练做few-shot learning？

最关键点是需要把自己准备的数据集和lvis数据集混在一起训才能有效果

具体过程上，如果你的类别和lvis里的类别相同，那么你可以不修改直接将你的数据集和lvis数据集混在一起训练就可以；如果你的标签类别和lvis里的类别不相同，那么你需要生成一个lvis标签类别+你自己标签类别的新clip特征并进行训练，特征生成的方法可以参考作者的文档https://github.com/facebookresearch/Detic/blob/main/datasets/README.md

也没有什么太多的教程，都是对着作者给的文档以及detectron2的使用说明一点点踩坑踩出来

关于数据集准备，作者是在detectron2框架上实现的代码，所以数据集的组织可以参考detectron2对应的文档；训练代码和参数配置可以参考我上面提问那一条里贴的配置文档，你可以根据自己的情况进行调整

希望能够帮助到你~

我想请问您，如果只是想使预训练的Detic进行目标特征提取工作，应该怎么做呢？

facebookresearch / Detic

How zero shot work? #73