facebookresearch / Detic

Code release for "Detecting Twenty-thousand Classes using Image-level Supervision".
Apache License 2.0
1.86k stars 211 forks

How does zero-shot work? #73

Open X00123 opened 2 years ago

X00123 commented 2 years ago

Hi,

I read through your whole codebase and think it's excellent work.

But I still have some doubts; can you help me with a few questions?

From what I understand, the detection network generates proposal regions and then converts them to CLIP image embeddings with this code. But when we use a custom vocabulary, CLIP creates a new text embedding whose shape is inconsistent with the pre-trained model's, which should make zs_weight_dim here differ from the pre-trained model, so there should be an error in nn.Linear. Yet nothing happens when I use a custom vocabulary.

Did I misunderstand something?

xingyizhou commented 2 years ago

Hi,

Thank you for your interest. Here zs_weight_dim is the dimension of the CLIP embedding, which is always 512. The classification layer zs_weight has shape zs_weight_dim x num_classes. When loading a custom vocabulary, we change the number of classes, but not zs_weight_dim. Hope that helps.
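To make the shapes concrete, here is a minimal sketch of how such a classifier behaves (NumPy with random stand-in arrays and illustrative class counts, not the actual Detic code, which is PyTorch): classification is a matrix product between region features projected into the CLIP space and the per-class text embeddings, so swapping the vocabulary only changes the number of output columns, never the embedding dimension.

```python
import numpy as np

D = 512          # zs_weight_dim: CLIP text-embedding size, fixed
num_props = 100  # region proposals in one image

# Region features already projected into the CLIP embedding space: (num_props, D)
region_feats = np.random.randn(num_props, D).astype(np.float32)

def classify(region_feats, zs_weight):
    """Scores = dot product of region features with per-class text embeddings.
    zs_weight has shape (D, num_classes), matching zs_weight_dim x num_classes."""
    return region_feats @ zs_weight

# Large vocabulary (e.g. LVIS-sized): zs_weight is (512, 1203)
zs_weight_lvis = np.random.randn(D, 1203).astype(np.float32)
# Custom vocabulary with 7 classes: zs_weight is (512, 7)
zs_weight_custom = np.random.randn(D, 7).astype(np.float32)

print(classify(region_feats, zs_weight_lvis).shape)    # (100, 1203)
print(classify(region_feats, zs_weight_custom).shape)  # (100, 7)
```

Only the second dimension of zs_weight changes with the vocabulary, so the layers whose weights are loaded from the pre-trained checkpoint (everything up to dimension D) are unaffected, which is why no shape error occurs.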

Best, Xingyi

X00123 commented 2 years ago

Oh I got it. Thanks for your answer!

And I still have a few more questions. Can you help me with the following?

Hope you can solve my doubts when you have free time~

Thanks~

xingyizhou commented 2 years ago

Hi,

Sorry for my delayed reply.

  1. Yes, you can ignore the warning.
  2. If your own training data is small, one possible reason is that training on your data alone can destroy the region proposal network. You can check this by measuring the proposal recall (setting MODEL.META_ARCHITECTURE ProposalNetwork). If this is the case, I would suggest co-training with LVIS. You can refer to this to find out how to set up co-training.
  3. What it does is make the classifier (zs_weight) not C x D (C is the number of classes of the dataset; for LVIS C=1204, for IN-21K C>21000) but K x D (K=50) during training. This makes training on IN-21K efficient. How the K classes are sampled follows FedLoss (search use_fed_loss in the code for details).
  4. Yes, this is to make the initial output small. E.g., focal loss sets bias = -4.6 so that the initial prediction is sigmoid(-4.6)=0.01. I believe I didn't find it useful in Detic.
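Points 3 and 4 can be sketched as follows (a hedged illustration with made-up function names and random stand-in class frequencies, not the actual Detic or FedLoss implementation): per iteration, always keep the classes annotated in the batch and fill up to K classes sampled by dataset frequency; and pick the last-layer bias b so that sigmoid(b) equals a small prior probability.

```python
import numpy as np

def sample_fed_classes(appeared, num_classes, K, freq_weight, rng):
    """FedLoss-style sampling (sketch): always keep the classes annotated in
    the current batch, then fill up to K classes drawn by class frequency."""
    appeared = np.unique(appeared)
    mask = np.ones(num_classes, dtype=bool)
    mask[appeared] = False                   # never re-draw appeared classes
    probs = freq_weight * mask
    probs = probs / probs.sum()
    extra = rng.choice(num_classes, size=K - len(appeared),
                       replace=False, p=probs)
    return np.concatenate([appeared, extra])  # K class indices in total

rng = np.random.default_rng(0)
num_classes, K = 21000, 50
freq = rng.random(num_classes)               # stand-in for class frequencies
sampled = sample_fed_classes(np.array([3, 99, 12345]), num_classes, K, freq, rng)
print(len(sampled))  # 50: the classifier this iteration is K x D, not C x D

# Prior-bias initialization: choose bias b so that sigmoid(b) = pi
pi = 0.01
b = -np.log((1 - pi) / pi)
print(round(b, 2))   # -4.6, matching the focal-loss initialization above
```

The bias line is exact: sigmoid(-log((1-pi)/pi)) = pi, so pi = 0.01 gives b ≈ -4.6.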

Best, Xingyi

Yummy-Lee commented 1 year ago

Hope you can solve my doubts when you have free time~

Thanks~

How did you do the zero-shot training? Could you share the specific process, or what tutorials did you follow? How should I prepare my own dataset, and how should the training code and parameters be adjusted? Thanks.

X00123 commented 1 year ago

How did you do the zero-shot training? Could you share the specific process, or what tutorials did you follow? How should I prepare my own dataset, and how should the training code and parameters be adjusted? Thanks.

zero-shot refers to zero-shot learning, i.e., running inference directly with no training at all. I'm guessing you actually want to train on a small number of images, i.e., few-shot learning?

The key point is that you need to mix your own dataset with the LVIS dataset during training for it to work well.

As for the concrete process: if your categories are the same as the LVIS categories, you can simply train on your dataset mixed with LVIS without any modification. If your label categories differ from LVIS's, you need to generate new CLIP features for the LVIS categories plus your own categories and train with them; for how to generate the features, see the author's documentation: https://github.com/facebookresearch/Detic/blob/main/datasets/README.md
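As a hedged sketch of that last step (not the repo's actual script; random arrays stand in for real CLIP text embeddings, and the output filename is made up): once you have one CLIP text embedding per class name, building the new classifier weight amounts to concatenating the LVIS embeddings with your custom-class embeddings, L2-normalizing each row, and saving the result for the config to point at.

```python
import numpy as np

D = 512  # CLIP text-embedding dimension

# Stand-ins for real CLIP text embeddings of the class names
# (in practice these would come from a CLIP text encoder):
lvis_emb = np.random.randn(1203, D).astype(np.float32)   # LVIS categories
custom_emb = np.random.randn(7, D).astype(np.float32)    # your own categories

# New vocabulary = LVIS classes + custom classes
zs_weight = np.concatenate([lvis_emb, custom_emb], axis=0)
# L2-normalize each row so classification behaves like a cosine similarity
zs_weight /= np.linalg.norm(zs_weight, axis=1, keepdims=True)

np.save("my_custom_zs_weight.npy", zs_weight)  # hypothetical output path
print(zs_weight.shape)  # (1210, 512): num_classes changed, D did not
```

Note again that only the number of rows (classes) changes; the embedding dimension D stays 512, which is why the pre-trained weights still load.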

There aren't really many tutorials; I worked it all out step by step against the author's docs and the detectron2 usage documentation.

As for dataset preparation, the author implemented the code on the detectron2 framework, so you can organize your dataset following the corresponding detectron2 docs. For the training code and parameter configuration, see the config documentation I linked in my question above; you can adjust it to your own situation.

Hope this helps~

FUIGUIMURONG commented 1 year ago


I'd like to ask: if I just want to use the pre-trained Detic for object feature extraction, how should I do that?