Show-han / Zeroshot_REC

Official code for Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions (CVPR 2024)
Apache License 2.0

Hello, your work calls itself "zero-shot" but it requires training, which is completely inconsistent with the ReCLIP setting. How do you explain this? Did no reviewer question this during review? #6

Open linhuixiao opened 2 months ago

linhuixiao commented 2 months ago

Hello, your work calls itself "zero-shot" but it requires training, which is completely inconsistent with the ReCLIP setting. How do you explain this? Did no reviewer question this during review? Moreover, the paper does not clearly explain the training procedure or the training data, and deliberately relegates them to the supplementary material.

Show-han commented 2 months ago

First, even without fine-tuning CLIP, our model already outperforms ReCLIP on most tasks. Second, "zero-shot" does not mean that no training is involved at all. The training described in this paper is intended to enhance CLIP's understanding of visual relationships, and we have excluded all in-distribution data samples from MSCOCO, as elaborated in Section 3.3.
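
As a rough illustration of what such an exclusion could look like in practice, here is a minimal sketch (not the authors' code): it filters a COCO-based training pool so that no image overlapping the RefCOCO/RefCOCO+/RefCOCOg benchmarks remains. The file names and the `image_id` annotation layout are assumptions made purely for illustration.

```python
import json

def load_image_ids(annotation_path: str) -> set[int]:
    """Collect the COCO image ids referenced by a (hypothetical) RefCOCO-style JSON file."""
    with open(annotation_path) as f:
        annotations = json.load(f)
    return {ann["image_id"] for ann in annotations}

# Union of image ids used by any referring-expression benchmark we evaluate on
# (hypothetical file names; the real splits may be stored differently).
excluded_ids: set[int] = set()
for path in ["refcoco_all.json", "refcoco+_all.json", "refcocog_all.json"]:
    excluded_ids |= load_image_ids(path)

# Keep only training samples whose image never appears in those benchmarks.
with open("mscoco_training_pool.json") as f:  # hypothetical MSCOCO-based training pool
    train_pool = json.load(f)
in_distribution_free = [s for s in train_pool if s["image_id"] not in excluded_ids]
print(f"kept {len(in_distribution_free)} / {len(train_pool)} samples")
```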

linhuixiao commented 1 month ago


I'm still not convinced by your setting.

  1. If ReCLIP is the baseline being compared against, this paper should emphasize the training-free results rather than the results obtained with additional training.

  2. The more standard definition of zero-shot with training comes from ZSGNet (Zero-Shot Grounding of Objects from Natural Language Queries, ICCV '19), and the setting described in this paper also differs from that one. If additional MSCOCO data is used for training, this paper to some extent resembles a weakly supervised setting rather than a zero-shot setting.

  3. ReCLIP can be considered the originator of this training-free zero-shot setting. If "training may use any data other than RefCOCO" (or, put differently, "excluding all in-distribution data samples in MSCOCO") is taken as the training rule in the zero-shot scenario, then could any other dataset be used as well (such as COCO Captions, CC3M, CC12M, SBU, LAION, etc.)? After all, they all satisfy the condition of "excluding all in-distribution data samples in MSCOCO".

Arbitrarily changing the experimental setting like this easily leads to unfair comparisons and breaks the rules of the game for follow-up work. In my opinion, such a definition sets a bad precedent for the fairness of future experiments and may hinder the standardization of visual grounding research.