baaivision / EVA

EVA Series: Visual Representation Fantasies from BAAI
MIT License

What is the text input capacity of eva-clip-18b? To my knowledge, OpenAI's CLIP has an effective input of fewer than 20 tokens #165

Open gg22mm opened 3 months ago

gg22mm commented 3 months ago

I would like to know the text input capacity of eva-clip-18b. To my knowledge, OpenAI's CLIP has two major shortcomings:

  1. The text input capacity is very limited: it supports at most 77 tokens, and according to LongCLIP's experiments, its effective input does not exceed 20 tokens.
  2. Poor performance on pure-text retrieval. There are two main reasons: first, CLIP's training objective is to align text with images, with no specialized optimization for text-only retrieval; second, CLIP's training data consists mostly of relatively short texts, so it generalizes poorly to broader text-retrieval scenarios. Does eva-clip-18b have the same restrictions as OpenAI's CLIP when used for text retrieval?
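To make point 1 concrete, here is a minimal sketch of why the 77-token cap is a hard limit for CLIP-style text towers: the positional-embedding table is a fixed size, so every input is truncated/padded to that length before encoding. This is an illustrative re-implementation, not the actual EVA-CLIP code; the SOT/EOT token ids (49406/49407) are OpenAI CLIP's tokenizer values, and whether eva-clip-18b uses the same setup is exactly the question being asked.

```python
# Sketch of CLIP-style text preprocessing: the text encoder has a fixed
# positional-embedding table (77 entries for OpenAI CLIP), so token
# sequences are hard-truncated to fit, regardless of the text's length.
CONTEXT_LENGTH = 77   # OpenAI CLIP default
SOT, EOT, PAD = 49406, 49407, 0  # OpenAI CLIP tokenizer ids (assumed here)

def truncate_to_context(token_ids, context_length=CONTEXT_LENGTH):
    # Reserve two slots for the start/end-of-text markers, drop the rest.
    body = token_ids[: context_length - 2]
    seq = [SOT] + body + [EOT]
    # Pad short sequences up to the fixed context length.
    return seq + [PAD] * (context_length - len(seq))

# A 199-token caption: everything past token 75 is silently discarded.
long_ids = list(range(1, 200))
out = truncate_to_context(long_ids)
print(len(out))   # 77
print(out[-1])    # 49407 (EOT fills the last slot; no room left to pad)
```

Note that truncation is silent: a long query and its first-75-token prefix encode identically, which is one reason long-text retrieval degrades even before LongCLIP's observed ~20-token effective limit.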
