hhaAndroid opened 11 months ago
Thanks a lot for your kind acknowledgment of our work and for sharing your thoughts!
Regarding your questions/comments:
We adopt a simple strategy to adapt the CLIP-ViT to high-resolution images in a sliding-window manner, with a global positional embedding to compensate for the missing location information. This is an efficient and effective solution, especially considering that the CLIP-ViT is frozen. I believe there could be better solutions, especially if we could train/fine-tune the CLIP-ViT at high resolution, which, however, typically requires large-scale datasets and computation.
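To make the sliding-window idea concrete, here is a minimal numpy sketch of the scheme described above: a frozen ViT encodes fixed-size windows of a high-resolution image independently, and a global positional embedding (one vector per window position, learned in the real model) is added back so the window features retain location information. All names and shapes here are illustrative stand-ins, not taken from the actual code base.

```python
import numpy as np

def sliding_windows(image, window=224):
    """Split a (H, W, C) image into non-overlapping (window, window, C) tiles."""
    h, w, _ = image.shape
    tiles, coords = [], []
    for top in range(0, h - window + 1, window):
        for left in range(0, w - window + 1, window):
            tiles.append(image[top:top + window, left:left + window])
            coords.append((top // window, left // window))
    return np.stack(tiles), coords

def add_global_pos_embed(feats, coords, pos_embed):
    """feats: (num_windows, num_tokens, dim); pos_embed: (grid_h, grid_w, dim).
    Adds each window's position embedding, broadcast over that window's tokens."""
    pos = np.stack([pos_embed[r, c] for r, c in coords])  # (num_windows, dim)
    return feats + pos[:, None, :]                        # (num_windows, tokens, dim)

# Toy usage: a 448x448 image gives a 2x2 grid of 224px windows.
image = np.random.rand(448, 448, 3)
tiles, coords = sliding_windows(image)
feats = np.random.rand(len(tiles), 49, 768)   # stand-in for frozen ViT outputs
pos_embed = np.zeros((2, 2, 768))             # would be a learned parameter in practice
out = add_global_pos_embed(feats, coords, pos_embed)
```

Since the ViT only ever sees tiles at its pretraining resolution, this avoids any interpolation of the frozen model's internal positional embeddings.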
Similar to point 1, fine-tuning (or LoRA fine-tuning) the LLM could require more data or computation, while this project aims at a preliminary exploration of open-ended visual recognition. That said, it is expected to give a performance boost.
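For readers unfamiliar with why LoRA is the cheaper option mentioned here, this is a minimal numpy sketch of the LoRA idea (not the project's code): the frozen weight W stays fixed, and only two small low-rank factors A and B are trained, so the effective weight is W + (alpha / r) * B @ A.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 128, 8, 16

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, initialized small
B = np.zeros((d_out, r))                    # trainable, initialized to zero

def lora_forward(x):
    # With B initialized to zero, the adapted layer starts exactly equal to
    # the frozen layer, then drifts only as B is trained.
    return x @ (W + (alpha / r) * B @ A).T

x = rng.standard_normal((4, d_in))
y = lora_forward(x)
```

The appeal is the parameter count: here the trainable factors hold r * (d_in + d_out) = 1536 values versus 8192 in W, and the ratio shrinks further at LLM scale.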
Regarding 3 and 4: yes, we also find that existing datasets are somewhat limited, though we have adopted most existing segmentation datasets, especially those with large vocabularies, to ensure diversity in the training targets. Besides, we have tried our best to evaluate the model both in our own way (the reported ACC and NIV) and on well-established benchmarks (closed-vocabulary and open-vocabulary). We also considered using text similarity or the CLIP score as an open-ended measurement, yet found that text alone is not reliable without image context and that the CLIP score is not discriminative. A better benchmark designed specifically for open-ended recognition would definitely be much more helpful.
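For context on the CLIP-score option discussed above, a sketch of how such a measurement would be computed, with random vectors standing in for real CLIP image/text embeddings (the names here are illustrative, not from the paper). The score is the cosine similarity between normalized embeddings; per the comment above, in practice these scores tend to cluster closely across candidate labels, which is why they are hard to use as a pass/fail metric.

```python
import numpy as np

def clip_score(image_emb, text_emb):
    """Cosine similarity between an image embedding and a text embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(image_emb @ text_emb)

# Toy usage with placeholder 512-d embeddings.
rng = np.random.default_rng(0)
image_emb = rng.standard_normal(512)
candidates = {name: rng.standard_normal(512) for name in ("cat", "dog", "sofa")}
scores = {name: clip_score(image_emb, emb) for name, emb in candidates.items()}
```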
We explored predicting multiple masks in a single forward pass in the early stages of this project. We found that the model has difficulty predicting multiple masks in order, and tends to repeat itself when there are too many masks (potentially because the LLM is frozen and thus more prone to repetition). Therefore we adopted the single-mask design in the end. It may actually not be as slow as expected: the image features can be cached and shared (already implemented in our code base), and the LLM part has a similar computational cost, since the sequence lengths of predicting all masks together and one by one are similar (though this would require applying the KV cache to the instructions as well, which is not implemented yet). So the truly unavoidable cost comes from the multiple forward passes of the MaskQFormer, which may not be that large compared to the other components. Still, we totally agree that the inference speed can be further improved, whether at the design level or the implementation level.
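The caching argument above can be sketched as follows. This is a hypothetical toy, not the real pipeline: `encode_image` and `mask_head` stand in for the expensive shared image encoding and the lightweight per-mask computation (MaskQFormer plus LLM decoding in the actual model). The point is simply that the expensive part runs once, however many masks there are.

```python
import numpy as np

CALLS = {"encode_image": 0}

def encode_image(image):
    """Expensive shared step: should run exactly once per image."""
    CALLS["encode_image"] += 1
    return image.mean(axis=(0, 1))  # toy "feature"

def mask_head(image_feats, mask):
    """Cheap per-mask step, operating on the shared image features."""
    return float(image_feats.sum() * mask.sum())

def recognize_masks(image, masks):
    feats = encode_image(image)  # computed once, reused for every mask
    return [mask_head(feats, m) for m in masks]

image = np.ones((32, 32, 3))
masks = [np.ones((32, 32)) * k for k in (1, 2, 3)]
scores = recognize_masks(image, masks)
```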
Thanks again for your comments, and please let us know if you have any questions :)
Thank you very much for your patient response; I think you did a great job. It would be even better to add a zero-shot evaluation, covering both category names and REC phrases, to highlight the strengths of the LLM and the algorithm's generalization. The NIV metric in the paper does indeed have real reference value.
Thanks to the authors, and for open-sourcing the code as well.
First, this paper addresses two main problems:
Compared with the current trend of simply flattening multimodal inputs into a sequence and feeding it to an LLM for autoregressive decoding, the approach in this paper is clearly preferable (this is not to say that pure MLLMs can never work, only that many problems remain at this stage).
LLMs are naturally suited to recognition tasks; that is their strength. But without high-resolution input, region understanding or feature extraction may not be, so the two sides need to complement each other. In real applications, quite a few people already pass CV recognition results through an MLLM for concept correction, which follows the same idea, except that this paper turns it into an end-to-end solution.
Compared with today's mainstream open-vocabulary detection (OVD), I think this is a very good direction; conceptually, at least, it clearly beats OVD, because OVD has to build a vocabulary, which inherently limits its range of applications (though it still has its use cases). The algorithm in this paper can do both OVD and vocabulary-free concept recognition, and that is the part of the idea I like most.
As for performance, it does not seem to open up a gap over OVD; there may be a few reasons:
I think the third and fourth points are the most important, especially in this era of large models and large data: with enough data, differences in model architecture alone should not matter much.
Of course, the current approach runs the LLM once for every mask, so the inference cost is quite high.
The above is just my personal take; I have not read the code in detail yet and will update once I have a better understanding. Everyone is welcome to discuss here!
Thanks again to the authors.