hhaAndroid opened 11 months ago
Thanks a lot for your kind acknowledgment of our work and for sharing your thoughts!
Regarding your questions/comments:
We adopt a simple strategy to adapt the CLIP-ViT to high-resolution images in a sliding-window manner, with a global positional embedding to compensate for the missing location information. This is an efficient and effective solution, especially considering that the CLIP-ViT is frozen. I believe there could be better solutions, especially if we could train/fine-tune the CLIP-ViT at high resolution, which, however, typically requires large-scale datasets and computation.
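To make the sliding-window idea concrete, here is a minimal numpy sketch of the scheme described above: a frozen ViT encodes fixed-size windows of a high-resolution image independently, and a global positional embedding (one vector per window position, learned in the real model) is added back so the window features retain location information. All names and shapes here are illustrative stand-ins, not taken from the actual code base.

```python
import numpy as np

def sliding_windows(image, window=224):
    """Split a (H, W, C) image into non-overlapping (window, window, C) tiles."""
    h, w, _ = image.shape
    tiles, coords = [], []
    for top in range(0, h - window + 1, window):
        for left in range(0, w - window + 1, window):
            tiles.append(image[top:top + window, left:left + window])
            coords.append((top // window, left // window))
    return np.stack(tiles), coords

def add_global_pos_embed(feats, coords, pos_embed):
    """feats: (num_windows, num_tokens, dim); pos_embed: (grid_h, grid_w, dim).
    Adds each window's position embedding, broadcast over that window's tokens."""
    pos = np.stack([pos_embed[r, c] for r, c in coords])  # (num_windows, dim)
    return feats + pos[:, None, :]                        # (num_windows, tokens, dim)

# Toy usage: a 448x448 image gives a 2x2 grid of 224px windows.
image = np.random.rand(448, 448, 3)
tiles, coords = sliding_windows(image)
feats = np.random.rand(len(tiles), 49, 768)   # stand-in for frozen ViT outputs
pos_embed = np.zeros((2, 2, 768))             # would be a learned parameter in practice
out = add_global_pos_embed(feats, coords, pos_embed)
```

Since the ViT only ever sees tiles at its pretraining resolution, this avoids any interpolation of the frozen model's internal positional embeddings.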
Similar to point 1, fine-tuning (or LoRA fine-tuning) the LLM could require more data or computation, while this project aims at a preliminary exploration of open-ended visual recognition. That said, it is expected to give a performance boost.
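For readers unfamiliar with why LoRA is the cheaper option mentioned here, this is a minimal numpy sketch of the LoRA idea (not the project's code): the frozen weight W stays fixed, and only two small low-rank factors A and B are trained, so the effective weight is W + (alpha / r) * B @ A.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 128, 8, 16

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, initialized small
B = np.zeros((d_out, r))                    # trainable, initialized to zero

def lora_forward(x):
    # With B initialized to zero, the adapted layer starts exactly equal to
    # the frozen layer, then drifts only as B is trained.
    return x @ (W + (alpha / r) * B @ A).T

x = rng.standard_normal((4, d_in))
y = lora_forward(x)
```

The appeal is the parameter count: here the trainable factors hold r * (d_in + d_out) = 1536 values versus 8192 in W, and the ratio shrinks further at LLM scale.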
Regarding 3 and 4: yes, we also find that existing datasets are somewhat limited, though we have adopted most existing segmentation datasets, especially those with large vocabularies, to ensure diversity in the training targets. Besides, we have tried our best to evaluate the model both in our own way (the reported ACC and NIV) and on well-established benchmarks (closed-vocabulary and open-vocabulary). We also considered using text similarity or the CLIP score as an open-ended measurement, yet found that text alone is not reliable without image context and that the CLIP score is not discriminative. A better benchmark designed specifically for open-ended recognition would definitely be much more helpful.
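For context on the CLIP-score option discussed above, a sketch of how such a measurement would be computed, with random vectors standing in for real CLIP image/text embeddings (the names here are illustrative, not from the paper). The score is the cosine similarity between normalized embeddings; per the comment above, in practice these scores tend to cluster closely across candidate labels, which is why they are hard to use as a pass/fail metric.

```python
import numpy as np

def clip_score(image_emb, text_emb):
    """Cosine similarity between an image embedding and a text embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(image_emb @ text_emb)

# Toy usage with placeholder 512-d embeddings.
rng = np.random.default_rng(0)
image_emb = rng.standard_normal(512)
candidates = {name: rng.standard_normal(512) for name in ("cat", "dog", "sofa")}
scores = {name: clip_score(image_emb, emb) for name, emb in candidates.items()}
```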
We explored predicting multiple masks in a single forward pass in the early stages of this project. We found that the model has difficulty predicting multiple masks in order, and tends to repeat itself when there are too many masks (potentially because the LLM is frozen and thus more prone to repetition). Therefore we adopted the single-mask design in the end. It may actually not be as slow as expected: the image features can be cached and shared (already implemented in our code base), and the LLM part has a similar computational cost, since the sequence lengths of predicting all masks together and one by one are similar (though this would require applying the KV cache to the instructions as well, which is not implemented yet). So the truly unavoidable cost comes from the multiple forward passes of the MaskQFormer, which may not be that large compared to the other components. Still, we totally agree that the inference speed can be further improved, whether at the design level or the implementation level.
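The caching argument above can be sketched as follows. This is a hypothetical toy, not the real pipeline: `encode_image` and `mask_head` stand in for the expensive shared image encoding and the lightweight per-mask computation (MaskQFormer plus LLM decoding in the actual model). The point is simply that the expensive part runs once, however many masks there are.

```python
import numpy as np

CALLS = {"encode_image": 0}

def encode_image(image):
    """Expensive shared step: should run exactly once per image."""
    CALLS["encode_image"] += 1
    return image.mean(axis=(0, 1))  # toy "feature"

def mask_head(image_feats, mask):
    """Cheap per-mask step, operating on the shared image features."""
    return float(image_feats.sum() * mask.sum())

def recognize_masks(image, masks):
    feats = encode_image(image)  # computed once, reused for every mask
    return [mask_head(feats, m) for m in masks]

image = np.ones((32, 32, 3))
masks = [np.ones((32, 32)) * k for k in (1, 2, 3)]
scores = recognize_masks(image, masks)
```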
Thanks again for your comments, and please let us know if you have any questions :)
Thank you very much for your patient response; I think you did a great job. It would be even better to add a zero-shot evaluation, covering both category names and REC phrases, to highlight the strengths of the LLM and the algorithm's generalization. The NIV metric in the paper does indeed have real reference value.
Thanks to the authors, and for open-sourcing the code as well.
First, this paper addresses two main problems:
Compared with the current trend of simply flattening multimodal inputs into a sequence and feeding it to an LLM for autoregressive decoding, the approach in this paper is clearly preferable (this is not to say that pure MLLMs can never work, only that many problems remain at this stage).
LLMs are naturally suited to recognition tasks; that is their strength. But without high-resolution input, region understanding or feature extraction may not be, so the two sides need to complement each other. In real applications, quite a few people already pass CV recognition results through an MLLM for concept correction, which follows the same idea, except that this paper turns it into an end-to-end solution.
Compared with today's mainstream open-vocabulary detection (OVD), I think this is a very good direction; conceptually, at least, it clearly beats OVD, because OVD has to build a vocabulary, which inherently limits its range of applications (though it still has its use cases). The algorithm in this paper can do both OVD and vocabulary-free concept recognition, and that is the part of the idea I like most.
As for performance, it does not seem to open up a gap over OVD; there may be a few reasons:
I think the third and fourth points are the most important, especially in this era of large models and large data: with enough data, differences in model architecture alone should not matter much.
Of course, the current approach runs the LLM once for every mask, so the inference cost is quite high.
The above is just my personal take; I have not read the code in detail yet and will update once I have a better understanding. Everyone is welcome to discuss here!
Thanks again to the authors.