bdaiinstitute / theia

Theia: Distilling Diverse Vision Foundation Models for Robot Learning
http://theia.theaiinstitute.com

The performance of Theia in contrast to individual VFMs #9

Open ThomasCai35 opened 2 weeks ago

ThomasCai35 commented 2 weeks ago

Thank you very much for your excellent work! We ran the model using the demo and found that Theia's feature extraction visualizations were not as good as those of the individual VFMs. We wondered why this is the case. After reading the paper, we had expected that the Theia model would outperform all of the individual VFMs in this respect. I wonder if we have misunderstood the paper.

elicassion commented 2 weeks ago

We hope you have enjoyed playing with the model! I believe this is a misunderstanding of our paper, but I think there are important and interesting open problems behind it. I appreciate you bringing up this discussion, and I'd like to share some thoughts.

Theia - for Robot Learning

We built Theia to improve visual representations for robot learning, with the hypothesis that distilling diverse visual understanding concepts would enrich the representation and be useful for robot tasks, and our experiments confirm this. Being stronger than the teacher models on their original tasks is exciting, but it is not the main goal of Theia at the moment.
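To make "predicted representations" concrete, here is a minimal sketch of what a multi-teacher distillation objective can look like. It is illustrative only: `student`, `translators`, and the exact loss terms and weights are assumptions, not the code in this repo.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, translators, images, teacher_features):
    """Illustrative objective: the student encodes each image once, and
    per-teacher translator heads predict that teacher's features."""
    z = student(images)  # shared "Theia representation", e.g. [B, N, D]
    loss = torch.tensor(0.0, device=z.device)
    for name, target in teacher_features.items():
        pred = translators[name](z)  # predicted teacher representation
        # Feature-matching terms; the combination and weighting here are assumptions.
        cos = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
        l1 = F.smooth_l1_loss(pred, target)
        loss = loss + cos + l1
    return loss
```

The key design point this sketch tries to convey is that a single shared representation is trained to match several teachers at once, rather than one teacher per backbone.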

Visualization

The visualization uses the teachers' original decoder heads, except for DINOv2. Since the predicted representations are not identical to the original teacher features, the decoder heads cannot process them perfectly. That is to say, the learning is not perfect, so some loss of quality in the decoded visualizations is expected.
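As a rough sketch of that pipeline (hypothetical names, not this repo's actual API): each predicted teacher representation is routed through the corresponding teacher's original decoder head, while DINOv2-style features, which have no task decoder, can instead be rendered with a PCA pseudo-color projection as a common fallback.

```python
from sklearn.decomposition import PCA

def visualize_predicted_features(predicted, teacher_decoders):
    """Illustrative only: `predicted` maps teacher name -> [num_tokens, dim]
    feature array for one image; `teacher_decoders` holds the teachers'
    original decoder heads (e.g. a mask or depth decoder)."""
    outputs = {}
    for name, feat in predicted.items():
        if name == "dinov2":
            # No task decoder: project onto 3 principal components as RGB.
            rgb = PCA(n_components=3).fit_transform(feat)
            rgb = (rgb - rgb.min(0)) / (rgb.max(0) - rgb.min(0) + 1e-8)
            outputs[name] = rgb
        else:
            # Decode the predicted features with the teacher's own head.
            outputs[name] = teacher_decoders[name](feat)
    return outputs
```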

Application to Vision Tasks

Theia can definitely be used for vision tasks, by using either the Theia representation or the predicted representations (see the sketch after this list). But we don't claim that the current Theia outperforms all teacher models on all vision tasks. There are several possible reasons:

Theia itself is not supervised on any vision task.

In my opinion, supervised models for vision tasks may "overfit" to the specific task to get superior performance, but lose the general visual understanding capability a bit. I personally feel that's why DINOv2 is the top on robot learning performance among all foundation models tested.

Last but not least, the ultimate goal is to have a unified model that serves both robot learning and vision tasks and carries the fundamental visual understanding capabilities. Again, I find this a great challenge and an interesting opportunity for the community.
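Here is a minimal sketch of the two options mentioned above (Theia representation vs. a predicted teacher representation) for a downstream vision task. The `encode`/`predict` methods and the `"clip"` teacher name are hypothetical stand-ins; the actual loading and forward interfaces in this repo may differ.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Minimal frozen-backbone probe for a downstream vision task (illustrative)."""
    def __init__(self, backbone, feat_dim, num_classes, predicted_teacher=None):
        super().__init__()
        self.backbone = backbone                    # assumption: exposes encode()/predict()
        self.predicted_teacher = predicted_teacher  # e.g. "clip" to probe a predicted representation
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, images):
        with torch.no_grad():
            if self.predicted_teacher is None:
                z = self.backbone.encode(images)                           # Theia representation
            else:
                z = self.backbone.predict(images, self.predicted_teacher)  # predicted teacher representation
        return self.head(z.mean(dim=1))             # average-pool tokens, then classify
```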


ThomasCai35 commented 2 weeks ago

Thanks for your reply. It helped a lot! But we still have some questions about your latest reply. From Figure 1 (right) of your paper, I concluded that Theia's performance on robot tasks is better than that of all the other models you listed, while we found that Theia does not outperform some individual models on vision tasks.

Here is my first problem. You mentioned that diverse visual understanding concepts would enrich the representation and be useful for robot tasks. I wondered whether it is the "overfit" that makes the DINOv2 model perform better on vision tasks, and likewise whether it is this "overfit" that makes DINOv2 perform worse in terms of visual understanding concepts, which in turn leads to DINOv2 performing worse than the Theia model on robot tasks.

Here is my second problem. You mentioned that "supervised models for vision tasks may 'overfit' to the specific task to get superior performance, but lose the general visual understanding capability a bit. I personally feel that's why DINOv2 is the top on robot learning performance among all foundation models tested." I got a little confused about the link between those two sentences. Here is my understanding: self-supervision leads to possible overfitting; overfitting leads to better vision task performance but worse visual understanding; and worse understanding leads to worse robot task performance. Is that right?

Here is my third problem. You mentioned that the Theia model is not supervised on any vision task. Is that the main reason it performs worse on vision tasks than the individual models?

Thanks for your reply.

elicassion commented 2 weeks ago

ThomasCai35 commented 2 weeks ago

Thanks for your timely reply. Your answer resolved my confusion, and it gave me some new ideas and insights. I wish you good luck with your future work.