bdaiinstitute / theia

Theia: Distilling Diverse Vision Foundation Models for Robot Learning
http://theia.theaiinstitute.com

The performance of Theia in contrast to individual VFMs #9

Open ThomasCai35 opened 2 weeks ago

ThomasCai35 commented 2 weeks ago

Thank you very much for your excellent work! We ran the model using the demo and found that Theia's feature extraction visualizations were not as good as those of the individual VFMs. We wondered why this is the case. After reading the paper, we had expected that the Theia model would outperform all of the individual VFMs in this respect. I wonder if we have misunderstood the paper.

elicassion commented 2 weeks ago

We hope you have enjoyed playing with the model! I believe this is a misunderstanding of our paper, but I think there are important and interesting open problems behind it. I appreciate you bringing up this discussion, and I'd like to share some thoughts.

Theia - for Robot Learning

We built Theia to improve visual representations for robot learning, with the hypothesis that distilling diverse visual understanding concepts would enrich the representation and be useful for robot tasks, and our experiments confirm this. Being stronger than the teacher models on their original tasks is exciting, but it is not the main goal of Theia at the moment.
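To make "predicted representations" concrete, here is a minimal sketch of what a multi-teacher distillation objective can look like. It is illustrative only: `student`, `translators`, and the exact loss terms and weights are assumptions, not the code in this repo.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, translators, images, teacher_features):
    """Illustrative objective: the student encodes each image once, and
    per-teacher translator heads predict that teacher's features."""
    z = student(images)  # shared "Theia representation", e.g. [B, N, D]
    loss = torch.tensor(0.0, device=z.device)
    for name, target in teacher_features.items():
        pred = translators[name](z)  # predicted teacher representation
        # Feature-matching terms; the combination and weighting here are assumptions.
        cos = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
        l1 = F.smooth_l1_loss(pred, target)
        loss = loss + cos + l1
    return loss
```

The key design point this sketch tries to convey is that a single shared representation is trained to match several teachers at once, rather than one teacher per backbone.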

Visualization

The visualization uses the teachers' original decoder heads, except for DINOv2. Since the predicted representations are not identical to the original teacher features, the decoder heads cannot process them perfectly. That is to say, the learning is not perfect, so some loss of quality in the decoded visualizations is expected.
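As a rough sketch of that pipeline (hypothetical names, not this repo's actual API): each predicted teacher representation is routed through the corresponding teacher's original decoder head, while DINOv2-style features, which have no task decoder, can instead be rendered with a PCA pseudo-color projection as a common fallback.

```python
from sklearn.decomposition import PCA

def visualize_predicted_features(predicted, teacher_decoders):
    """Illustrative only: `predicted` maps teacher name -> [num_tokens, dim]
    feature array for one image; `teacher_decoders` holds the teachers'
    original decoder heads (e.g. a mask or depth decoder)."""
    outputs = {}
    for name, feat in predicted.items():
        if name == "dinov2":
            # No task decoder: project onto 3 principal components as RGB.
            rgb = PCA(n_components=3).fit_transform(feat)
            rgb = (rgb - rgb.min(0)) / (rgb.max(0) - rgb.min(0) + 1e-8)
            outputs[name] = rgb
        else:
            # Decode the predicted features with the teacher's own head.
            outputs[name] = teacher_decoders[name](feat)
    return outputs
```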

Application to Vision Tasks

Theia can definitely be used for vision tasks, by using either the Theia representation or the predicted representations (see the sketch after this list). But we don't claim that the current Theia outperforms all teacher models on all vision tasks. There are several possible reasons:

Theia itself is not supervised on any vision task.

In my opinion, supervised models for vision tasks may "overfit" to the specific task to get superior performance, but lose the general visual understanding capability a bit. I personally feel that's why DINOv2 is the top on robot learning performance among all foundation models tested.

Last but not least, the ultimate goal is to have a unified model that serves both robot learning and vision tasks and carries the fundamental visual understanding capabilities. Again, I find this a great challenge and an interesting opportunity for the community.
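Here is a minimal sketch of the two options mentioned above (Theia representation vs. a predicted teacher representation) for a downstream vision task. The `encode`/`predict` methods and the `"clip"` teacher name are hypothetical stand-ins; the actual loading and forward interfaces in this repo may differ.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Minimal frozen-backbone probe for a downstream vision task (illustrative)."""
    def __init__(self, backbone, feat_dim, num_classes, predicted_teacher=None):
        super().__init__()
        self.backbone = backbone                    # assumption: exposes encode()/predict()
        self.predicted_teacher = predicted_teacher  # e.g. "clip" to probe a predicted representation
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, images):
        with torch.no_grad():
            if self.predicted_teacher is None:
                z = self.backbone.encode(images)                           # Theia representation
            else:
                z = self.backbone.predict(images, self.predicted_teacher)  # predicted teacher representation
        return self.head(z.mean(dim=1))             # average-pool tokens, then classify
```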


ThomasCai35 commented 2 weeks ago

Thanks for your reply. It helped a lot! But we still have some questions about your latest reply. From Figure 1 (right) of your paper, I concluded that Theia's performance on robot tasks is better than that of all the other models you listed, while we found that Theia does not outperform some individual models on vision tasks.

Here is my first problem. You mentioned that diverse visual understanding concepts would enrich the representation and be useful for robot tasks. I wondered whether it is the "overfit" that makes the DINOv2 model perform better on vision tasks, and likewise whether it is this "overfit" that makes DINOv2 perform worse in terms of visual understanding concepts, which in turn leads to DINOv2 performing worse than the Theia model on robot tasks.

Here is my second problem. You mentioned that "supervised models for vision tasks may 'overfit' to the specific task to get superior performance, but lose the general visual understanding capability a bit. I personally feel that's why DINOv2 is the top on robot learning performance among all foundation models tested." I got a little confused about the link between those two sentences. Here is my understanding: self-supervision leads to possible overfitting; overfitting leads to better vision task performance but worse visual understanding; and worse understanding leads to worse robot task performance. Is that right?

Here is my third problem. You mentioned that the Theia model is not supervised on any vision task. Is that the main reason it performs worse on vision tasks than the individual models?

Thanks for your reply.

elicassion commented 2 weeks ago

ThomasCai35 commented 2 weeks ago

Thanks for your timely reply. Your answer resolved my confusion, and it gave me some new ideas and insights. I wish you good luck with your future work.