ChenYi99 / EgoPlan

BSD 3-Clause "New" or "Revised" License

Discussion for closed-sourced VLM baselines #6

Closed yusuke-intern closed 2 months ago

yusuke-intern commented 2 months ago

Hello,

Regarding GPT-4V as a baseline in the EgoPlan paper: unfortunately, the details of how it was implemented are not readily available. Even if the conclusion is that a fine-tuned VLM is better than closed-source VLMs, I think it would be beneficial to openly discuss baselines built on API-based VLMs, without any fine-tuning.

In data science competitions such as Kaggle, many baseline implementations are shared publicly while the competition is running, and they serve as valuable references. I think it would also be valuable to assess other closed-source VLMs such as Gemini and Claude. In particular, Gemini can now understand video, so we should have a baseline that uses it. Also, some closed-source LLMs can be fine-tuned.

I'm currently investigating the performance of open-source VLMs with simple prompt engineering. If there is significant demand, I would be happy to make my personal scripts public so that we can have a useful discussion.
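To make the discussion concrete, here is a rough sketch of what a prompt-only baseline harness could look like. It is not the paper's GPT-4V implementation and not the personal scripts mentioned above. It assumes multiple-choice samples with hypothetical fields `task_goal`, `options`, `frames`, and `answer`, and it takes the model as a hypothetical `ask_model(prompt, frames)` callable, so any API-based or local VLM could be plugged in behind it.

```python
import re
from typing import Callable, Optional, Sequence

LETTERS = "ABCDEFGH"

def build_prompt(task_goal: str, options: Sequence[str]) -> str:
    """Format one multiple-choice question as plain text (assumed format)."""
    lines = [
        f"Task goal: {task_goal}",
        "Given the video frames showing the progress so far, "
        "which action should come next?",
    ]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def parse_choice(reply: str, n_options: int) -> Optional[str]:
    """Return the first standalone capital option letter in the reply, if any.

    This is deliberately naive: a reply that starts with the article "A"
    would be read as option A.
    """
    valid = LETTERS[:n_options]
    match = re.search(rf"\b([{valid}])\b", reply)
    return match.group(1) if match else None

def evaluate(
    samples: Sequence[dict],
    ask_model: Callable[[str, Sequence[str]], str],
) -> float:
    """Accuracy of ask_model over samples; unparseable replies count as wrong."""
    correct = 0
    for s in samples:
        prompt = build_prompt(s["task_goal"], s["options"])
        reply = ask_model(prompt, s["frames"])  # hypothetical model wrapper
        if parse_choice(reply, len(s["options"])) == s["answer"]:
            correct += 1
    return correct / len(samples) if samples else 0.0

# Usage with a stand-in model that always answers "A" (made-up sample data).
demo_samples = [
    {"task_goal": "make tea",
     "options": ["boil water", "wash cup", "add sugar", "pour tea"],
     "frames": [], "answer": "A"},
    {"task_goal": "make tea",
     "options": ["boil water", "wash cup", "add sugar", "pour tea"],
     "frames": [], "answer": "C"},
]
demo_accuracy = evaluate(demo_samples, lambda prompt, frames: "A")
```

Keeping the model behind a single callable means the same prompt and parsing logic could be reused across GPT-4V, Gemini, Claude, or an open-source VLM, which would make the baselines directly comparable.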