Closed: dragen1860 closed this issue 5 months ago
The strong performance of video LLMs depends largely on the underlying image LLM; the large language model itself plays a smaller role. Currently, the best image LLM is LLaVA-1.6. We are now running new experiments based on LLaVA-1.6 and will update the code trained on it soon.
Hi all: Though the paper achieved superior performance with only Vicuna-7B models, I want to explore the potential of stronger LLMs such as Llama 3 or Yi. Can anyone give some tips on how to modify the code to support Llama 3 training with the STLLM datasets? Thank you ...
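Not an official answer, but one step that almost always comes up when swapping a Vicuna backbone for Llama 3 is adapting the conversation template, since Llama-3-Instruct uses special header/end tokens rather than Vicuna's `USER: ... ASSISTANT: ...` layout. Below is a minimal, hypothetical sketch of such a prompt builder (the function name is my own, and you would still need to point the training config at the Llama 3 weights and tokenizer, which typically have different vocab size and pad-token handling):

```python
def build_llama3_prompt(messages):
    """Render a list of {"role", "content"} dicts in the Llama-3-Instruct
    chat format, replacing Vicuna's 'USER: ... ASSISTANT: ...' template."""
    parts = ["<|begin_of_text|>"]
    for m in messages:
        # Each turn is wrapped in header tokens and terminated by <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    # Open an assistant header so the model generates the reply next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = build_llama3_prompt([{"role": "user", "content": "Describe the video."}])
print(prompt)
```

In practice you would wire this into wherever the repo builds Vicuna prompts, and verify that the image/video tokens are inserted at the same position in the new template.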