Open zoldaten opened 3 days ago
I saw that https://huggingface.co/Vision-CAIR/LongVU_Llama3_2_1B exists. Is it the image or the video model? Could it be combined with LongVU_Llama3_2_3B (image or video)? And what are the hardware requirements?
LongVU_Llama3_2_1B is the video LLM with the Llama3_2_1B language backbone. Similarly, LongVU_Llama3_2_3B uses the 3B language backbone, Llama3_2_3B. What do you mean by combining both models?
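On the hardware question, a rough, unofficial starting point is to estimate how much memory the language-backbone weights alone would need. The sketch below assumes nominal parameter counts of 1e9 and 3e9 (taken from the "1B" and "3B" in the model names, not from measured checkpoints) and 2 bytes per parameter (fp16/bf16). It ignores the vision encoder, activations, and the cache that grows with video length, so real usage will be higher. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights only, in GiB.

    bytes_per_param=2 assumes fp16/bf16 weights; use 4 for fp32.
    """
    return num_params * bytes_per_param / 1024**3


# Nominal parameter counts inferred from the model names (an assumption).
backbones = {
    "LongVU_Llama3_2_1B": 1e9,
    "LongVU_Llama3_2_3B": 3e9,
}

for name, params in backbones.items():
    gib = weight_memory_gib(params)
    print(f"{name}: roughly {gib:.1f} GiB for backbone weights in 16-bit precision")
```

Treat these numbers as a lower bound only; the actual requirement depends on video length, frame resolution, and the rest of the pipeline.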