Closed. ShiYaya closed this issue 2 months ago.
What are the principles behind the selection of baseline models? The ShareCaptioner-Video model is fine-tuned from the IXC2-4KHD model, while the ShareGPT4Video-8B model is built on the LLaVA-Next-8B model. What is the rationale behind these choices?
We build ShareGPT4Video-8B on LLaVA-Next-8B for easy reproduction. We chose InternLM-XComposer2-4KHD because it can handle a wide range of image resolutions and aspect ratios, making it well suited to serve as a general captioner for diverse videos.