tyleryzhu opened 2 months ago
Hello, from the code at https://github.com/RifleZhang/LLaVA-Hound-DPO/blob/main/llava_hound_dpo/sft_scripts/video_sft_qa_240k.sh#L19, the SFT stage uses 100k image + 240k video QA. A small set of 15k captions is mixed in, inspired by the ShareGPT4V training setup, but we didn't test whether removing that data makes a difference.
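For context, a minimal sketch of how such a mix might be assembled is below. The file names are hypothetical placeholders for illustration only, not the actual paths used by video_sft_qa_240k.sh.

```python
# Sketch of assembling the described SFT mix: 100k image QA + 240k video QA
# + 15k captions. All file names below are hypothetical placeholders.
import json
import random

def load(path):
    # Each file is assumed to be a JSON list of conversation samples.
    with open(path) as f:
        return json.load(f)

mix = (
    load("image_qa_100k.json")        # hypothetical: 100k image QA samples
    + load("video_qa_240k.json")      # hypothetical: 240k video QA samples
    + load("video_caption_15k.json")  # hypothetical: 15k ShareGPT4V-style captions
)
random.shuffle(mix)  # interleave modalities before SFT

with open("sft_mix.json", "w") as f:
    json.dump(mix, f)
```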
It seems like there are three different SFT setups recommended between the code and the paper:
Paper:
Code (your ckpt):
Code (new recipe I assume?):
I assume the new recipe is one you tested and that it achieves the same or better numbers than those reported in the paper? If you could clarify the different settings, that would be much appreciated. Thank you!