tyleryzhu opened 2 months ago
Hello, from the code at https://github.com/RifleZhang/LLaVA-Hound-DPO/blob/main/llava_hound_dpo/sft_scripts/video_sft_qa_240k.sh#L19, the SFT stage uses 100k image + 240k video QA. A small set of 15k captions is mixed in, inspired by the ShareGPT4V training setup, but we didn't test whether removing that data makes a difference.
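For context, a minimal sketch of how such a mix might be assembled is below. The file names are hypothetical placeholders for illustration only, not the actual paths used by video_sft_qa_240k.sh.

```python
# Sketch of assembling the described SFT mix: 100k image QA + 240k video QA
# + 15k captions. All file names below are hypothetical placeholders.
import json
import random

def load(path):
    # Each file is assumed to be a JSON list of conversation samples.
    with open(path) as f:
        return json.load(f)

mix = (
    load("image_qa_100k.json")        # hypothetical: 100k image QA samples
    + load("video_qa_240k.json")      # hypothetical: 240k video QA samples
    + load("video_caption_15k.json")  # hypothetical: 15k ShareGPT4V-style captions
)
random.shuffle(mix)  # interleave modalities before SFT

with open("sft_mix.json", "w") as f:
    json.dump(mix, f)
```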
It seems like there are three different SFT setups recommended between the code and the paper:
Paper:
Code (your ckpt):
Code (new recipe I assume?):
I assume the new recipe is one you tested and that it achieves the same or better numbers than those reported in the paper? If you could clarify the different settings, that would be much appreciated. Thank you!