Nastu-Ho opened this issue 5 months ago
*qa.yaml
I find it a bit odd to use different training datasets depending on the benchmark. For example, with VideoChat2, all instruction datasets were (as far as I know) trained on together, and the resulting model was then evaluated on the various benchmarks. For ST-LLM, however, a different instruction dataset mix is used for training depending on the benchmark, and each model is evaluated separately. Doesn't this seem unfair? I'm curious about the rationale behind splitting the data this way.
Hello, thank you for raising this. The reason we did this is that multiple-choice instruction data from datasets such as K400, SSV2, and CLEVRER, while beneficial for MVBench, severely degrades the model's dialogue performance and leads to significant hallucinations. In fact, our approach uses less data overall while achieving better results.
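For illustration only, here is a minimal sketch of how a benchmark-specific instruction mix could be expressed in a YAML config. The keys and dataset names are assumptions for the sake of the example, not ST-LLM's actual config schema:

```yaml
# Hypothetical instruction-tuning data mix (illustrative only, not the real ST-LLM schema).
# An MVBench-oriented run would enable the multiple-choice sources; a dialogue/QA-oriented
# run would disable them to avoid the hallucination issue described above.
datasets:
  - name: videochat_instruct   # conversational instruction data, kept in both mixes
    format: dialogue
    enabled: true
  - name: k400_mcq             # multiple-choice data from K400; helps MVBench,
    format: multiple_choice    # but hurts dialogue quality
    enabled: false
  - name: ssv2_mcq
    format: multiple_choice
    enabled: false
  - name: clevrer_mcq
    format: multiple_choice
    enabled: false
```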
Which configuration file can reproduce the 54.x result reported in the paper?