TinyLLaVA / TinyLLaVA_Factory

A Framework of Small-scale Large Multimodal Models
https://arxiv.org/abs/2402.14289
Apache License 2.0

Why `group_by_modality` is set to False on the `custom_finetune.sh` script? #89

Open ggcr opened 2 months ago

ggcr commented 2 months ago

Hi,

I am exploring different experiments, such as training only the projector, and I've seen that the default script provided for custom finetuning has the group_by_modality_length flag set to False. https://github.com/TinyLLaVA/TinyLLaVA_Factory/blob/f5c2ded3010ea19bda543316b0b94474052a615e/scripts/train/custom_finetune.sh#L1-L24

Since it uses llava_v1_5_mix665k.json as the Data Path, the dataset is not uni-modal: it contains conversations with the <image> multimodal token.
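For reference, entries in llava_v1_5_mix665k.json follow the standard LLaVA conversation format, where the <image> token appears in a human turn. A representative (illustrative, not verbatim) entry, and the check that makes a sample count as multimodal:

```python
# Illustrative entry in the LLaVA conversation format; the "image" field and
# the "<image>" token in the human turn are what make a sample multimodal.
sample = {
    "id": "000000033471",  # hypothetical id
    "image": "coco/train2017/000000033471.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in the picture?"},
        {"from": "gpt", "value": "A city street with parked cars."},
    ],
}

# A sample is multimodal when the image token is present in a turn:
is_multimodal = any("<image>" in t["value"] for t in sample["conversations"])
print(is_multimodal)  # True
```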

Can you clarify when we should set the group_by_modality_length flag to True?

I am running into errors caused by this:

E.g.

File "/root/TinyLLaVA_Factory/tinyllava/train/tinyllava_trainer.py", line 47, in get_modality_length_grouped_indices
    lang_indices, lang_lengths = zip(*[(i, -l) for i, l in enumerate(lengths) if l < 0])
ValueError: not enough values to unpack (expected 2, got 0)
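The failure can be reproduced in isolation. In the trainer's convention, multimodal samples get positive lengths and text-only samples get negative lengths; on a purely multimodal dataset the language-only filter is empty, and unpacking `zip(*[])` into two names raises exactly this ValueError (hypothetical sample lengths below):

```python
# Hypothetical lengths for a purely multimodal dataset: all positive, since
# text-only samples would be marked with negative lengths.
lengths = [128, 256, 512]

# The lang branch filters for negative lengths; with none present,
# zip(*[]) yields nothing to unpack into two tuples.
try:
    lang_indices, lang_lengths = zip(*[(i, -l) for i, l in enumerate(lengths) if l < 0])
except ValueError as e:
    print(e)  # not enough values to unpack (expected 2, got 0)
```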

Thanks in advance.

ggcr commented 1 month ago

There is an inconsistency: in some (full) finetuning scripts the flag is set to False, and in others to True.

ggcr commented 1 month ago

I have a dataset of pure multimodal data, that is, each question ("from": "gpt") contains the <image> token. Hence, this error should not pop up.

The LLaVA author addresses it in this issue.

I created a PR with the same fix that the main LLaVA repo currently uses.
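For reference, the upstream fix is essentially a guard at the top of get_modality_length_grouped_indices: when every length has the same sign, the dataset contains a single modality, so the function falls back to plain length grouping instead of splitting. A simplified, self-contained sketch (the real trainer also shuffles and interleaves length-sorted megabatches, omitted here; group_by_length is a stand-in helper):

```python
def group_by_length(lengths):
    # Stand-in for the trainer's plain length-grouped sampler:
    # here, simply indices sorted by absolute length, longest first.
    return sorted(range(len(lengths)), key=lambda i: abs(lengths[i]), reverse=True)

def get_modality_length_grouped_indices(lengths):
    # Guard from the upstream LLaVA fix: positive lengths mark multimodal
    # samples, negative lengths mark text-only samples. If all lengths share
    # one sign, there is only one modality, so skip the modality split that
    # would otherwise hit `zip(*[])` and raise ValueError.
    assert all(l != 0 for l in lengths), "Should not have zero length."
    if all(l > 0 for l in lengths) or all(l < 0 for l in lengths):
        return group_by_length(lengths)
    mm_indices = [i for i, l in enumerate(lengths) if l > 0]
    lang_indices = [i for i, l in enumerate(lengths) if l < 0]
    # ...the original implementation continues by grouping each modality
    # into length-sorted megabatches and interleaving them...
    return mm_indices + lang_indices

# A purely multimodal dataset (all positive lengths) no longer crashes:
print(get_modality_length_grouped_indices([128, 256, 512]))  # [2, 1, 0]
```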

ZhangXJ199 commented 1 month ago

Thanks for your question. group_by_modality_length needs to be set according to the data type: for data that contains only multimodal QA, it needs to be set to False; for data that contains both multimodal QA and pure-text QA (such as SQA), it can be set to True.