Coobiw / MPP-LLaVA

Personal Project: MPP-Qwen14B & MPP-Qwen-Next (Multimodal Pipeline Parallel based on Qwen-LM). Supports [video/image/multi-image] {sft/conversations}. Don't let poverty limit your imagination! Train your own 8B/14B LLaVA-training-like MLLM on an RTX 3090/4090 with 24GB.

Probably lower loss when using `train_pipeline.py` #22

Closed Coobiw closed 2 months ago

Coobiw commented 2 months ago

https://github.com/Coobiw/MiniGPT4Qwen/blob/e056abb6dbd19390434ca9f8f666806e6961cc9d/lavis/models/minigpt4qwen_models/minigpt4qwen_pipe.py#L202

In this implementation of the next-token-prediction loss, for a single sequence the summed loss is divided by `max_txt_len`, rather than by the number of tokens that actually contribute to the loss (i.e., excluding padding tokens).
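For illustration, here is a minimal sketch of the difference (not the repo's actual code; the function name, tensor shapes, and the use of `-100` as the ignore label are assumptions):

```python
import torch
import torch.nn.functional as F

def per_sequence_loss(logits, labels, max_txt_len):
    # logits: (seq_len, vocab_size), labels: (seq_len,), padded positions labelled -100
    shift_logits = logits[:-1, :]
    shift_labels = labels[1:]
    # Summed cross-entropy over the valid (non -100) positions only
    summed = F.cross_entropy(shift_logits, shift_labels,
                             ignore_index=-100, reduction="sum")
    n_valid = (shift_labels != -100).sum().clamp(min=1)

    loss_pipeline_style = summed / max_txt_len  # divides by the padded length
    loss_hf_style = summed / n_valid            # divides by actual supervised tokens
    return loss_pipeline_style, loss_hf_style
```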

In the Hugging Face implementation:

```python
if attention_mask is not None:
    # Keep only the positions that are attended to (i.e. drop padding) before computing the loss
    shift_attention_mask = attention_mask[..., 1:]
    shift_logits = logits[..., :-1, :][shift_attention_mask.to(logits.device) != 0].contiguous()
    shift_labels = labels[..., 1:][shift_attention_mask.to(labels.device) != 0].contiguous()
else:
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
```
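Because padded positions are already filtered out by the boolean mask above, the mean cross-entropy that follows averages only over the surviving tokens. Roughly, the continuation looks like this (a sketch, not the verbatim Hugging Face code):

```python
loss_fct = torch.nn.CrossEntropyLoss()  # default 'mean' reduction over the remaining tokens
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
                shift_labels.view(-1))
```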

It uses the attention_mask to correct the denominator. So our loss may appear lower than it should, but I don't think it leads to any further differences.
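Concretely, under the assumptions of the sketch above, the two values differ only by a per-sequence constant factor: `loss_pipeline = loss_hf * n_valid / max_txt_len`. For example (made-up numbers), with `max_txt_len = 512` and 128 supervised tokens, the pipeline loss reads 4x lower than the HF-style loss for the same model state.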

Coobiw commented 2 months ago

No fix needed. This is just a reminder.