Meituan-AutoML / MobileVLM

Strong and Open Vision Language Assistant for Mobile Devices
Apache License 2.0

Significant Performance Drop in Model Replication #18

Closed: rangmiao closed this issue 6 months ago

rangmiao commented 6 months ago

Firstly, I would like to express my sincere gratitude for your impactful work. I encountered a challenge while attempting to replicate your work on a V100: the accuracy after training is significantly lower than what is reported in your paper. For GQA, the paper reports an accuracy of 56.1, but my replication only achieved 52.98.

I carefully followed the steps outlined in the README, starting with the first phase of training and then proceeding to the second phase, which trains with the LoRA method. After training, the model was merged using the merge_lora function, and the accuracy was then evaluated. At this point, I am uncertain whether the issue lies in the training phase or in the merge_lora stage. Below are the loss details from both stages of training. First phase: [loss curve screenshot]. During the second phase, I observed that the loss stabilized around 0.85.

I am seeking your guidance to pinpoint where the problem might have occurred in the training process. Additionally, would it be possible for you to share your training logs and detailed information about the model merging process? Such insights would be immensely helpful in resolving the discrepancies I am facing.

Thank you very much for your time and assistance.

Best regards,

YangYang-DLUT commented 6 months ago

Could you provide more information about your training settings? A loss that stabilizes around 0.85 during the second training phase is correct, which is also consistent with the findings of LLaVA.

rangmiao commented 6 months ago

These are the first-phase settings:

{
  "_name_or_path": "models/MobileLLaMA-1.4B-Chat",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "freeze_mm_mlp_adapter": false,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "image_aspect_ratio": "square",
  "image_grid_pinpoints": null,
  "initializer_range": 0.02,
  "intermediate_size": 5632,
  "max_position_embeddings": 2048,
  "max_sequence_length": 2048,
  "mm_hidden_size": 1024,
  "mm_projector_type": "ldpnet",
  "mm_use_im_patch_token": false,
  "mm_use_im_start_end": false,
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
  "mm_vision_tower": "models/clip-vit-large-patch14-336",
  "model_type": "mobilevlm",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "num_key_value_heads": 16,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.33.1",
  "tune_mm_mlp_adapter": true,
  "use_cache": true,
  "use_mm_proj": true,
  "vision_tower_type": "clip",
  "vocab_size": 32000
}

These are the second-phase settings:

{
  "_name_or_path": "mobilevlm/mobilevlm1.7b_20240115_223212/mobilevlm-2.finetune-lora",
  "architectures": [
    "MobileLlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "freeze_mm_mlp_adapter": false,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "image_aspect_ratio": "pad",
  "image_grid_pinpoints": null,
  "initializer_range": 0.02,
  "intermediate_size": 5632,
  "max_position_embeddings": 2048,
  "max_sequence_length": 2048,
  "mm_hidden_size": 1024,
  "mm_projector_type": "ldpnet",
  "mm_use_im_patch_token": false,
  "mm_use_im_start_end": false,
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
  "mm_vision_tower": "models/clip-vit-large-patch14-336",
  "model_type": "mobilevlm",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "num_key_value_heads": 16,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.33.1",
  "tune_mm_mlp_adapter": false,
  "use_cache": true,
  "use_mm_proj": true,
  "vision_tower_type": "clip",
  "vocab_size": 32000
}
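
For reference, the two configs above differ in only a handful of fields. Here is a minimal sketch to diff them; the file paths are placeholders, not paths from the repo:

# Hypothetical helper to diff the two phase configs shown above;
# "phase1/config.json" and "phase2/config.json" are placeholder paths.
import json

def diff_configs(path_a, path_b):
    with open(path_a) as fa, open(path_b) as fb:
        a, b = json.load(fa), json.load(fb)
    for key in sorted(set(a) | set(b)):
        if a.get(key) != b.get(key):
            print(f"{key}: {a.get(key)!r} -> {b.get(key)!r}")

diff_configs("phase1/config.json", "phase2/config.json")
# For the configs above this reports: _name_or_path, architectures,
# image_aspect_ratio (square -> pad), torch_dtype (float32 -> float16),
# and tune_mm_mlp_adapter (true -> false).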

After completing the second phase of training, I used the merge_lora function in scripts/mergelora.py provided by the authors to merge the trained parameters on top of the MobileVLM-1.7B model.

# scripts/mergelora.py
model_base = 'models/MobileVLM-1.7B'
model_path = 'mobilevlm1.7b_20240115_223212/mobilevlm-2.finetune-lora'
save_path = 'mobilevlm1.7b_20240115_223212/mobilevlm-recurrent-v2'
merge_lora(model_base, model_path, save_path)

er-muyue commented 6 months ago

# scripts/mergelora.py
model_base = 'models/MobileVLM-1.7B'     # should change this path to 'models/MobileLLaMA-1.4B-Chat'
model_path = 'mobilevlm1.7b_20240115_223212/mobilevlm-2.finetune-lora'
save_path = 'mobilevlm1.7b_20240115_223212/mobilevlm-recurrent-v2'
merge_lora(model_base, model_path, save_path)

The model_base in scripts/mergelora.py is the chat language model you used; I notice that you use models/MobileLLaMA-1.4B-Chat in the first-phase settings.

Please change model_base to the correct path and evaluate your merged model again.
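
For anyone else hitting this: conceptually, the merge loads the base language model, applies the LoRA weights on top of it, and saves the merged result. Below is a minimal sketch using the Hugging Face peft API, as an illustration only; it is not necessarily the exact implementation in scripts/mergelora.py.

# Illustrative LoRA merge using peft; a sketch, not the exact code
# in scripts/mergelora.py.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

def merge_lora_sketch(model_base, model_path, save_path):
    # model_base must be the LLM the LoRA was trained on top of,
    # i.e. 'models/MobileLLaMA-1.4B-Chat' here, not 'models/MobileVLM-1.7B'.
    base = AutoModelForCausalLM.from_pretrained(model_base)
    # Load the LoRA adapter, fold its weights into the base, and save.
    merged = PeftModel.from_pretrained(base, model_path).merge_and_unload()
    merged.save_pretrained(save_path)
    AutoTokenizer.from_pretrained(model_base).save_pretrained(save_path)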

rangmiao commented 6 months ago

Following the authors' suggestion, I used models/MobileLLaMA-1.4B-Chat as the base model for the merge_lora process and ran the evaluation again. However, the GQA score remained unchanged at 52.98. I also evaluated more datasets and found that only GQA shows a significant discrepancy. Below are the evaluation results: the first row shows the results from the authors' released MobileVLM-1.7B model, and the second row shows my replicated results.

Model                             GQA    SQA    VQA    POPE   MME      MMB
MobileVLM-1.7B (github)           56.01  54.8   41.7   84.5   1195.58  53.1
MobileVLM-1.7B LoRA (replicated)  52.98  54.16  40.14  84.76  1192.53  52.92

YangYang-DLUT commented 6 months ago

It seems that GQA was evaluated on a cached result. Remove the original result file and evaluate again to see whether you get the same value; it should show at least some fluctuation.
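
A minimal sketch of clearing a stale prediction file before re-running the evaluation; the path below is a placeholder, not the actual layout of this repo:

# Remove a cached GQA prediction file so the evaluation regenerates it.
# The path is hypothetical; substitute wherever your eval script writes answers.
import os

cache_file = "eval/gqa/answers/MobileVLM-1.7B.jsonl"  # placeholder path
if os.path.exists(cache_file):
    os.remove(cache_file)
    print(f"Removed cached predictions: {cache_file}")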

rangmiao commented 6 months ago

I greatly appreciate the help from the authors. Indeed, it turns out that I was using cached results. Now, the accuracy for GQA aligns perfectly!