PKU-YuanGroup / MoE-LLaVA

Mixture-of-Experts for Large Vision-Language Models
Apache License 2.0

Swapped in Qwen2 using the official LLaVA script and trained with the mpt template: loss is 0 #42

Open lucasjinreal opened 7 months ago

lucasjinreal commented 7 months ago

Has the author run into anything like this?
{'loss': 0.0, 'learning_rate': 0.001435114503816794, 'epoch': 0.02}
2%|██▊ | 188/8720 [14:45<11:05:28, 4.68s/it]
WARNING: tokenization mismatch: 58 vs. 59. (ignored)
WARNING: tokenization mismatch: 41 vs. 42. (ignored)

LinB203 commented 7 months ago

This is caused by a mismatch between the conv template and the preprocess function. You can refer to the, or directly use the code we provide for Qwen.

You should use the qwen conv template that we have provided.
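For context, here is a minimal sketch of why such a mismatch can remove the training signal. It assumes the LLaVA-style behaviour in which the preprocess function sums per-round token counts and, when the sum differs from the real sequence length, prints the warning and masks the whole sample. `mask_labels` and its arguments are hypothetical names used for illustration, not the repository's API.

```python
IGNORE_INDEX = -100  # label value skipped by the loss (PyTorch's default ignore_index)


def mask_labels(input_ids, round_lens, instruction_lens, total_len):
    """Mask instruction tokens; mask everything if the length accounting is off.

    round_lens[i]       -> tokens the template logic attributes to round i
    instruction_lens[i] -> leading tokens of round i that are prompt, not answer
    total_len           -> number of real (non-padding) tokens in input_ids
    """
    labels = list(input_ids)
    cur_len = 0
    for round_len, instruction_len in zip(round_lens, instruction_lens):
        # The prompt part of each round is never supervised.
        for i in range(cur_len, min(cur_len + instruction_len, len(labels))):
            labels[i] = IGNORE_INDEX
        cur_len += round_len
    # Anything after the accounted rounds (for example padding) is ignored too.
    for i in range(cur_len, len(labels)):
        labels[i] = IGNORE_INDEX
    if cur_len != total_len:
        # Template and preprocess disagree: drop the whole sample from the loss.
        print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}. (ignored)")
        labels = [IGNORE_INDEX] * len(labels)
    return labels


ids = list(range(10))
ok = mask_labels(ids, round_lens=[10], instruction_lens=[6], total_len=10)
bad = mask_labels(ids, round_lens=[9], instruction_lens=[6], total_len=10)
print(sum(label != IGNORE_INDEX for label in ok))   # supervised tokens when lengths agree
print(sum(label != IGNORE_INDEX for label in bad))  # supervised tokens after a mismatch
```

If every sample in a batch is masked this way, no supervised tokens remain, which would be consistent with the degenerate loss reported in this issue.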

lucasjinreal commented 7 months ago

I believe I have already changed the conv template to the mpt format.

The warnings still appear:

WARNING: tokenization mismatch: 42 vs. 43. (ignored)
WARNING: tokenization mismatch: 44 vs. 45. (ignored)
WARNING: tokenization mismatch: 51 vs. 52. (ignored)
WARNING: tokenization mismatch: 45 vs. 46. (ignored)
WARNING: tokenization mismatch: 48 vs. 49. (ignored)
WARNING: tokenization mismatch: 43 vs. 44. (ignored)

BTW, I am using Qwen1.5. Its tokenizer should already include the special tokens:

{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "151643": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151644": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151645": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": ["<|im_start|>", "<|im_end|>"],
  "bos_token": null,
  "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '\n'}}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant\n' }}{% endif %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "model_max_length": 32768,
  "pad_token": "<|endoftext|>",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null
}

LinB203 commented 7 months ago

Could you clarify what the mpt conv template is? Please post your run command.

lucasjinreal commented 7 months ago


if has_image:
    round_len = len(tokenizer_image_token(rou, tokenizer)) + 1  # for eos_token
    instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 1  # instruction_len is before the answer
else:
    round_len = len(tokenizer(rou).input_ids)
    instruction_len = len(tokenizer(parts[0]).input_ids) - 1

I don't think I changed this part, but I have a question. Your qwen conv template is:

conv_qwen = Conversation(
    system="A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.",
    roles=("USER", "ASSISTANT"),
    version="qwen",  # replace
    sep=" ",
    sep2="<|endoftext|>",  # replace with eos_token


conv_mpt = Conversation(
You should follow the instructions carefully and explain your answers in detail.""",
    # system = None,
    roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),


For ChatML, how should the preprocess code above be changed?
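As a reference point for the ChatML layout, the `chat_template` in the tokenizer config quoted earlier can be read as the plain-Python sketch below. `render_chatml` is a hypothetical helper written for this illustration; it mirrors the Jinja logic line by line and is not part of the repository or of transformers.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render messages the way the quoted Qwen1.5 chat_template does."""
    parts = []
    last = len(messages) - 1
    for i, message in enumerate(messages):
        # A default system turn is injected when the first message is not one.
        if i == 0 and message["role"] != "system":
            parts.append("<|im_start|>system\nYou are a helpful assistant<|im_end|>\n")
        parts.append("<|im_start|>" + message["role"] + "\n" + message["content"])
        # Every turn is closed, except the last one when no generation prompt is added.
        if i != last or add_generation_prompt:
            parts.append("<|im_end|>\n")
    if add_generation_prompt and messages and messages[-1]["role"] != "assistant":
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


msgs = [{"role": "user", "content": "Hi"}]
print(render_chatml(msgs, add_generation_prompt=True))
```

Each turn here is delimited by `<|im_start|>` and `<|im_end|>` plus a newline, so any preprocess logic that counts tokens per round would need to account for those delimiters rather than for the `USER`/`ASSISTANT` roles and space separator used by the qwen conv template shown above.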

LinB203 commented 7 months ago

I don't think the system prompt will affect model performance much. You can change system="A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions." to system="You should follow the instructions carefully and explain your answers in detail."

lucasjinreal commented 7 months ago

I tried modifying preprocess and kept the ChatML template, but the loss is still 0.

In theory, the only difference from your template is one extra eos token, which should not be enough to make the loss collapse.

Also, after the change, the warnings are still there:

WARNING: tokenization mismatch: 49 vs. 50. (ignored)
WARNING: tokenization mismatch: 47 vs. 48. (ignored)
WARNING: tokenization mismatch: 46 vs. 47. (ignored)
WARNING: tokenization mismatch: 54 vs. 55. (ignored)
WARNING: tokenization mismatch: 45 vs. 46. (ignored)
WARNING: tokenization mismatch: 60 vs. 61. (ignored)
WARNING: tokenization mismatch: 48 vs. 49. (ignored)
WARNING: tokenization mismatch: 48 vs. 49. (ignored)
WARNING: tokenization mismatch: 44 vs. 45. (ignored)
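One way an off-by-one like this can arise, offered as an illustration and not confirmed as the cause in this thread: length bookkeeping written for a tokenizer that prepends a BOS token, applied to a tokenizer whose config has `"bos_token": null`. The sketch below uses a toy whitespace tokenizer; `toy_tokenize` and `accounted_length` are hypothetical names.

```python
def toy_tokenize(text, add_bos):
    """Whitespace tokenizer standing in for a real one (hypothetical)."""
    tokens = text.split()
    return (["<bos>"] if add_bos else []) + tokens


def accounted_length(conversation, sep2, add_bos):
    """Sum per-round token counts the way BOS-assuming logic would."""
    cur_len = 1  # assumes the full sequence starts with a BOS token
    for rou in conversation.split(sep2):
        if rou.strip() == "":
            break
        # With a BOS-adding tokenizer, the extra BOS counted in each round
        # stands in for the separator token removed by split().
        cur_len += len(toy_tokenize(rou, add_bos))
    return cur_len


conv = "USER: hi ASSISTANT: hello </s> USER: bye ASSISTANT: ok </s>"
for add_bos in (True, False):
    total_len = len(toy_tokenize(conv, add_bos))
    cur_len = accounted_length(conv, " </s>", add_bos)
    status = "ok" if cur_len == total_len else "mismatch"
    print(f"add_bos={add_bos}: {cur_len} vs. {total_len} ({status})")
```

A practical check along the same lines is to tokenize one training sample as a whole and round by round with the real tokenizer, then compare the two totals before starting a long run.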

LinB203 commented 7 months ago

Could you post your run command?

lucasjinreal commented 7 months ago

@LinB203 Yes:


########### DO NOT CHANGE ###########
########### USE THIS FOR BOTH ###########

deepspeed \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path ./checkpoints/$MODEL_VERSION \
    --version $PROMPT_VERSION \
    --data_path ./data/llava_0.1/pretrain_data.json \
    --image_folder ./data/images \
    --vision_tower ./checkpoints/open-clip-vit-large-patch14-336px \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 False \
    --output_dir ./checkpoints/llava-$MODEL_VERSION-pretrain \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 24000 \
    --save_total_limit 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True

I am running the pretrain stage on 4xV100 for testing.
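As a sanity check on these flags, the arithmetic below derives the effective batch size and the warmup learning rates. It assumes that the 8720 total steps shown in the earlier progress bar belong to this configuration and that warmup is linear over `ceil(total_steps * warmup_ratio)` steps; neither assumption is stated in the thread.

```python
import math

num_gpus = 4            # "4xV100"
per_device_batch = 4    # --per_device_train_batch_size
grad_accum = 4          # --gradient_accumulation_steps
total_steps = 8720      # from the progress bar "188/8720" (assumed same config)
base_lr = 1e-3          # --learning_rate
warmup_ratio = 0.03     # --warmup_ratio

effective_batch = num_gpus * per_device_batch * grad_accum
warmup_steps = math.ceil(total_steps * warmup_ratio)


def warmup_lr(step):
    """Learning rate at a given optimizer step during linear warmup (assumed shape)."""
    return base_lr * step / warmup_steps


print(effective_batch)                # samples per optimizer step
print(effective_batch * total_steps)  # approximate samples seen in one epoch
print(warmup_steps)
print(warmup_lr(1), warmup_lr(4))
```

Under these assumptions the values for steps 1 and 4 agree with learning rates logged in this thread (3.816793893129771e-06 and 1.5267175572519083e-05), which suggests the scheduler itself is behaving as configured.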

LinB203 commented 7 months ago

Use --version plain in Stage 1.
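A rough sketch of what a "plain" preprocessing path looks like in LLaVA-style code, under the assumption that it concatenates the image placeholder, the caption and a separator, and masks only the placeholder. The function names, the toy tokenizer and the placeholder id are hypothetical and only serve the illustration.

```python
IGNORE_INDEX = -100
IMAGE_TOKEN = "<image>"
IMAGE_PLACEHOLDER_ID = -200  # hypothetical id standing in for the image features


def toy_tokenize(text):
    """Toy tokenizer: the image placeholder is one token, then one id per character."""
    ids = []
    while text:
        if text.startswith(IMAGE_TOKEN):
            ids.append(IMAGE_PLACEHOLDER_ID)
            text = text[len(IMAGE_TOKEN):]
        else:
            ids.append(ord(text[0]))
            text = text[1:]
    return ids


def preprocess_plain_sketch(caption, tokenize, sep="\n"):
    """Build (input_ids, labels) for one image-caption pair without a chat template."""
    prompt = IMAGE_TOKEN + caption + sep
    input_ids = tokenize(prompt)
    labels = list(input_ids)
    # Only the image placeholder is masked; the caption is fully supervised.
    n_image = len(tokenize(IMAGE_TOKEN))
    labels[:n_image] = [IGNORE_INDEX] * n_image
    return input_ids, labels


input_ids, labels = preprocess_plain_sketch("a cat", toy_tokenize)
print(input_ids)
print(labels)
```

Because this path has no roles or per-round separators, it involves no per-round length accounting that could drift away from the real sequence length.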

lucasjinreal commented 7 months ago

I changed it to plain and still got loss 0:

########### DO NOT CHANGE ###########

deepspeed \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path ./checkpoints/$MODEL_VERSION \
    --version $PROMPT_VERSION \
    --data_path ./data/llava_0.1/pretrain_data.json \
    --image_folder ./data/images \
    --vision_tower ./checkpoints/chinese-clip-vit-large-patch14-336px \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 False \
    --output_dir ./checkpoints/llava-$MODEL_VERSION-pretrain \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 24000 \
    --save_total_limit 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True

And the warning persists. Why?

WARNING: tokenization mismatch: 59 vs. 61. (ignored)
WARNING: tokenization mismatch: 54 vs. 56. (ignored)
WARNING: tokenization mismatch: 57 vs. 59. (ignored)
WARNING: tokenization mismatch: 65 vs. 67. (ignored)
WARNING: tokenization mismatch: 70 vs. 72. (ignored)
WARNING: tokenization mismatch: 62 vs. 64. (ignored)
WARNING: tokenization mismatch: 62 vs. 64. (ignored)
WARNING: tokenization mismatch: 61 vs. 63. (ignored)
{'loss': 0.0, 'learning_rate': 1.5267175572519083e-05, 'epoch': 0.0}

LinB203 commented 7 months ago

Sorry, I cannot reproduce your error. Please pull the latest code again and follow the, which is clear enough to implement Qwen1.5.

The Qwen1.5 scripts are the same as the qwen ones; you only need to modify --model_name_or_path. Our latest code supports Qwen1.5.

lucasjinreal commented 7 months ago

I got it working now. The loss shows:

{'loss': 16.2781, 'learning_rate': 3.816793893129771e-06, 'epoch': 0.0}                                                                                                         
{'loss': 15.7207, 'learning_rate': 7.633587786259541e-06, 'epoch': 0.0}                                                                                                         
{'loss': 15.9175, 'learning_rate': 1.1450381679389314e-05, 'epoch': 0.0}                                                                                                        
{'loss': 15.8711, 'learning_rate': 1.5267175572519083e-05, 'epoch': 0.0}                                                                                                        
{'loss': 15.352, 'learning_rate': 1.9083969465648855e-05, 'epoch': 0.0}                                                                                                         
{'loss': 14.9108, 'learning_rate': 2.2900763358778628e-05, 'epoch': 0.0}                                                                                                        
{'loss': 14.1826, 'learning_rate': 2.6717557251908397e-05, 'epoch': 0.0}                                                                                                        
{'loss': 13.4192, 'learning_rate': 3.0534351145038166e-05, 'epoch': 0.0} 

Is this normal for the pretrain stage? It looks very large.
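One reference point for judging these numbers is the cross-entropy of a uniform guess over the vocabulary, ln(V). The sketch below uses only the largest token id visible in the tokenizer config quoted earlier; the true vocabulary or embedding size may be larger, which would raise the value only slightly.

```python
import math

# The largest token id visible in the quoted tokenizer config is 151645,
# so the vocabulary holds at least 151646 entries (a lower bound, not the
# model's actual embedding size, which the thread does not state).
min_vocab_size = 151645 + 1

# Cross-entropy of a uniform guess over the vocabulary, in nats.
uniform_loss = math.log(min_vocab_size)
print(round(uniform_loss, 2))
```

This comes to about 11.9, so the initial values of 13 to 16 are above the uniform baseline. That pattern fits a model whose first predictions are confidently wrong, for example because the projector output is still uninformative, although the thread does not confirm this reading.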

lucasjinreal commented 7 months ago

@LinB203 The 1.5 support is very nice! So you must have upgraded to the latest transformers to support the qwen2 tokenizer? How about using the MoE architecture from transformers to minimize the code?

lucasjinreal commented 7 months ago

BTW, did you try unfreezing both the vision tower and the projector in both stage 1 and stage 2?

LinB203 commented 7 months ago

I haven't run qwen2 on clip-large-336; with siglip-384 it converges to around 0.8. We support the qwen2 tokenizer. We did not modify the MLP projector, and you can change the vision encoder by following here.

lucasjinreal commented 7 months ago

Is that qwen1.8b for 1 epoch, with the llava pretrain dataset?

BTW, did you try unfreezing both the vision tower and the projector in both stage 1 and stage 2?

Did you try the same training procedure as Yi-6b-VL?

LinB203 commented 7 months ago

Is that qwen1.8b for 1 epoch, with the llava pretrain dataset?

BTW, did you try unfreezing both the vision tower and the projector in both stage 1 and stage 2?

Did you try the same training procedure as Yi-6b-VL?

Yes, with the sharegpt4v pretrain dataset. No.

lucasjinreal commented 7 months ago

@LinB203 Shouldn't it be the 1.8 MoE? Is the model open? Is the pretrained (not fine-tuned) model able to do simple image captioning?

LinB203 commented 7 months ago

@LinB203 Shouldn't it be the 1.8 MoE? Is the model open? Is the pretrained (not fine-tuned) model able to do simple image captioning?

The model has not been released yet. Yes.

lucasjinreal commented 7 months ago


  1. When will the model be released? It looks better than the current model; is it also a 1.8bx4 MoE?
  2. Which pretraining data did you use: sharegpt4v_instruct_gpt4-vision_cap100k.json or the pt part? Might the pt part contain a lot of noise?

lucasjinreal commented 7 months ago

@LinB203 I think it is hard for the pretrain loss to reach 0.2; the official LLaVA pretrain loss is about 1.9:


How did you manage to get the pretrain loss so low?

LinB203 commented 7 months ago

It is scheduled for next month. We use the pretraining dataset from sharegpt4v, which has about 1.2M QA pairs.


  1. When will the model be released? It looks better than the current model; is it also a 1.8bx4 MoE?
  2. Which pretraining data did you use: sharegpt4v_instruct_gpt4-vision_cap100k.json or the pt part? Might the pt part contain a lot of noise?

LinB203 commented 7 months ago

We use the pretraining dataset from sharegpt4v, which has about 1.2M QA pairs.


lucasjinreal commented 7 months ago

@LinB203 I found that sharegpt4v also lacks Chinese data. Do you think any high-quality Chinese image-text pairs for pretraining could be used to improve Chinese ability?

LinB203 commented 7 months ago

Refer to

lucasjinreal commented 7 months ago

@LinB203 Do you think raw OCR image-text pairs can be used as pretraining data?