PKU-YuanGroup / MoE-LLaVA

Mixture-of-Experts for Large Vision-Language Models
https://arxiv.org/abs/2401.15947
Apache License 2.0

Replacing the LLM with Qwen2 in the official LLaVA script and training with the mpt template gives loss 0 #42

Open lucasjinreal opened 7 months ago

lucasjinreal commented 7 months ago

Have you run into a similar situation? {'loss': 0.0, 'learning_rate': 0.001435114503816794, 'epoch': 0.02}
2%|██▊ | 188/8720 [14:45<11:05:28, 4.68s/it]WARNING: tokenization mismatch: 58 vs. 59. (ignored) WARNING: tokenization mismatch: 41 vs. 42. (ignored)

LinB203 commented 7 months ago

This is caused by a mismatch between the conv template and the preprocess function. You can refer to custom.md, or directly use the Qwen code we provide: https://github.com/PKU-YuanGroup/MoE-LLaVA/issues/39

You should use the qwen conv template that we have provided.
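
For context, the "tokenization mismatch" warning comes from a length check in the LLaVA-style preprocess functions: when the sum of the per-round token counts does not match the length of the fully tokenized conversation, the entire label tensor is masked out, which is exactly why the loss becomes 0. A minimal sketch of that check (paraphrased for illustration, not the exact MoE-LLaVA source):

import torch

IGNORE_INDEX = -100  # label value that the cross-entropy loss ignores

def check_tokenization(target: torch.Tensor, cur_len: int, total_len: int, model_max_length: int) -> None:
    # If the per-round token counting disagrees with the full tokenization,
    # LLaVA-style preprocess functions mask every label of the sample, so it
    # contributes zero loss (hence the "(ignored)" suffix in the warning).
    if cur_len < model_max_length and cur_len != total_len:
        target[:] = IGNORE_INDEX
        print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}. (ignored)")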

lucasjinreal commented 7 months ago

I believe I have already switched the conv template to the mpt format.

Also, I am using the latest version of transformers, and these warnings appear:

WARNING: tokenization mismatch: 42 vs. 43. (ignored)
WARNING: tokenization mismatch: 44 vs. 45. (ignored)
WARNING: tokenization mismatch: 51 vs. 52. (ignored)
WARNING: tokenization mismatch: 45 vs. 46. (ignored)
WARNING: tokenization mismatch: 48 vs. 49. (ignored)
WARNING: tokenization mismatch: 43 vs. 44. (ignored)

BTW, I am using Qwen1.5, and its tokenizer should already contain the special tokens:

{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "151643": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151644": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151645": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": ["<|im_start|>", "<|im_end|>"],
  "bos_token": null,
  "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '\n'}}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant\n' }}{% endif %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "model_max_length": 32768,
  "pad_token": "<|endoftext|>",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null
}
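
For reference, a quick way to confirm that these markers really map to single special-token ids (a minimal sketch using the standard transformers API; the checkpoint path is a placeholder):

from transformers import AutoTokenizer

# Placeholder path; point this at the local Qwen1.5 checkpoint used for training.
tokenizer = AutoTokenizer.from_pretrained("./checkpoints/qwen-1.8b")

for marker in ["<|endoftext|>", "<|im_start|>", "<|im_end|>"]:
    ids = tokenizer(marker, add_special_tokens=False).input_ids
    print(marker, ids)  # expected: a single id each (151643, 151644, 151645)
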
LinB203 commented 7 months ago

Can you clarify what the mpt conv template is? Could you post your run command?

lucasjinreal commented 7 months ago

I noticed that you changed this part:

if has_image:
    round_len = len(tokenizer_image_token(rou, tokenizer)) + 1  # for eos_token
    instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 1  # instruction_len is before the answer
else:
    round_len = len(tokenizer(rou).input_ids)
    instruction_len = len(tokenizer(parts[0]).input_ids) - 1

I don't think I changed that part, but here is my question: your corresponding qwen conv template is:

conv_qwen = Conversation(
    system="A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.",
    roles=("USER", "ASSISTANT"),
    version="qwen",  # replace
    messages=(),
    offset=0,
    sep_style=SeparatorStyle.TWO,
    sep=" ",
    sep2="<|endoftext|>",  # replace with eos_token
)

while the one I am using is:

conv_mpt = Conversation(
    system="""<|im_start|>system
You should follow the instructions carefully and explain your answers in detail.""",
    # system = None,
    roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
    version="mpt",
    messages=(),
    offset=0,
    sep_style=SeparatorStyle.MPT,
    sep="<|im_end|>",
)

Since the Qwen-based chat models use the ChatML format, wouldn't it be better to stick with ChatML?

How should the preprocess function be modified for ChatML?
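
(For anyone with the same question: LLaVA-style code handles ChatML-formatted conversations in a dedicated preprocess path that splits the rendered prompt on <|im_end|> and masks everything up to each assistant answer. The sketch below is a simplified illustration of that idea, not the exact MoE-LLaVA source; the separator/eos token counting is tokenizer-dependent and is precisely where the off-by-one mismatch warnings tend to come from.)

import torch

IGNORE_INDEX = -100

def mask_chatml_labels(conversation: str, target: torch.Tensor, tokenizer) -> None:
    # `conversation` is the fully rendered ChatML prompt and `target` is a copy
    # of its input_ids; the real code also routes image tokens through
    # tokenizer_image_token, which is omitted here for brevity.
    sep = "<|im_end|>"
    answer_sep = sep + "<|im_start|>assistant\n"

    rounds = conversation.split(sep)
    # Re-group the pieces: [system + user + assistant] for round 1,
    # then [user + assistant] pairs for the following rounds.
    re_rounds = [sep.join(rounds[:3])]
    for idx in range(3, len(rounds), 2):
        re_rounds.append(sep.join(rounds[idx:idx + 2]))

    cur_len = 0
    for rou in re_rounds:
        if rou == "":
            break
        parts = rou.split(answer_sep)
        if len(parts) != 2:
            break
        parts[0] += answer_sep
        # +1 accounts for the trailing <|im_end|> of this round; this counting
        # is exactly where off-by-one mismatches arise.
        round_len = len(tokenizer(rou).input_ids) + 1
        instruction_len = len(tokenizer(parts[0]).input_ids)
        # Supervise only the assistant answer: mask the instruction part.
        target[cur_len:cur_len + instruction_len] = IGNORE_INDEX
        cur_len += round_len
    # Mask anything left over (padding / truncated tail).
    target[cur_len:] = IGNORE_INDEX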

LinB203 commented 7 months ago

I don't think the system prompt will seriously affect performance. You can change system="A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions." to system="You should follow the instructions carefully and explain your answers in detail."

lucasjinreal commented 7 months ago

I tried modifying preprocess and kept the ChatML template, but the loss is still 0.

In theory my template only differs from yours by one extra eos token, so the loss should not collapse.

Also, after the change, the warnings persist:

WARNING: tokenization mismatch: 49 vs. 50. (ignored)
WARNING: tokenization mismatch: 47 vs. 48. (ignored)
WARNING: tokenization mismatch: 46 vs. 47. (ignored)
WARNING: tokenization mismatch: 54 vs. 55. (ignored)
WARNING: tokenization mismatch: 45 vs. 46. (ignored)
WARNING: tokenization mismatch: 60 vs. 61. (ignored)
WARNING: tokenization mismatch: 48 vs. 49. (ignored)
WARNING: tokenization mismatch: 48 vs. 49. (ignored)
WARNING: tokenization mismatch: 44 vs. 45. (ignored)

LinB203 commented 7 months ago

Could you post your run command?

lucasjinreal commented 7 months ago

@LinB203 Yes:

MODEL_VERSION=qwen-1.8b

########### DO NOT CHANGE ###########
########### USE THIS FOR BOTH ###########
PROMPT_VERSION=qwen

deepspeed train_xformers.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path ./checkpoints/$MODEL_VERSION \
    --version $PROMPT_VERSION \
    --data_path ./data/llava_0.1/pretrain_data.json \
    --image_folder ./data/images \
    --vision_tower ./checkpoints/open-clip-vit-large-patch14-336px \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 False \
    --output_dir ./checkpoints/llava-$MODEL_VERSION-pretrain \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 24000 \
    --save_total_limit 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True

I am running the pretrain stage, on 4x V100 for testing.

LinB203 commented 7 months ago

Use --version plain in Stage 1; see https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/main/scripts/v1/qwen/pretrain.sh#L9.
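
For what it's worth, the plain template in Stage 1 skips conversation formatting entirely: the prompt is just the image token and the caption, and only the caption tokens are supervised, so no ChatML/mpt template is involved at this stage. A rough sketch of that preprocessing (paraphrased from the LLaVA-style preprocess_plain; the function name and the import path for tokenizer_image_token are assumptions, not verified against the MoE-LLaVA source):

import copy

# Assumed helper location; in upstream LLaVA it is llava.mm_utils.
from moellava.mm_utils import tokenizer_image_token

IGNORE_INDEX = -100
DEFAULT_IMAGE_TOKEN = "<image>"

def preprocess_plain_sketch(caption: str, tokenizer, sep: str = "\n"):
    # Stage-1 "plain" preprocessing: the prompt is just the image token, and
    # only the caption (plus the separator) receives labels.
    conversation = DEFAULT_IMAGE_TOKEN + caption + sep
    input_ids = tokenizer_image_token(conversation, tokenizer, return_tensors='pt')
    target = copy.deepcopy(input_ids)
    # Mask the image-token prefix so the loss is computed only on the caption.
    prefix_len = len(tokenizer_image_token(DEFAULT_IMAGE_TOKEN, tokenizer))
    target[:prefix_len] = IGNORE_INDEX
    return input_ids, target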

lucasjinreal commented 7 months ago

I changed it to plain and still get loss 0:

PROMPT_VERSION=plain
########### DO NOT CHANGE ###########

deepspeed train_xformers.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path ./checkpoints/$MODEL_VERSION \
    --version $PROMPT_VERSION \
    --data_path ./data/llava_0.1/pretrain_data.json \
    --image_folder ./data/images \
    --vision_tower ./checkpoints/chinese-clip-vit-large-patch14-336px \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 False \
    --output_dir ./checkpoints/llava-$MODEL_VERSION-pretrain \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 24000 \
    --save_total_limit 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True

and the warnings persist. Why?

WARNING: tokenization mismatch: 59 vs. 61. (ignored)
WARNING: tokenization mismatch: 54 vs. 56. (ignored)
WARNING: tokenization mismatch: 57 vs. 59. (ignored)
WARNING: tokenization mismatch: 65 vs. 67. (ignored)
WARNING: tokenization mismatch: 70 vs. 72. (ignored)
WARNING: tokenization mismatch: 62 vs. 64. (ignored)
WARNING: tokenization mismatch: 62 vs. 64. (ignored)
WARNING: tokenization mismatch: 61 vs. 63. (ignored)
{'loss': 0.0, 'learning_rate': 1.5267175572519083e-05, 'epoch': 0.0}     
LinB203 commented 7 months ago

Sorry, I cannot reproduce your error. Please re-pull the latest code and follow custom.md, which is clear enough to implement Qwen1.5, as was done in https://github.com/PKU-YuanGroup/MoE-LLaVA/issues/39.

The Qwen1.5 scripts are the same as the qwen ones; you only need to modify --model_name_or_path. Our latest code supports Qwen1.5: https://github.com/PKU-YuanGroup/MoE-LLaVA/issues/39#issuecomment-1945654824.

lucasjinreal commented 7 months ago

I got it working now; the loss shows:

{'loss': 16.2781, 'learning_rate': 3.816793893129771e-06, 'epoch': 0.0}                                                                                                         
{'loss': 15.7207, 'learning_rate': 7.633587786259541e-06, 'epoch': 0.0}                                                                                                         
{'loss': 15.9175, 'learning_rate': 1.1450381679389314e-05, 'epoch': 0.0}                                                                                                        
{'loss': 15.8711, 'learning_rate': 1.5267175572519083e-05, 'epoch': 0.0}                                                                                                        
{'loss': 15.352, 'learning_rate': 1.9083969465648855e-05, 'epoch': 0.0}                                                                                                         
{'loss': 14.9108, 'learning_rate': 2.2900763358778628e-05, 'epoch': 0.0}                                                                                                        
{'loss': 14.1826, 'learning_rate': 2.6717557251908397e-05, 'epoch': 0.0}                                                                                                        
{'loss': 13.4192, 'learning_rate': 3.0534351145038166e-05, 'epoch': 0.0} 

Is this normal for the pretrain stage? It looks very large.

lucasjinreal commented 7 months ago

@LinB203 The Qwen1.5 support is very nice! You must have upgraded to the latest transformers to support the qwen2 tokenizer? How about using the MoE implementation in transformers to minimize the code?

lucasjinreal commented 7 months ago

BTW, did you try unfreezing both the vision tower and the projector in stage 1 and stage 2?

LinB203 commented 7 months ago

I haven't run qwen2 with clip-large-336; it converges to around 0.8 with siglip-384. We support the qwen2 tokenizer: https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/main/moellava/train/train.py#L1402. We did not modify the MLP projector, and you can change the vision encoder by following the instructions here.

lucasjinreal commented 7 months ago

Is qwen1.8b trained for 1 epoch, with the LLaVA pretrain dataset?

BTW, did you try unfreezing both the vision tower and the projector in stage 1 and stage 2?

Did you try the same training steps as Yi-6b-VL?

LinB203 commented 7 months ago

Is qwen1.8b trained for 1 epoch, with the LLaVA pretrain dataset?

BTW, did you try unfreezing both the vision tower and the projector in stage 1 and stage 2?

Did you try the same training steps as Yi-6b-VL?

Yes, with the sharegpt4v pretrain dataset. No.

lucasjinreal commented 7 months ago

@LinB203 Shouldn't it be the 1.8B MoE? Is the model open-sourced? Is the pretrained (not finetuned) model able to do simple image captioning?

LinB203 commented 7 months ago

@LinB203 Shouldn't it be the 1.8B MoE? Is the model open-sourced? Is the pretrained (not finetuned) model able to do simple image captioning?

The model has not been released yet. Yes.

lucasjinreal commented 7 months ago

@LinB203

  1. When will the model be released? It looks better than the current model; is it also a 1.8Bx4 MoE?
  2. Which pretraining data were you using, sharegpt4v_instruct_gpt4-vision_cap100k.json or the pt part? Does the pt part perhaps contain a lot of noise?
lucasjinreal commented 7 months ago

@LinB203 I think it is hard for the pretrain loss to reach 0.2; the official pretrain loss of LLaVA is about 1.9:

[screenshot: official LLaVA pretrain loss log]

How did you manage to get the pretrain loss so small?

LinB203 commented 7 months ago

Scheduled for next month. We use the pretraining dataset from sharegpt4v, which is about 1.2M QA pairs.

@LinB203

  1. When will the model be released? It looks better than the current model; is it also a 1.8Bx4 MoE?
  2. Which pretraining data were you using, sharegpt4v_instruct_gpt4-vision_cap100k.json or the pt part? Does the pt part perhaps contain a lot of noise?
LinB203 commented 7 months ago

We use the pretraining dataset from sharegpt4v, which is about 1.2M QA pairs.

lucasjinreal commented 7 months ago

@LinB203 I found that sharegpt4v also lacks Chinese data. Do you think any high-quality Chinese image-text pretraining pairs could be used to enhance the Chinese ability?

LinB203 commented 7 months ago

Refer to https://github.com/PKU-YuanGroup/MoE-LLaVA/issues/26#issuecomment-1928765474.

lucasjinreal commented 7 months ago

@LinB203 Do you think raw OCR image-text pairs can be used as pretraining data?