aimagelab / LLaVA-MORE

LLaVA-MORE: Enhancing Visual Instruction Tuning with LLaMA 3.1
Apache License 2.0

tokenization mismatch #7

Open ohhan777 opened 2 months ago

ohhan777 commented 2 months ago

Thank you for sharing the great source code. I have been trying to pretrain and fine-tune with LLaMA 3.1. While the pretraining works fine, I noticed that the following warnings occur during the fine-tuning process, preventing the model from training properly:

WARNING: tokenization mismatch: 276 vs. 272. (ignored)
WARNING: tokenization mismatch: 223 vs. 219. (ignored)
WARNING: tokenization mismatch: 131 vs. 127. (ignored)
WARNING: tokenization mismatch: 915 vs. 911. (ignored)
WARNING: tokenization mismatch: 545 vs. 541. (ignored)
WARNING: tokenization mismatch: 210 vs. 206. (ignored)
WARNING: tokenization mismatch: 177 vs. 173. (ignored)
WARNING: tokenization mismatch: 183 vs. 179. (ignored)
WARNING: tokenization mismatch: 168 vs. 164. (ignored)
WARNING: tokenization mismatch: 155 vs. 151. (ignored)
WARNING: tokenization mismatch: 117 vs. 113. (ignored)
WARNING: tokenization mismatch: 781 vs. 777. (ignored)
WARNING: tokenization mismatch: 204 vs. 200. (ignored)
WARNING: tokenization mismatch: 195 vs. 191. (ignored)
WARNING: tokenization mismatch: 107 vs. 103. (ignored)
WARNING: tokenization mismatch: 334 vs. 330. (ignored)
WARNING: tokenization mismatch: 376 vs. 372. (ignored)
WARNING: tokenization mismatch: 146 vs. 142. (ignored)
WARNING: tokenization mismatch: 121 vs. 117. (ignored)

After checking the source code, I found that in train.py, inside the preprocess_llama_3_1() function, cur_len ends up 4 larger than it should be because of the following line:

cur_len = cur_len + len(tokenizer(sep, add_special_tokens=False).input_ids)

As a result, all targets are treated as IGNORE_INDEX, and the model does not train. When I commented out this line, the issue seemed to disappear, and the training worked properly. Was this line intentionally included?
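For context, the failure mode can be sketched with a self-contained toy. This is not the actual train.py code: mask_targets, rounds, and extra_sep_len are illustrative names, and only IGNORE_INDEX (-100, the value ignored by the loss) mirrors the real implementation. It shows why a cur_len inflated by the separator length masks the entire example:

```python
IGNORE_INDEX = -100  # label value the loss function ignores

def mask_targets(total_len, rounds, extra_sep_len=0):
    """Toy LLaVA-style target masking (illustrative, not train.py).

    total_len: tokenized conversation length.
    rounds: list of (round_len, instruction_len) pairs.
    Returns (targets, mismatch): masked positions are IGNORE_INDEX,
    supervised positions keep a dummy token id (1). On a length
    mismatch the whole sequence is masked, as described in the issue,
    so the example contributes no training signal.
    """
    targets = [1] * total_len
    cur_len = 0
    for round_len, instruction_len in rounds:
        # the user/instruction tokens of each round are never supervised
        for i in range(cur_len, min(cur_len + instruction_len, total_len)):
            targets[i] = IGNORE_INDEX
        cur_len += round_len
        # stand-in for the questioned line, which adds
        # len(tokenizer(sep, add_special_tokens=False).input_ids)
        # on top of a round_len that may already include the separator
        cur_len += extra_sep_len
    mismatch = cur_len != total_len
    if mismatch:
        targets = [IGNORE_INDEX] * total_len  # everything ignored
    return targets, mismatch

# one round: 10 tokens total, the first 4 are the instruction
ok_targets, ok_mismatch = mask_targets(10, [(10, 4)], extra_sep_len=0)
bad_targets, bad_mismatch = mask_targets(10, [(10, 4)], extra_sep_len=4)
```

With extra_sep_len=0 the answer tokens stay supervised; with extra_sep_len=4, cur_len reaches 14 against a sequence of 10, the mismatch fires, and every target becomes IGNORE_INDEX, matching the warnings (e.g. 276 vs. 272) that are consistently off by 4.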

sahilqure commented 2 months ago

@ohhan777 Can you send me the logs after commenting it out?

sahil02235 commented 2 months ago

@federico1-creator This is not solved even after commenting out that line. Can you look into it?

federico1-creator commented 2 months ago

Hi everyone, thank you for your interest in our project!

We have conducted some tests to better understand the difference in behavior between the code we are running and the tokenization mismatch you reported. The problem is the LLaMA 3.1 tokenizer, which was updated by the Meta team. This update creates a mismatch between the version we used during development and the one you are currently using.

To fix this issue, you can use our tokenizer, which is included in the LLaVA-MORE weights. Specifically, I have already updated the training scripts to use the new TOKENIZER_PATH:

https://github.com/aimagelab/LLaVA-MORE/blob/main/scripts/more/11_pretrain_llama_31_acc_st_1.sh
https://github.com/aimagelab/LLaVA-MORE/blob/main/scripts/more/12_finetuning_llama_31_acc_st_1.sh
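For anyone applying this to their own setup, the change amounts to pointing the script's tokenizer variable at the bundled LLaVA-MORE weights rather than Meta's updated hub tokenizer. A rough sketch (the path below is a placeholder, and the exact variable layout inside the scripts may differ):

```shell
# Inside e.g. scripts/more/12_finetuning_llama_31_acc_st_1.sh (sketch):
TOKENIZER_PATH="/path/to/llava-more-checkpoint"  # placeholder: directory containing the bundled tokenizer files
```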

@ohhan777 @sahilqure @sahil02235

sahil02235 commented 2 months ago

@federico1-creator Thanks for this will check it.

waybarrios commented 1 month ago

If I train everything from scratch, could I get this error too?