DachengLi1 / LongChat

Official repository for LongChat and LongEval
Apache License 2.0

Maybe a bug in the preprocess function? #26

Open Richar-Du opened 1 year ago

Richar-Du commented 1 year ago

Thanks for your awesome work enabling the community to train LLMs on very long contexts! However, I find that in the preprocess function, the lines https://github.com/DachengLi1/LongChat/blob/a824bda25c0082e60973c35c79b0f35d69c6be2d/longchat/train/fine_tune/train.py#L125 and https://github.com/DachengLi1/LongChat/blob/a824bda25c0082e60973c35c79b0f35d69c6be2d/longchat/train/fine_tune/train.py#L137 set the target to [1, -100, -100, ...], i.e. the first element is not ignored. I think FastChat has the correct code, which first sets target[:cur_len] = IGNORE_TOKEN_ID so that the target becomes [-100, -100, -100, ...]. Am I right? @DachengLi1
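
For concreteness, here is a minimal sketch of the masking pattern I mean. This is not the repository's actual code; `mask_prompt` and the toy token ids are made up for illustration:

```python
import torch

IGNORE_TOKEN_ID = -100  # the default ignore_index of CrossEntropyLoss

def mask_prompt(target: torch.Tensor, cur_len: int) -> torch.Tensor:
    """Mask the first cur_len tokens (the prompt part) of a labels tensor."""
    target = target.clone()
    # The questioned code effectively leaves position 0 (the BOS token, id 1)
    # unmasked, giving [1, -100, -100, ...]. Masking from position 0 instead
    # gives [-100, -100, -100, ...] so no prompt token contributes to the loss:
    target[:cur_len] = IGNORE_TOKEN_ID
    return target

# Toy example: the first 3 tokens are the prompt, the rest is the response.
labels = torch.tensor([1, 319, 13563, 727, 338, 263])
print(mask_prompt(labels, cur_len=3))
# tensor([-100, -100, -100,  727,  338,  263])
```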

DachengLi1 commented 1 year ago

@Richar-Du Thanks a lot for the feedback! I do think you are correct. I remember setting that 1 for some reason (probably some annoying tokenizer mismatch problem). It wasn't causing trouble when training the first version of LongChat, so I left it there.

You are very right. Actually, I am running into a potential bug in data processing myself while upgrading LongChat to Llama-2 and still need to debug more. Let me know if you find the correct way to preprocess it.

Richar-Du commented 1 year ago

@DachengLi1 Thanks for your reply, and I'm very glad to give some feedback :) Honestly speaking, I am still trying to debug it because I'm not very familiar with this code. I added target[:cur_len] = IGNORE_TOKEN_ID to change the 1 to -100, but the training results are still abnormal. I am going to compare the FastChat and LongChat code and try to solve it.

Richar-Du commented 1 year ago

@DachengLi1 I found that the only difference between FastChat and LongChat is that LongChat uses condensed RoPE. So if I add replace_llama_with_condense(ratio=8) to train_mem.py in FastChat, will FastChat behave the same as LongChat?

As far as I have tried, adding replace_llama_with_condense(ratio=8) does make FastChat support long-context fine-tuning; a minimal sketch of the setup is below.
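
This is only a sketch of what that change could look like; the import path for the monkey patch is an assumption based on the LongChat repo layout, so please verify it against the actual module before using it:

```python
# Sketch of a modified train_mem.py for FastChat (assumed import path below).
# The patch must run before transformers builds the LLaMA model, so the
# condensed rotary-embedding class is the one that actually gets instantiated.
from longchat.train.monkey_patch.llama_condense_monkey_patch import (
    replace_llama_with_condense,
)

replace_llama_with_condense(ratio=8)  # 8x condensed RoPE, e.g. 2k -> ~16k context

# Then continue with FastChat's normal training entry point.
from fastchat.train.train import train

if __name__ == "__main__":
    train()
```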