Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0
8.25k stars 843 forks source link

Gemma: A study of the effect of the new issues #1031

Open Andrei-Aksionov opened 4 months ago

Andrei-Aksionov commented 4 months ago

Daniel Han in his blogpost shared his discoveries of what is ~also~ wrong with the Gemma implementation. Some of them are only appliable to Keras, some of them affects PyTorch too. This issue will include information concerning those bugs and ours codebase.

Feel free to chime in.

Thanks to @kashif for letting us know.

Andrei-Aksionov commented 4 months ago

Issue # 3

We don't downcast the scale value and keep it in torch.float32. So this one doesn't affect us. https://github.com/Lightning-AI/lit-gpt/blob/f241d94df59d82b2017bfdcd3800ac8779eb45f5/lit_gpt/model.py#L90-L91

Andrei-Aksionov commented 4 months ago

Issue # 8

Was fixed in #1004

awaelchli commented 4 months ago

Issue 2 is ok I think: https://github.com/Lightning-AI/litgpt/blob/f241d94df59d82b2017bfdcd3800ac8779eb45f5/chat/base.py#L369 We also add the new line.

awaelchli commented 4 months ago

Issue 1 is also ok I think. LitGPT's tokenizer adds bos if the tokenizer config has this: https://huggingface.co/google/gemma-2b-it/blob/718cb189da9c5b2e55abe86f2eeffee9b4ae0dad/tokenizer_config.json#L2

Andrei-Aksionov commented 4 months ago

Issue 2 is ok I think:

Yep. Maybe it's a type in the technical report

Screenshot 2024-03-08 at 7 28 48 PM

since here model's part ends with <end_of_turn>/model, while the prompt should be structured in this way:

<start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
{model prompt}<end_of_turn>

meaning that right after <end_of_turn> should be only <start_of_turn.

Anyway, our prompt doesn't have this issue and this is all what we can control.

Andrei-Aksionov commented 4 months ago

Issue 1 is also ok I think. LitGPT's tokenizer adds bos if the tokenizer config has this: https://huggingface.co/google/gemma-2b-it/blob/718cb189da9c5b2e55abe86f2eeffee9b4ae0dad/tokenizer_config.json#L2

Yep, the config has "add_bos_token": true and in that case our tokenizer adds it: https://github.com/Lightning-AI/litgpt/blob/f241d94df59d82b2017bfdcd3800ac8779eb45f5/lit_gpt/tokenizer.py#L96-L100

awaelchli commented 4 months ago

Yes that's what the blog you linked says, that there is a typo in the technical report. So we should not do what is in the technical report of Google. That's my understanding.