Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

Enable EOS and BOS tokens? #545

Closed nkasmanoff closed 12 months ago

nkasmanoff commented 1 year ago

Please correct me if I'm wrong, but shouldn't BOS and EOS tokens be enabled for this kind of tokenization?

https://github.com/Lightning-AI/lit-gpt/blob/91014a7783d83c7dcb745d384a01bedaa7a8799e/scripts/prepare_dolly.py#L116

My thinking is that during inference, since the fine-tuned LLM doesn't know that answering a question should end with an EOS token, the outputs might go on for too long.

This can be resolved by setting eos & bos to true, or by making them arguments to these functions. If people agree with me, I'm happy to open a PR to amend this for the various tokenization scripts.
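Roughly what I have in mind (just a sketch; the function and argument names here are illustrative, not the actual prepare-script code):

from pathlib import Path
from lit_gpt.tokenizer import Tokenizer

# Illustrative sketch: make BOS/EOS explicit arguments instead of hard-coding them,
# assuming Tokenizer.encode exposes bos/eos/max_length flags.
def tokenize(tokenizer: Tokenizer, string: str, use_bos: bool = True, use_eos: bool = True, max_length: int = -1):
    # BOS marks where a training sample starts; EOS teaches the model to stop after the response.
    return tokenizer.encode(string, bos=use_bos, eos=use_eos, max_length=max_length)

tokenizer = Tokenizer(Path("checkpoints/openlm-research/open_llama_7b"))
print(tokenize(tokenizer, "Hello friends!"))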

rasbt commented 1 year ago

Very good point.

I think it depends on the model and tokenizer used. E.g. GPT models don't use BOS or EOS tokens (only <|endoftext|> when concatenating multiple texts).

But I also agree with you and think these tokens should definitely be enabled, depending on the model.

What we need to do here is probably just extract this info from the tokenizer config files (see the screenshot below from the Llama 2 one, which uses a BOS but not an EOS token).
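Something like this rough sketch, assuming the flags live in the checkpoint's tokenizer_config.json (they do for Llama 2; other models may store them differently, and the path below is just an example):

import json
from pathlib import Path

# Read the special-token defaults shipped with the checkpoint.
config_path = Path("checkpoints/meta-llama/Llama-2-7b-hf/tokenizer_config.json")
config = json.loads(config_path.read_text())
use_bos = config.get("add_bos_token", False)
use_eos = config.get("add_eos_token", False)
print(use_bos, use_eos)  # expected: True False for Llama 2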

But then, I think it's already being taken care of by the Tokenizer class: https://github.com/Lightning-AI/lit-gpt/blob/91014a7783d83c7dcb745d384a01bedaa7a8799e/scripts/prepare_dolly.py#L51C1-L51C1

Any insights, @carmocca?

[Screenshot: the Llama 2 tokenizer config, showing the BOS token enabled but not the EOS token]
nkasmanoff commented 1 year ago

Thanks for the quick reply! That's important surrounding code & config I missed. I'd be curious to hear @carmocca's thoughts too, because I think the tokenizer loads that config but doesn't use it to decide which special tokens to include.

Maybe around here (https://github.com/Lightning-AI/lit-gpt/blob/91014a7783d83c7dcb745d384a01bedaa7a8799e/lit_gpt/tokenizer.py#L33) there could be some lines like self.bos = config.get("add_bos_token"), and that value could then be used to decide whether to include these special tokens in the encode function. Some sort of override argument in encode could still be supported too.
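A rough sketch of that idea (illustrative only; the helper names are mine and the real Tokenizer class looks different):

import json
from pathlib import Path
from typing import Optional

def special_token_defaults(checkpoint_dir: Path) -> tuple[bool, bool]:
    # Derive the default BOS/EOS behavior from the checkpoint's tokenizer_config.json.
    config = json.loads((checkpoint_dir / "tokenizer_config.json").read_text())
    return bool(config.get("add_bos_token", False)), bool(config.get("add_eos_token", False))

def add_special_tokens(token_ids: list[int], bos_id: int, eos_id: int,
                       defaults: tuple[bool, bool],
                       bos: Optional[bool] = None, eos: Optional[bool] = None) -> list[int]:
    # An explicit bos/eos argument overrides the config-derived default.
    use_bos = defaults[0] if bos is None else bos
    use_eos = defaults[1] if eos is None else eos
    if use_bos:
        token_ids = [bos_id] + token_ids
    if use_eos:
        token_ids = token_ids + [eos_id]
    return token_ids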

One issue with this: I took a look at a Falcon config, and it looks like the info is stored slightly differently :-)
https://huggingface.co/tiiuae/falcon-180B/blob/main/tokenizer_config.json

carmocca commented 1 year ago

Finetuning only uses the input_ids field (https://github.com/Lightning-AI/lit-gpt/blob/91014a7783d83c7dcb745d384a01bedaa7a8799e/finetune/lora.py#L283), which does set eos=True (https://github.com/Lightning-AI/lit-gpt/blob/91014a7783d83c7dcb745d384a01bedaa7a8799e/scripts/prepare_dolly.py#L117).

But you are right that we don't use bos by default, whereas HuggingFace does:

from pathlib import Path
from transformers import LlamaTokenizer
from lit_gpt.tokenizer import Tokenizer

prompt = "Hello friends!"

ours = Tokenizer(Path("checkpoints/openlm-research/open_llama_7b"))
print(ours.encode(prompt))  # [16644, 2439, 31905] - no BOS token prepended

theirs = LlamaTokenizer.from_pretrained("openlm-research/open_llama_7b")
print(theirs.encode(prompt))  # [1, 16644, 2439, 31905] - 1 is the BOS token id

And yes, this might need special consideration for the different checkpoints. GPT-NeoX-based models don't seem to use BOS (Falcon, Pythia).
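For example, a quick check (Pythia is just an arbitrary GPT-NeoX-based checkpoint here):

from transformers import AutoTokenizer

# GPT-NeoX-style tokenizers (e.g. Pythia) don't seem to prepend a BOS token by default,
# so the encoding should start directly with the ids for "Hello".
theirs = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
print(theirs.encode("Hello friends!"))  # no leading BOS id, unlike the Llama tokenizer above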

nkasmanoff commented 1 year ago

@carmocca is this a feature you think is worth adding to lit-gpt? Glad to volunteer / submit a PR that handles this special consideration for the different model types and defaults to not including the BOS (& EOS) token.

carmocca commented 1 year ago

Absolutely. However, one detail:

"defaults to not including the BOS (& EOS) token."

Why would this be? This would be chosen based on the model's tokenizer configuration.

nkasmanoff commented 1 year ago

Thanks! And ah, that wasn't clear on my part. What I meant was: if for some reason I can't determine whether to use the BOS token (e.g., it's not evident in the metadata), I'll make the default match the current behavior.

Apologies if that's still confusing; I hope it will make more sense when I send a PR your way!