Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

Llama3 finetuning and generation: Double begin_of_text, no eot_id #1682

Open sanderland opened 3 weeks ago

sanderland commented 3 weeks ago

Bug description

When finetuning Llama3, the encoded data has a doubled <|begin_of_text|> token at the start and no <|eot_id|> at the end.

Seems related to #1565, but may be more widespread across models.

Going by the example that downloads the Alpaca finance dataset:

litgpt finetune_full meta-llama/Meta-Llama-3.1-8B-Instruct \
  --config configs/llama31-8b.yaml \
  --data JSON \
  --data.json_path my_custom_dataset.json \
  --data.mask_prompt True \
  --data.prompt_style llama3 \
  --data.val_split_fraction 0.05

and adding the following to full.py, along with support for skip_special_tokens=False:

        if fabric.global_rank == 0 and state["iter_num"] == 1:
            non_pad_ids = input_ids[0][input_ids[0] != 0] # assume pad token id is 0
            fabric.print(f"First row of input ids with total shape {input_ids.shape}: {non_pad_ids}")
            fabric.print(f"Detokenized: {tokenizer.decode(non_pad_ids, skip_special_tokens=False)}")

gives

First row of input ids with total shape torch.Size([4, 765]): tensor([128000, 128000, 128006,   9125, 128007,    271,   264, [...] 459,   9341,     13])
Detokenized: <|begin_of_text|><|begin_of_text|><|start_header_id|> [..] accurate valuation of an investment.

What operating system are you using?

Unknown

LitGPT Version

(close to) main

rasbt commented 3 weeks ago

Thanks for raising that. Need to investigate in the next few days

rasbt commented 3 weeks ago

When you mentioned

(close to) main

could you check the version? Asking because I don't think that skip_special_tokens is a valid argument.

sanderland commented 3 weeks ago

When you mentioned

(close to) main

could you check the version? Asking because I don't think that skip_special_tokens is a valid argument.

version = "0.4.10", but when I said

adding this to full.py along with support for skip_special_tokens=False

I meant I added that option to help debug.

rasbt commented 3 weeks ago

Ah yes, the reason why I was asking is that I was getting a

TypeError: Tokenizer.decode() got an unexpected keyword argument 'skip_special_tokens'

and I was wondering where you applied this.

sanderland commented 3 weeks ago

You can see my (somewhat messy) branch here: https://github.com/Lightning-AI/litgpt/compare/main...sanderland:dev?expand=1

rasbt commented 3 weeks ago

Ah, thanks! I still don't understand why this doesn't work for me; I keep getting TypeError: Tokenizer.decode() got an unexpected keyword argument 'skip_special_tokens'. I need to investigate more (maybe a version issue).

Anyways, I just double-checked the generate_example function, and for a prompt

What food do llamas eat?

The actual prompt that is passed to the tokenizer looks like this during finetuning with the default Alpaca style:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Recommend a movie for me to watch during the weekend and explain the reason.

### Response:

and then with the --data.prompt_style llama3 setting you were using:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Recommend a movie for me to watch during the weekend and explain the reason.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

So that part at least looks all ok to me.

sanderland commented 3 weeks ago

skip_special_tokens is a parameter in the Hugging Face tokenizer, but not in litgpt; I just added the pass-through to help debug.
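
For reference, a pass-through of this sort might look as follows (a sketch only, not the exact code in the branch linked above; it assumes the Hugging Face tokenizers backend, whose decode() already accepts skip_special_tokens):

# Sketch of a decode() pass-through in litgpt's Tokenizer (illustrative only,
# not the exact code in the branch linked above).
def decode(self, tensor, skip_special_tokens: bool = True) -> str:
    token_ids = [tensor.item()] if tensor.ndim == 0 else tensor.tolist()
    if self.backend == "huggingface":
        # the Hugging Face `tokenizers` backend accepts the flag natively
        return self.processor.decode(token_ids, skip_special_tokens=skip_special_tokens)
    # the sentencepiece backend has no such flag; fall back to a plain decode
    return self.processor.decode(token_ids)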

As for your prompt being correct: that doesn't mean the result of encode() is. For example:

from tokenizers import Tokenizer as HFTokenizer
processor = HFTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
processor.encode("prompt").ids # [128000, 41681] = "<|begin_of_text|>" , "prompt"

That is, the tokenizer itself has a template that already adds "<|begin_of_text|>".
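
To make that concrete, here is a sketch (using the Hugging Face tokenizers API directly, outside litgpt, with default settings) of how the duplication can arise when the prompt string already spells out <|begin_of_text|>, as the llama3 prompt style does:

from tokenizers import Tokenizer as HFTokenizer

processor = HFTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
# The llama3 prompt style already spells out the BOS marker in the text ...
ids = processor.encode("<|begin_of_text|><|start_header_id|>system<|end_header_id|>").ids
# ... and the tokenizer's own template prepends another one on top,
# so the encoding is expected to start with two 128000 tokens:
print(ids[:3])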

sanderland commented 3 weeks ago

This is another confusing point: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/tokenizer.py#L91. The litgpt tokenizer has special logic to add a BOS token for Llama 3, but both the Hugging Face tokenizer AND the template already add one. At least it checks first, so you don't end up with three.
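
For illustration, the kind of guard being described only prepends the BOS id when it is not already present (a sketch, not the litgpt source), which is why the example above ends up with two BOS tokens rather than three:

# Sketch of a "prepend BOS only if missing" guard (illustrative, not the litgpt code).
def prepend_bos_if_missing(token_ids: list[int], bos_id: int) -> list[int]:
    if not token_ids or token_ids[0] != bos_id:
        return [bos_id] + token_ids
    return token_ids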

calvintwr commented 1 week ago

Actually, I am curious how finetuning can work at all right now, given https://github.com/Lightning-AI/litgpt/issues/1699.