sanderland opened this issue 3 weeks ago
Thanks for raising that. Need to investigate in the next few days.
When you mentioned

> (close to) main

could you check the version? Asking because I don't think that `skip_special_tokens` is a valid argument.
version = "0.4.10", but when I said
adding this to full.py along with support for skip_special_tokens=False
I meant I added that option to help debug.
Ah yes, the reason why I was asking is that I was getting a

```
TypeError: Tokenizer.decode() got an unexpected keyword argument 'skip_special_tokens'
```

and I was wondering where you applied this.
You can see my (somewhat messy) branch here: https://github.com/Lightning-AI/litgpt/compare/main...sanderland:dev?expand=1
Ah thanks! I still don't understand why this wouldn't work for me; I get a `TypeError: Tokenizer.decode() got an unexpected keyword argument 'skip_special_tokens'`. I need to investigate more (maybe a version issue).
Anyway, I just double-checked the `generate_example` function, and for a prompt such as "What food do llamas eat?", the actual prompt that is passed to the tokenizer during finetuning with the default Alpaca style looks like this:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Recommend a movie for me to watch during the weekend and explain the reason.

### Response:
```
and then with the `--data.prompt_style llama3` you were using:
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Recommend a movie for me to watch during the weekend and explain the reason.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
So that part at least looks all ok to me.
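For reference, these two templates can be reproduced directly from litgpt's prompt styles. A minimal check, assuming the `litgpt.prompts` module and its `Alpaca`/`Llama3` classes as on a recent main:

```python
# Sketch: print the two prompt templates shown above.
# Assumes litgpt.prompts exposes Alpaca and Llama3 with an apply() method;
# double-check against your installed litgpt version.
from litgpt.prompts import Alpaca, Llama3

instruction = "Recommend a movie for me to watch during the weekend and explain the reason."
print(Alpaca().apply(instruction))  # Alpaca-style prompt
print(Llama3().apply(instruction))  # Llama 3 chat-style prompt
```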
`skip_special_tokens` is a parameter in huggingface, but not in litgpt; I just added the pass-through to debug.
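The pass-through was roughly the following (a sketch of the debug patch, not the exact diff; the real change is in the branch linked above). It assumes `self.processor` is a `tokenizers.Tokenizer`, whose `decode()` does accept `skip_special_tokens`:

```python
import torch

# Sketch of a debug pass-through for litgpt's Tokenizer.decode (HF backend only):
# forward skip_special_tokens to the underlying tokenizers.Tokenizer.decode().
def decode(self, tensor: torch.Tensor, skip_special_tokens: bool = True) -> str:
    tokens = [tensor.item()] if tensor.ndim == 0 else tensor.tolist()
    return self.processor.decode(tokens, skip_special_tokens=skip_special_tokens)
```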
As for your prompt being correct, that doesn't mean the result of `encode()` is:

```python
from tokenizers import Tokenizer as HFTokenizer

processor = HFTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
processor.encode("prompt").ids  # [128000, 41681] = "<|begin_of_text|>", "prompt"
```
That is, the tokenizer itself has a template which adds "<|begin_of_text|>".
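Continuing the snippet above, you can confirm the template is responsible by disabling special tokens at encode time (`add_special_tokens` is a standard argument of `tokenizers`' `encode()`):

```python
# With special tokens disabled, the BOS id disappears, showing it comes
# from the tokenizer's template rather than from the input text.
processor.encode("prompt", add_special_tokens=False).ids  # [41681] = "prompt"
```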
This is another confusing point: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/tokenizer.py#L91. The litgpt tokenizer has special logic to add a BOS token for llama3, but both the huggingface tokenizer AND the template already add one. At least it checks first, so it doesn't end up with three.
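That guard amounts to something like this (paraphrasing the logic, not quoting the source at tokenizer.py#L91):

```python
from tokenizers import Tokenizer as HFTokenizer

processor = HFTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
bos_id = processor.token_to_id("<|begin_of_text|>")  # 128000

# BOS is only prepended when it is not already the first token, so the
# template's BOS and litgpt's extra logic don't stack.
tokens = processor.encode("prompt").ids  # template already adds bos_id
if not tokens or tokens[0] != bos_id:
    tokens = [bos_id] + tokens           # a no-op here, which is the point
```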
Actually I am curious as to how finetuning can work now given https://github.com/Lightning-AI/litgpt/issues/1699
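To make the failure mode concrete: the llama3 prompt style bakes the literal `<|begin_of_text|>` into the prompt text (see the template above), and the tokenizer's own template then prepends another BOS id, so, if I'm reading the interaction right, the encoded prompt starts with a doubled BOS:

```python
from tokenizers import Tokenizer as HFTokenizer

processor = HFTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

# The styled prompt already starts with the literal "<|begin_of_text|>" text;
# the template prepends its own BOS id on top of it.
styled = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|>"
print(processor.encode(styled).ids[:2])  # expected: [128000, 128000]
```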
Bug description

When finetuning Llama3, the encoded data has:

Seems related to #1565, but may be more widespread across models.

Going by the example which downloads alpaca finance:

and adding this to full.py along with support for `skip_special_tokens=False` gives

What operating system are you using?

Unknown

LitGPT Version

(close to) main