Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed.
`PackedDatasetBuilder` does not separate with `sep_token` #482
I noticed that `PackedDatasetBuilder` does not separate the tokens with `sep_token`. To illustrate, see https://github.com/Lightning-AI/lit-llama/blob/da71adea0970d6d950fb966d365cfb428aef8298/scripts/prepare_redpajama.py#L71 and https://github.com/Lightning-AI/lit-llama/blob/da71adea0970d6d950fb966d365cfb428aef8298/scripts/prepare_redpajama.py#L85.
The minimal reproducible code is along the following lines. This is a sketch: the token ids are hypothetical (`1` = bos, `2` = eos, `5` = "foo"), and the `PackedDatasetBuilder` arguments mirror those used in the prepare script above:
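```python
import numpy as np
from lit_llama.packed_dataset import PackedDatasetBuilder

builder = PackedDatasetBuilder(
    outdir=".",
    prefix="test",
    chunk_size=12,       # tiny chunk size, for illustration only
    sep_token=1,         # bos id, as in prepare_redpajama.py
    dtype="auto",        # resolves to uint16 for vocab_size < 65500
    vocab_size=32000,
)

# Two tokenized "documents", each [bos, foo] -- mirroring
# tokenizer.encode(text), which prepends bos but appends no eos.
builder.add_array(np.array([1, 5], dtype=builder.dtype))
builder.add_array(np.array([1, 5], dtype=builder.dtype))

# Inspect the in-progress chunk buffer.
print(builder._arr)
```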
`1` represents the bos token, and `2` represents the eos token. As you can see, the printed buffer translates to:
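```
[1, 5, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1]
```

Read with the hypothetical ids, that is `<bos> foo <bos> foo` followed by `sep_token` (here the bos id) padding out the chunk: consecutive documents are packed back-to-back, and `sep_token` only fills the unused tail of the buffer instead of separating them.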
Shouldn't the `foo`s be wrapped in bos and eos tokens, like this?
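With the same hypothetical ids, something like:

```
[1, 5, 2, 1, 5, 2, 1, 1, 1, 1, 1, 1]
```

i.e. each document framed as `<bos> foo <eos>`, with `sep_token` used only to pad the remainder of the chunk.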