Lightning-AI / lit-llama

Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed.

`PackedDatasetBuilder` does not separate with `sep_token` #482

Open calvintwr opened 8 months ago

calvintwr commented 8 months ago

I noticed that `PackedDatasetBuilder` does not separate the tokens with `sep_token`.

To illustrate, the builder is constructed at https://github.com/Lightning-AI/lit-llama/blob/da71adea0970d6d950fb966d365cfb428aef8298/scripts/prepare_redpajama.py#L71:

        builder = packed_dataset.PackedDatasetBuilder(
            outdir=destination_path,
            prefix=prefix,
            chunk_size=chunk_size,
            sep_token=tokenizer.bos_id,
            dtype="auto",
            vocab_size=tokenizer.vocab_size,
        )

and the text is tokenized at https://github.com/Lightning-AI/lit-llama/blob/da71adea0970d6d950fb966d365cfb428aef8298/scripts/prepare_redpajama.py#L85:

        text_ids = tokenizer.encode(text)
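
For context, as far as I can tell `sep_token` is only used to pre-fill the chunk buffer inside `PackedDatasetBuilder`; it is never written between the arrays passed to `add_array`. Here is a stripped-down sketch of the packing logic as I understand it (an illustration only, not the actual lit-llama source):

    import numpy as np

    class SimplifiedPackedDatasetBuilder:
        """Illustrative sketch only -- not the real lit-llama implementation."""

        def __init__(self, chunk_size: int, sep_token: int, dtype=np.uint16):
            self._chunk_size = chunk_size
            self._sep_token = sep_token
            self._dtype = dtype
            # The chunk buffer is pre-filled with sep_token -- the only place it is used.
            self._arr = np.full(chunk_size, sep_token, dtype=dtype)
            self._idx = 0

        def add_array(self, arr) -> None:
            # Token ids are copied in back-to-back; no sep_token is inserted between arrays.
            arr = np.asarray(arr, dtype=self._dtype)
            while self._idx + len(arr) > self._chunk_size:
                part_len = self._chunk_size - self._idx
                self._arr[self._idx:] = arr[:part_len]
                self._write_chunk()
                arr = arr[part_len:]
            self._arr[self._idx : self._idx + len(arr)] = arr
            self._idx += len(arr)

        def _write_chunk(self) -> None:
            # The real builder dumps self._arr to disk here; this sketch just resets the buffer.
            self._arr = np.full(self._chunk_size, self._sep_token, dtype=self._dtype)
            self._idx = 0

    builder = SimplifiedPackedDatasetBuilder(chunk_size=6, sep_token=1)
    builder.add_array([7953, 2])
    builder.add_array([7953, 2])
    print(builder._arr)  # [7953    2 7953    2    1    1] -- no separator between the two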

The minimal reproducible code is as follows:

    from pathlib import Path
    import numpy as np
    from lit_gpt.tokenizer import Tokenizer
    from lit_gpt.packed_dataset import PackedDatasetBuilder

    tokenizer = Tokenizer(Path('tokenizer'))

    content = 'foo'

    tokenized = tokenizer.encode(content)

    print(tokenized)
    # prints:
    # tensor([7953,    2], dtype=torch.int32)

    training_dataset_builder = PackedDatasetBuilder(
        outdir='FOO',
        # Use process_id to differentiate builders
        prefix='BAR',
        chunk_size=6,
        sep_token=tokenizer.bos_id,
        dtype="auto",
        vocab_size=tokenizer.vocab_size,
    )

    training_dataset_builder.add_array(np.array(tokenized))
    print(training_dataset_builder._arr)
    # prints:
    # [7953    2    1    1    1    1]

    training_dataset_builder.add_array(np.array(tokenized))
    print(training_dataset_builder._arr)
    # prints:
    # [7953    2 7953    2    1    1]

1 represents the bos token (and, since `sep_token=tokenizer.bos_id`, it is also the value that pads the unused tail of the chunk). 2 represents the eos token.
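
They can be double-checked on the tokenizer itself (assuming the standard LLaMA vocabulary, this prints `1 2`):

    print(tokenizer.bos_id, tokenizer.eos_id)
    # prints: 1 2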

In other words, the final `_arr` above translates to:

    foo</s>foo</s><s><s>

Shouldn't each `foo` be wrapped in bos and eos tokens, like this?

    # Tensor
    [   1 7953    2    1 7953    2]

    # Plain text
    <s>foo</s><s>foo</s>
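
In the meantime, a possible workaround is to prepend the bos id before handing the ids to the builder. A sketch continuing from the repro above (the `add_with_bos` helper is mine, just for illustration, not part of the library):

    import numpy as np

    # Fresh builder with the same settings as above.
    builder = PackedDatasetBuilder(
        outdir='FOO',
        prefix='BAR',
        chunk_size=6,
        sep_token=tokenizer.bos_id,
        dtype="auto",
        vocab_size=tokenizer.vocab_size,
    )

    def add_with_bos(token_ids):
        # Illustrative helper (not part of the library): prepend bos so each
        # packed document starts with <s>.
        builder.add_array(np.concatenate(([tokenizer.bos_id], np.asarray(token_ids))))

    add_with_bos(tokenized)
    add_with_bos(tokenized)
    print(builder._arr)
    # prints:
    # [   1 7953    2    1 7953    2]

Alternatively, if `Tokenizer.encode` in your checkout accepts `bos`/`eos` flags, encoding with `bos=True` should have the same effect. Ideally, though, `PackedDatasetBuilder` (or `prepare_redpajama.py`) would take care of this itself.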