aws-samples / amazon-sagemaker-generativeai

Repository for training and deploying Generative AI models, including text-text, text-to-image generation and prompt engineering playground using SageMaker Studio.
MIT No Attribution

Tokenization mapping for Falcon fine tuning notebook is done wrong #17

Closed · wmlba closed this issue 1 year ago

wmlba commented 1 year ago

The dataset.map() function maps a function over the dataset, applying it to every record. You do not apply a function here; you just pass sample["text"], and consequently all the tokens you use for fine-tuning are exactly the same:

lm_train_dataset = train_dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, batch_size=32, remove_columns=list(train_dataset.features)
)

lm_test_dataset = test_dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(test_dataset.features)
)
seanpmorgan commented 1 year ago

Hi Will,

There is an applied function: lambda sample: tokenizer(sample["text"]). It's an anonymous function. It might be clearer if we wrote it as lambda x: tokenizer(x["text"]).
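
For example, the same call written with a named function instead of the lambda (a minimal sketch reusing the notebook's tokenizer and train_dataset) does exactly the same thing:

def tokenize_batch(batch):
    # datasets passes each batch as a dict of column name -> list of values,
    # so batch["text"] is a list of strings; the tokenizer returns
    # input_ids/attention_mask lists that map() merges back in per record.
    return tokenizer(batch["text"])

lm_train_dataset = train_dataset.map(
    tokenize_batch,
    batched=True,
    batch_size=32,
    remove_columns=list(train_dataset.features),
)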

If you sample after the mapping and look at the first few tokens, it does look like they're all the same, but that's because we previously applied a template that prefixes each sample with Summarize the chat dialogue:.

If we compare the ends of the tokens from two different samples, you can see that they're not the same:

[screenshot: the trailing token IDs of two different samples differ]
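
You can reproduce that check with something like this (a rough sketch that assumes the lm_train_dataset produced by the map() call above):

# The shared "Summarize the chat dialogue:" template makes the leading
# tokens identical, so compare the trailing tokens instead.
first_tail = lm_train_dataset[0]["input_ids"][-20:]
second_tail = lm_train_dataset[1]["input_ids"][-20:]

print(first_tail)
print(second_tail)
print("identical:", first_tail == second_tail)  # expected: False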