Closed: wmlba closed this issue 1 year ago
Hi Will,
There is an applied function: `lambda sample: tokenizer(sample["text"])`. It's an anonymous function. It might be clearer if we wrote it as `lambda x: tokenizer(x["text"])`.
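For illustration, here is a minimal, self-contained sketch of that pattern; the model name and example texts are placeholders, not taken from the original notebook:

```python
from datasets import Dataset
from transformers import AutoTokenizer

# Placeholder tokenizer; the original notebook's model may differ.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Two toy records sharing the "Summarize the chat dialogue:" template prefix.
dataset = Dataset.from_dict({
    "text": [
        "Summarize the chat dialogue:\nA: Hi!\nB: Hello!",
        "Summarize the chat dialogue:\nA: Lunch?\nB: Sure.",
    ]
})

# map() calls the lambda once per record; `sample` is one record's dict,
# so a different sample["text"] is tokenized on every call.
tokenized = dataset.map(lambda sample: tokenizer(sample["text"]))
```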
If you sample after the mapping and look at the first few tokens, it does look like they're all the same, but that's because we previously applied a template that prefixes each sample with `Summarize the chat dialogue:`.
If we compare the ends of the token sequences from two different samples, you can see that they're not the same:
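The comparison might look like this (continuing the sketch above; the exact token ids depend on the tokenizer):

```python
# Compare two tokenized records from the sketch above.
first = tokenized[0]["input_ids"]
second = tokenized[1]["input_ids"]

print(first[:5])    # same leading tokens in both records:
print(second[:5])   # the shared "Summarize the chat dialogue:" template
print(first[-5:])   # different trailing tokens in each record:
print(second[-5:])  # the underlying dialogues themselves
```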
The `dataset.map()` function maps a function, applying it to every record in the dataset. You do not apply a function here; you pass `sample['text']`, and consequently all the tokens you use for fine-tuning are exactly the same.
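For reference, a minimal sketch (reusing the placeholder `dataset` and `tokenizer` above) that checks this: `map` does call the passed function once per record, and the resulting token ids differ between records:

```python
# `map` treats the passed lambda/function as a callable and invokes it per record.
seen_texts = []

def tokenize(sample):
    seen_texts.append(sample["text"])   # record which text each call received
    return tokenizer(sample["text"])

# load_from_cache_file=False forces the function to actually run here.
out = dataset.map(tokenize, load_from_cache_file=False)
print(len(seen_texts))                  # one call per record, each with its own text
print(out[0]["input_ids"] == out[1]["input_ids"])  # False: the tokens differ
```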