Closed danbraunai-apollo closed 4 months ago
It turns out that the document-splitting problem described below only occurs when you pass specific data_files to your load_dataset call, e.g.:
dataset = load_dataset("roneneldan/TinyStories", data_files=["TinyStories-train.txt"])
It is not an issue when you omit data_files, which is what we do in this repo.
Switch roneneldan/TinyStories -> skeskinen/TinyStories-hf
Description
Motivation and Context
roneneldan/TinyStories has a bug: each document is split over multiple lines, and documents are separated by an eos string. load_dataset has no accompanying loading script to handle this, so a single document ends up split across multiple dataset samples, and several samples consist of nothing but the eos string.
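For illustration, here is a minimal sketch of the regrouping that a loading script would need to do. This is not what this PR does (it switches datasets instead). The function name `merge_split_documents` is hypothetical, the separator is assumed to be the literal string `<|endoftext|>`, and joining lines with a single space is an arbitrary choice.

```python
from typing import Iterable, Iterator

# Assumption: the separator is this literal string on its own line.
EOS = "<|endoftext|>"


def merge_split_documents(lines: Iterable[str], eos: str = EOS) -> Iterator[str]:
    """Regroup line-level samples into whole documents.

    Lines are buffered until an eos line is seen, then emitted as one
    document. Blank lines and eos-only samples are dropped.
    """
    buffer: list[str] = []
    for line in lines:
        text = line.strip()
        if text == eos:
            # End of a document: emit it if we collected anything.
            if buffer:
                yield " ".join(buffer)
                buffer = []
        elif text:
            buffer.append(text)
    # Emit a trailing document that has no closing eos line.
    if buffer:
        yield " ".join(buffer)


# Hypothetical line-level samples, as they might appear after loading the raw text.
samples = [
    "Once upon a time",
    "there was a cat.",
    "<|endoftext|>",
    "<|endoftext|>",
    "The end.",
]
documents = list(merge_split_documents(samples))
```

With the hypothetical samples above, the first two lines are merged into one document, the two eos-only samples are dropped, and the trailing line becomes a second document.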
How Has This Been Tested?
None
Does this PR introduce a breaking change?
Yes. A run of tinystories will now give different results.