ApolloResearch / rib

Library for methods related to the Local Interaction Basis (LIB)
MIT License

Switch roneneldan/TinyStories -> skeskinen/TinyStories-hf #342

Closed: danbraunai-apollo closed this 4 months ago

danbraunai-apollo commented 4 months ago

Description

Motivation and Context

roneneldan/TinyStories has a bug: in the raw text files, each document is split over multiple lines, and documents are separated by an eos string. The dataset has no accompanying loading script to handle this, so load_dataset treats every line as its own sample. We end up with each document split over multiple dataset samples, plus several samples that contain nothing but the eos string.
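For illustration, the regrouping logic that a loading script would have to supply can be sketched as follows. This is a minimal sketch and not code from this repo: the function name `regroup_documents` is hypothetical, and the separator `<|endoftext|>` is an assumption that should be checked against the raw file.

```python
from typing import Iterable, Iterator

# Assumed separator string; verify against the raw text file before relying on it.
EOS = "<|endoftext|>"


def regroup_documents(lines: Iterable[str], eos: str = EOS) -> Iterator[str]:
    """Merge line-level samples back into whole documents (hypothetical helper).

    Lines are accumulated until a line consisting only of the eos string is
    seen, at which point the accumulated lines are emitted as one document.
    """
    buffer = []
    for line in lines:
        if line.strip() == eos:
            # A separator line closes the current document, if there is one.
            if buffer:
                yield "\n".join(buffer)
            buffer = []
        else:
            buffer.append(line.rstrip("\n"))
    # Emit a trailing document that is not followed by a separator.
    if buffer:
        yield "\n".join(buffer)


# Toy input mimicking the line-per-sample layout described above.
toy_lines = [
    "Once upon a time,",
    "there was a cat.",
    "<|endoftext|>",
    "Tom had a ball.",
    "<|endoftext|>",
]
docs = list(regroup_documents(toy_lines))
```

On the toy input, the five line-level samples collapse into two documents and the separator-only lines disappear.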

How Has This Been Tested?

None

Does this PR introduce a breaking change?

Yes. A run of tinystories will now give different results.

danbraunai-apollo commented 4 months ago

It turns out that this is actually only an issue when you pass specific data_files to your load_dataset call, e.g.:

dataset = load_dataset("roneneldan/TinyStories", data_files=["TinyStories-train.txt"])

But it's not an issue when data_files is omitted, which is how this repo calls load_dataset.
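One way to check which case a given load falls into is to count the samples that contain only the eos string. The sketch below is hypothetical: `summarize_samples` is not part of this repo, the `<|endoftext|>` separator is an assumption, and it assumes each sample is a dict with a "text" field, as text-based datasets usually yield. It has not been run against the real dataset here.

```python
def summarize_samples(samples, eos="<|endoftext|>"):
    """Count total samples and samples that are only the (assumed) eos string."""
    total = 0
    eos_only = 0
    for sample in samples:
        total += 1
        if sample["text"].strip() == eos:
            eos_only += 1
    return {"total": total, "eos_only": eos_only}


# Toy stand-in for a split loaded with data_files, where lines become samples.
toy_split = [
    {"text": "Once upon a time,"},
    {"text": "there was a cat."},
    {"text": "<|endoftext|>"},
]
summary = summarize_samples(toy_split)
```

A nonzero `eos_only` count on a real split would suggest the line-splitting behaviour described above. The split returned by load_dataset could be passed in place of `toy_split`.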