delphi-suite / delphi

small language models training made easy
Apache License 2.0

text dataset is not shuffled before tokenization #152

Open jettjaniak opened 4 months ago

jettjaniak commented 4 months ago

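This is not an issue for our stories-* suite, because the stories dataset is already shuffled. It could be an issue for other datasets, though: documents are concatenated during tokenization, so in an unshuffled dataset the original document order determines which documents end up next to each other in the tokenized sequences.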
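
A minimal sketch of one possible fix, assuming tokenization works by concatenating document tokens into one stream and cutting it into fixed-length sequences. This is not delphi's actual code: `shuffle_then_pack`, `toy_tokenize`, `seq_len`, `seed`, and `eos_id` are all hypothetical names used for illustration only.

```python
import random


def toy_tokenize(text):
    # Hypothetical stand-in for a real tokenizer: one id per character.
    return [ord(c) for c in text]


def shuffle_then_pack(docs, tokenize, seq_len, seed=0, eos_id=0):
    """Shuffle documents, then concatenate their tokens and cut fixed-length sequences."""
    # Shuffle indices with a seeded RNG so the order is reproducible
    # and the caller's list is left untouched.
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)

    # Concatenate tokens of the shuffled documents, separated by an EOS id.
    stream = []
    for i in order:
        stream.extend(tokenize(docs[i]))
        stream.append(eos_id)

    # Cut the stream into full sequences; a trailing partial sequence is dropped.
    n_full = len(stream) // seq_len
    return [stream[k * seq_len:(k + 1) * seq_len] for k in range(n_full)]


docs = ["aa", "bb", "cc", "dd"]
seqs = shuffle_then_pack(docs, toy_tokenize, seq_len=3, seed=0)
```

Shuffling before the concatenation step matters because shuffling the packed sequences afterwards would not separate documents that already share a sequence.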