allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

Add an option to improve tokenization shuffling #141

Closed: soldni closed this 5 months ago

soldni commented 5 months ago

Added the flag --sample_ring_prop to dolma tokens (off by default). When set to true, files in the ring are sampled in proportion to their size, so that smaller files are not disproportionately represented at the beginning of each output file.
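For illustration only, here is a minimal sketch of the idea, not the actual dolma implementation. The function sample_ring and its arguments are hypothetical, and it assumes that "size" means the number of documents still unread in each file. With uniform sampling, a small file is picked as often as a large one until it runs out, so its documents cluster at the start of the output; weighting by size spreads them out.

```python
import random


def sample_ring(sizes, proportional, seed=0):
    """Return the order in which files are picked until all are exhausted.

    sizes: dict mapping a (hypothetical) file name to its document count.
    proportional: if True, pick files with probability proportional to
        their remaining document count; otherwise pick uniformly.
    """
    rng = random.Random(seed)
    remaining = dict(sizes)
    order = []
    while remaining:
        names = list(remaining)
        if proportional:
            # Assumption: weight each file by its remaining documents.
            weights = [remaining[name] for name in names]
            pick = rng.choices(names, weights=weights, k=1)[0]
        else:
            pick = rng.choice(names)
        order.append(pick)
        remaining[pick] -= 1
        if remaining[pick] == 0:
            # An exhausted file leaves the ring.
            del remaining[pick]
    return order


sizes = {"small": 10, "large": 990}
uniform_head = sample_ring(sizes, proportional=False)[:100]
prop_head = sample_ring(sizes, proportional=True)[:100]
print("small docs in first 100 (uniform):     ", uniform_head.count("small"))
print("small docs in first 100 (proportional):", prop_head.count("small"))
```

In this toy setup, the uniform run typically places every document from the small file within the first few dozen draws, while the proportional run leaves most of them for later in the stream.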