allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

Issue with ring tokenizer #175

Closed: davidbrandfonbrener closed this issue 1 month ago

davidbrandfonbrener commented 2 months ago

This line seems to throw a division-by-zero error when ring_size < len(source_paths).

Basically, it seems that len(tokenizer_ring) gets decremented here. The break only exits the inner loop, so the outer loop keeps going and divides by 0 once the ring is empty.
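To make the failure mode concrete, here is a minimal sketch of the pattern described above. It is not the dolma code: the function name `drain_ring_sketch`, the fixed batch size, and the absence of any refill step are all simplifying assumptions chosen only to show how an emptied ring combined with a still-running outer loop leads to a modulo by zero.

```python
from typing import Iterator, List


def drain_ring_sketch(sources: List[List[int]], ring_size: int) -> List[int]:
    """Hypothetical, simplified ring loop (NOT the dolma implementation)."""
    pending: List[Iterator[int]] = [iter(s) for s in sources]
    # Fill the ring with at most `ring_size` iterators; the rest stay pending.
    ring = [pending.pop() for _ in range(min(ring_size, len(pending)))]
    out: List[int] = []

    # Outer loop: keeps running while anything is left anywhere.
    while pending or ring:
        # Inner loop: round-robin over the ring in a fixed-size batch.
        for i in range(3):
            j = i % len(ring)  # ZeroDivisionError once the ring is empty
            try:
                out.append(next(ring[j]))
            except StopIteration:
                ring.pop(j)  # the ring shrinks here
                break  # leaves the inner loop only
    return out
```

In this sketch, `drain_ring_sketch([[1, 2], [3]], ring_size=1)` raises `ZeroDivisionError` because one source is still pending when the ring empties, while `ring_size=2` drains both sources without error.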

I'm not exactly sure what the right fix is, and it seems things work fine as long as ring_size * processes >= num_files. Any clarity here would be appreciated, thanks!
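As a stopgap, the condition above can be checked before launching a run. The helpers below are a sketch based purely on that empirical observation, not on a confirmed invariant of the tokenizer; the names `satisfies_observed_condition` and `smallest_ring_size` are hypothetical and not part of dolma.

```python
def satisfies_observed_condition(ring_size: int, processes: int, num_files: int) -> bool:
    """Check the empirically observed condition: ring_size * processes >= num_files."""
    return ring_size * processes >= num_files


def smallest_ring_size(num_files: int, processes: int) -> int:
    """Smallest ring size meeting the observed condition (assumption, not a guarantee)."""
    if processes < 1:
        raise ValueError("processes must be at least 1")
    # Integer ceiling of num_files / processes, with a floor of 1.
    return max(1, -(-num_files // processes))
```

For example, with 10 files and 4 processes, `smallest_ring_size(10, 4)` gives 3, and `satisfies_observed_condition(3, 4, 10)` holds while a ring size of 2 does not.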

soldni commented 1 month ago

Hey @davidbrandfonbrener, thank you for the report! I think the right behavior would be to reduce the ring size automatically so that the issue doesn't happen. For now, please adjust the ring size manually.