allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

Text modification config #60

Open rodneykinney opened 11 months ago

rodneykinney commented 11 months ago

Add mixer configuration options to trim leading/trailing whitespace from document text and to enforce a minimum document text length. These options live in a new text_modification config object, and the existing span_replacements config moves into it.
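For illustration only, here is a minimal Python sketch of the behavior described above. It is not the mixer's actual implementation. The names text_modification and span_replacements come from this description; the field names trim_whitespace and min_text_length, the function apply_text_modification, and the choice to trim before checking the length are all assumptions made for the example. Span replacements are carried in the config but not applied here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TextModification:
    # Hypothetical field names; only span_replacements is named in the PR description.
    trim_whitespace: bool = False
    min_text_length: int = 0
    span_replacements: list = field(default_factory=list)  # held, not applied in this sketch


def apply_text_modification(text: str, config: TextModification) -> Optional[str]:
    """Return the modified text, or None if the document should be dropped."""
    if config.trim_whitespace:
        # Remove leading and trailing whitespace only; interior whitespace is untouched.
        text = text.strip()
    # Assumption: the length check runs on the text after trimming.
    if len(text) < config.min_text_length:
        return None
    return text


if __name__ == "__main__":
    cfg = TextModification(trim_whitespace=True, min_text_length=5)
    print(repr(apply_text_modification("  hello world \n", cfg)))
    print(repr(apply_text_modification("   hi   ", cfg)))
```

Under these assumptions, a document that is mostly whitespace is dropped once trimming is enabled, because its length is measured after the trim.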

@soldni any objections to this backward-incompatible change to config structure?

rodneykinney commented 11 months ago

Not sure what's happening with the automated tests. Maybe they are timing out?

make test passes locally, except for the test_download_file Rust test, which also fails on the main branch.