Backport improvements from the Dolma repo

From what has been communicated to me in discord, the plan is to encapsulate and refactor out the implementation within dolma to use this crate.

If this is correct, I'll try to bring the work here myself as well (a bit busy so cannot commit on any specifics more than that)

In the meantime I'll share how I approached and prepared the original PRs.

I tended to follow the hints from cargo clippy and when everything was clear there, cargo clippy -- -Wclippy::pedantic pedantic can potentially lead to more false positives, but it is still useful to understand whether they are indeed false positives and why. If it is indeed spurious, we can document and allow the lints individually or crate wide.

I also tried to maximize refactoring and modularization, which for me is helpful to understand and lint against.

Some more semantic-syntactic changes I was looking for was investigating the relevant mathematic formula and Rust code, taking into detail how they match and differ. Making clear what was or could be done for code readability and performance was helpful.

Supporting all of this is of course more tests, and I think adding tests here in an organic manner could be the first step forward, both enhancing test coverage here and allowing for more dynamic and robust refactors and fixes.

allenai / bff

Backport improvements from the Dolma repo #7