allenai / bff

Apache License 2.0
37 stars 8 forks source link

Backport improvements from the Dolma repo #7

Open dirkgr opened 11 months ago

dirkgr commented 11 months ago

@chris-ha458 has made some great improvements to BFF in the https://github.com/allenai/dolma repo. We should back-port those changes here, especially the ones that have to do with correctness (like the ones involving the choice of hash functions).

Chris' PRs are here:

They won't apply 1:1, because things have changed in the Dolma repo, but at least the important things should carry over.

chris-ha458 commented 11 months ago

From what has been communicated to me in discord, the plan is to encapsulate and refactor out the implementation within dolma to use this crate.

If this is correct, I'll try to bring the work here myself as well (a bit busy so cannot commit on any specifics more than that)

In the meantime I'll share how I approached and prepared the original PRs.

I tended to follow the hints from cargo clippy and when everything was clear there, cargo clippy -- -Wclippy::pedantic pedantic can potentially lead to more false positives, but it is still useful to understand whether they are indeed false positives and why. If it is indeed spurious, we can document and allow the lints individually or crate wide.

I also tried to maximize refactoring and modularization, which for me is helpful to understand and lint against.

Some more semantic-syntactic changes I was looking for was investigating the relevant mathematic formula and Rust code, taking into detail how they match and differ. Making clear what was or could be done for code readability and performance was helpful.

Supporting all of this is of course more tests, and I think adding tests here in an organic manner could be the first step forward, both enhancing test coverage here and allowing for more dynamic and robust refactors and fixes.