
More efficient sample packing #1492


dsesclei commented 6 months ago


🔖 Feature description

Move packing to the preprocess step and add a slower but more efficient algorithm.

✔️ Solution

ReEvo is a method that uses LLMs to generate heuristics for optimization problems (similar to FunSearch). It includes an online bin packing example, which I ran with GPT-4 to see if we could get a heuristic better suited to the length distribution seen in sample packing.

Here's a script to try it: https://gist.github.com/dsesclei/4fcf3763f07feaf67b4141429afb3fb8
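
For context, the evolved functions take the shape of a `priority(item, bins)` scoring function over open bins. Below is a minimal hand-written sketch in that shape (a plain best-fit rule, not the GPT-4-generated heuristic itself), with `pack` as a hypothetical greedy driver:

```python
import numpy as np

def priority(item: int, bins: np.ndarray) -> np.ndarray:
    # Score each open bin for `item`; higher is better. Best-fit style:
    # prefer the bin whose remaining capacity is closest to the item
    # length. Bins that cannot fit the item score -inf.
    residual = bins - item
    return np.where(residual >= 0, -residual, -np.inf)

def pack(lengths, capacity):
    # Greedy online packing driven by the priority function; returns
    # the remaining capacity of each bin used.
    bins = []
    for item in lengths:
        if bins:
            scores = priority(item, np.array(bins))
            best = int(np.argmax(scores))
            if scores[best] != -np.inf:
                bins[best] -= item
                continue
        bins.append(capacity - item)
    return bins

# pack([1200, 3000, 900, 4000, 100], capacity=4096) -> 3 bins
```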

It does very well compared to the first-fit-decreasing (FFD) algorithm in multipack.py. For teknium/OpenHermes-v2 with context size 4096, FFD produces 109138 bins, while the generated heuristic needs 8.3% fewer at 100076 (the minimum possible being 100061). This could be cut down even further by having GPT-4 generate a heuristic specifically for that dataset, though it doesn't seem worth it at that point.
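
For reference, this is the classic FFD baseline being compared against (multipack.py's version is vectorized for speed; this plain sketch just shows the algorithm):

```python
def first_fit_decreasing(lengths, capacity):
    # Sort sequences longest-first, then place each one into the first
    # bin with enough remaining room, opening a new bin if none fits.
    bins = []  # remaining capacity per bin
    for item in sorted(lengths, reverse=True):
        for i, remaining in enumerate(bins):
            if item <= remaining:
                bins[i] -= item
                break
        else:
            bins.append(capacity - item)
    return len(bins)
```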

Packing this way is much slower (OpenHermes took about five hours), so it would need to move to the preprocess step.

If this sounds good, I'm happy to implement it; I just wanted to open a discussion first.

❓ Alternatives

We could try generating a faster heuristic, although anything short of near-instant would probably belong in preprocessing anyway.

winglian commented 6 months ago

Thanks for this @dsesclei! Yes, I think offline packing would be a valuable option for users, especially if it provides an 8% improvement in cost/time (since it can all be done on CPU). One thing to consider: we had offline packing once before, and one problem was that most users wanted the data shuffled between epochs. I think this can be solved by including the epoch index as part of the seed, but we would need to do some dataset/dataloader reloading mid-training to load the next epoch's packed dataset.
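
A minimal sketch of what the epoch-seeded shuffling could look like; `packed_bins_for_epoch` and `pack` are hypothetical names rather than existing axolotl functions, and the mid-training dataloader reloading is left out:

```python
import random

def packed_bins_for_epoch(lengths, capacity, base_seed, epoch):
    # Fold the epoch index into the seed so each epoch gets a different
    # but reproducible shuffle (identical across ranks), then repack.
    rng = random.Random(base_seed + epoch)
    shuffled = list(lengths)
    rng.shuffle(shuffled)
    # `pack` stands in for whichever packing routine is in use (FFD or
    # the slower preprocessed heuristic); not an actual axolotl API.
    return pack(shuffled, capacity)
```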

Let me know how I can help. Thanks!