axolotl-ai-cloud / axolotl


More efficient sample packing #1492

Open · dsesclei opened this issue 7 months ago

dsesclei commented 7 months ago


🔖 Feature description

Move packing to the preprocess step and add a slower but more efficient algorithm.

✔️ Solution

ReEvo is a method that uses LLMs to generate heuristics for optimization problems (similar to FunSearch). Their repo includes an online bin packing example, which I ran with GPT-4 to see whether it could produce something better suited to the length distribution seen in sample packing.

Here's a script to try it: https://gist.github.com/dsesclei/4fcf3763f07feaf67b4141429afb3fb8

It does very well compared to the first-fit-decreasing (FFD) implementation in multipack.py. For teknium/OpenHermes-v2 with a context size of 4096, FFD produces 109138 bins, while the generated heuristic needs 8.3% fewer at 100076 (the minimum possible being 100061). The count could be cut down even further by having GPT-4 generate a heuristic specifically for that dataset, though it doesn't seem worth it at that point.
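
For context, here's first-fit-decreasing in miniature (an illustrative sketch, not the actual multipack.py implementation):

```python
def first_fit_decreasing(lengths, bin_capacity):
    """Pack sample lengths into bins of size bin_capacity, longest first."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins = []       # sample indices assigned to each bin
    remaining = []  # free capacity left in each bin
    for i in order:
        for b, free in enumerate(remaining):
            if lengths[i] <= free:  # place in the first bin with room
                bins[b].append(i)
                remaining[b] -= lengths[i]
                break
        else:
            bins.append([i])  # no bin fits; open a new one
            remaining.append(bin_capacity - lengths[i])
    return bins
```

Heuristics like the ones ReEvo evolves typically replace the "first bin with room" rule with a learned scoring function over candidate bins.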

Packing this way is much slower - OpenHermes took about five hours - so it would need to be moved to the preprocess step.
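
The preprocess-time integration could be as simple as caching the bin assignments (a minimal sketch with hypothetical names; the real hook into axolotl's preprocess step would differ):

```python
import json
from pathlib import Path

def pack_offline(lengths, bin_capacity, pack_fn, cache_path="packed_bins.json"):
    """Run a slow packer once and cache the bin assignments for training."""
    cache = Path(cache_path)
    if cache.exists():
        return json.loads(cache.read_text())
    bins = pack_fn(lengths, bin_capacity)  # e.g. the GPT-4-generated heuristic
    cache.write_text(json.dumps(bins))
    return bins
```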

If this sounds good, I'm happy to implement - just wanted to open a discussion first.

❓ Alternatives

We can try generating a faster heuristic, although we'd probably want to put anything short of near-instant into preprocessing anyway.



winglian commented 7 months ago

Thanks for this @dsesclei! Yes, I think offline packing would be a valuable option for users, especially if it provides an 8% improvement in cost/time (since this can all be done on CPU). One thing to consider: we had offline packing once before, but most users wanted the data shuffled between epochs. I think that can be solved by including the epoch index in the seed, though we would need some dataset/dataloader reloading mid-training to load the next epoch's packed dataset.
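
A rough sketch of the epoch-seeded idea (function and parameter names here are illustrative, and the dataloader reloading itself would still need trainer hooks):

```python
import random

def pack_for_epoch(lengths, bin_capacity, pack_fn, base_seed, epoch):
    """Repack with a shuffle derived from (base_seed, epoch), so each
    epoch gets a different but reproducible packing."""
    rng = random.Random(f"{base_seed}-{epoch}")  # fold the epoch index into the seed
    order = list(range(len(lengths)))
    rng.shuffle(order)
    # Pack the shuffled lengths, then map packed positions back to
    # the original sample indices.
    bins = pack_fn([lengths[i] for i in order], bin_capacity)
    return [[order[j] for j in group] for group in bins]
```

At each epoch boundary the trainer would rebuild its dataloader from that epoch's cached bins.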

Let me know how I can help. Thanks!