bclarkson-code / Tricycle

Autograd to GPT-2 completely from scratch

Memory efficient shuffle #86

Open bclarkson-code opened 1 month ago

bclarkson-code commented 1 month ago

The current method for shuffling datasets is np.random.shuffle. For Fineweb, this uses ~48 GB of RAM, which makes training GPT-2 impractical on most machines.

This should probably be replaced with a generator that yields non-repeating integers in a range. That would allow the entire dataset to be loaded lazily, dramatically reducing RAM requirements. A sketch of one possible approach is below.
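
As a minimal sketch (not the planned implementation), one way to get a non-repeating index stream in O(1) memory is an affine permutation `i -> (a*i + b) mod n` with `a` coprime to `n`. The shuffle quality is weaker than a true random permutation (a Feistel-network permutation would be better), and the `fineweb.bin` / `SEQUENCE_LENGTH` names in the usage comment are purely illustrative, not part of Tricycle:

```python
import math
import random
from typing import Iterator


def lazy_index_shuffle(n: int, seed: int | None = None) -> Iterator[int]:
    """Yield every integer in [0, n) exactly once in a pseudo-random order.

    Uses an affine map i -> (a * i + b) % n with gcd(a, n) == 1, so the
    sequence is a permutation of range(n) but needs O(1) memory instead
    of materialising and shuffling the whole dataset.
    """
    if n <= 1:
        yield from range(n)
        return
    rng = random.Random(seed)
    # pick a multiplier coprime to n so the map is a bijection on [0, n)
    a = rng.randrange(1, n)
    while math.gcd(a, n) != 1:
        a = rng.randrange(1, n)
    b = rng.randrange(n)  # random offset
    for i in range(n):
        yield (a * i + b) % n


# Illustrative usage with a memory-mapped token file (names are hypothetical):
# dataset = np.memmap("fineweb.bin", dtype=np.uint16, mode="r")
# n_samples = len(dataset) // SEQUENCE_LENGTH
# for idx in lazy_index_shuffle(n_samples, seed=42):
#     batch = dataset[idx * SEQUENCE_LENGTH : (idx + 1) * SEQUENCE_LENGTH]
```

Because only the current index is held in memory, the dataset itself can stay on disk (e.g. via np.memmap) and be read one sample at a time.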