bclarkson-code / Tricycle

Autograd to GPT-2 completely from scratch

Memory efficient shuffle #86

Open bclarkson-code opened 1 month ago

bclarkson-code commented 1 month ago

The current method for shuffling datasets is np.random.shuffle. For Fineweb, this uses ~48 GB of RAM, which makes training GPT-2 impractical on most machines.

This should probably be replaced with a generator that yields non-repeating integers in a range. That would allow the entire dataset to be loaded lazily, dramatically reducing RAM requirements. A sketch of one possible approach is below.
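
As a minimal sketch (not the planned implementation), one way to get a non-repeating index stream in O(1) memory is an affine permutation `i -> (a*i + b) mod n` with `a` coprime to `n`. The shuffle quality is weaker than a true random permutation (a Feistel-network permutation would be better), and the `fineweb.bin` / `SEQUENCE_LENGTH` names in the usage comment are purely illustrative, not part of Tricycle:

```python
import math
import random
from typing import Iterator


def lazy_index_shuffle(n: int, seed: int | None = None) -> Iterator[int]:
    """Yield every integer in [0, n) exactly once in a pseudo-random order.

    Uses an affine map i -> (a * i + b) % n with gcd(a, n) == 1, so the
    sequence is a permutation of range(n) but needs O(1) memory instead
    of materialising and shuffling the whole dataset.
    """
    if n <= 1:
        yield from range(n)
        return
    rng = random.Random(seed)
    # pick a multiplier coprime to n so the map is a bijection on [0, n)
    a = rng.randrange(1, n)
    while math.gcd(a, n) != 1:
        a = rng.randrange(1, n)
    b = rng.randrange(n)  # random offset
    for i in range(n):
        yield (a * i + b) % n


# Illustrative usage with a memory-mapped token file (names are hypothetical):
# dataset = np.memmap("fineweb.bin", dtype=np.uint16, mode="r")
# n_samples = len(dataset) // SEQUENCE_LENGTH
# for idx in lazy_index_shuffle(n_samples, seed=42):
#     batch = dataset[idx * SEQUENCE_LENGTH : (idx + 1) * SEQUENCE_LENGTH]
```

Because only the current index is held in memory, the dataset itself can stay on disk (e.g. via np.memmap) and be read one sample at a time.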