Muennighoff opened 1 year ago
- After selecting a small batch size (32), the same batch will be present in all workers, so there is no need to gather values. One thing we will need to add is splitting the batch (32) into 2 or 3 equal chunks and doing gradient accumulation, because 32 won't fit in one worker (see the sketch below).
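
A minimal sketch of that chunking plus gradient accumulation, assuming `model` is a causal LM whose forward returns a loss, and that `optimizer`, `batch`, and `num_chunks` are illustrative names (not from the actual patch):

```python
import torch

def train_step(model, optimizer, batch, num_chunks=2):
    # Split the full batch (e.g. 32) into 2 or 3 equal chunks so each
    # chunk fits on one worker.
    optimizer.zero_grad()
    for chunk in torch.chunk(batch, num_chunks, dim=0):
        loss = model(chunk, labels=chunk).loss
        # Scale so the accumulated gradient matches a full-batch step.
        (loss / num_chunks).backward()
    optimizer.step()
```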
Amazing work - do you want me to add the last point you mentioned?
You can add it if you have time, otherwise I will add it later 🤗
Done, but not tested. May have a bug 👻
Summary of the changes I added:
`transformers`: the forward pass of GPT2 returns the average loss over the entire batch, and the `loss.repeat(batch_size)` before calling `accelerate.gather` was just repeating that average, so we would have ended up selecting the same value. I changed this line in `gpt2_modeling` of `transformers` to add `reduction="none"` (see requirements). This returns the loss of each token, which is then averaged to get the loss per sequence.
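
A minimal sketch of that per-sequence loss, assuming GPT2-style shifted labels; `per_sequence_loss` is an illustrative name, not the actual change, and padding handling is omitted:

```python
import torch
import torch.nn.functional as F

def per_sequence_loss(logits, labels):
    # Shift so each token predicts the next position, as in GPT2's LM head.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # reduction="none" keeps one loss value per token instead of a
    # single batch-wide average.
    token_loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        reduction="none",
    )
    # Average over the sequence dimension to get one loss per sequence,
    # so gathering across workers yields distinct per-example values.
    return token_loss.view(shift_labels.size()).mean(dim=1)
```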