Muennighoff opened 1 year ago
- After selecting a small batch size (32), the same batch will be present in all workers, so there is no need to gather values. One thing we will need to add is splitting the batch (32) into 2 or 3 equal chunks and doing gradient accumulation, because 32 won't fit in one worker (see the sketch below).
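
A minimal sketch of that chunking plus gradient accumulation, assuming `model` is a causal LM whose forward returns a loss, and that `optimizer`, `batch`, and `num_chunks` are illustrative names (not from the actual patch):

```python
import torch

def train_step(model, optimizer, batch, num_chunks=2):
    # Split the full batch (e.g. 32) into 2 or 3 equal chunks so each
    # chunk fits on one worker.
    optimizer.zero_grad()
    for chunk in torch.chunk(batch, num_chunks, dim=0):
        loss = model(chunk, labels=chunk).loss
        # Scale so the accumulated gradient matches a full-batch step.
        (loss / num_chunks).backward()
    optimizer.step()
```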
Amazing work - do you want me to add the last point you mentioned?
You can add it if you have time, otherwise I will add it later 🤗
Done, but not tested. May have a bug 👻
Summary of the changes I added:
`transformers`: the forward pass of GPT2 returns the average loss over the entire batch, and the `loss.repeat(batch_size)` before calling `accelerate.gather` was just repeating that average, so we would have ended up selecting the same value. I changed this line in `gpt2_modeling` of `transformers` to add `reduction="none"` (see requirements). This returns the loss of each token, which is then averaged to get the loss per sequence.
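
A minimal sketch of that per-sequence loss, assuming GPT2-style shifted labels; `per_sequence_loss` is an illustrative name, not the actual change, and padding handling is omitted:

```python
import torch
import torch.nn.functional as F

def per_sequence_loss(logits, labels):
    # Shift so each token predicts the next position, as in GPT2's LM head.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # reduction="none" keeps one loss value per token instead of a
    # single batch-wide average.
    token_loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        reduction="none",
    )
    # Average over the sequence dimension to get one loss per sequence,
    # so gathering across workers yields distinct per-example values.
    return token_loss.view(shift_labels.size()).mean(dim=1)
```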