@Taytay Again, please accept my apologies for the late reply :)
> do you see any reason why this wouldn't be the case?
No, can't see any reason :)
> does your loss curve flatten after 16 hours no matter how many more tokens you throw at it?
No, it doesn't. I've tested training the model for longer and the loss continued to go down very nicely. It's true that Rouge-L on SNI (Super-NaturalInstructions) is the same after 20H and 24H, but the validation loss after 20H and 24H is much better than after 16H. It could be the case that Rouge-L on SNI caps at 41 for this model size, but I doubt it - SNI is still quite a small dataset, the proposed fine-tuning recipe is 2 epochs, and it's very easy to overfit.
My bet is that if you instead evaluate on the entire Flan Collection, which is way larger, you'd see 24H > 20H > 16H.
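(For anyone who wants to reproduce this kind of comparison, here's a rough sketch of scoring checkpoints by Rouge-L with the HF `evaluate` library - the checkpoint names and the `validation_examples` list are just placeholders, not my actual setup.)

```python
# Minimal sketch: compare fine-tuned checkpoints by Rouge-L on a held-out slice.
# Checkpoint paths and the example format are placeholders, not the real setup.
import evaluate
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

rouge = evaluate.load("rouge")

def rouge_l(checkpoint, examples, max_new_tokens=128):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    preds = []
    for ex in examples:  # each ex: {"input": str, "target": str}
        inputs = tokenizer(ex["input"], return_tensors="pt", truncation=True)
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        preds.append(tokenizer.decode(out[0], skip_special_tokens=True))
    scores = rouge.compute(predictions=preds,
                           references=[ex["target"] for ex in examples])
    return scores["rougeL"]

# e.g. score the 16H, 20H and 24H checkpoints on the same validation slice:
# for ckpt in ["ckpt-16h", "ckpt-20h", "ckpt-24h"]:
#     print(ckpt, rouge_l(ckpt, validation_examples))
```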
If you are interested in doing a TinyT5-like endeavour and would like some help, feel free to reach out - I'd be very interested :)
Oh cool - I'll give this some thought!
EleutherAI is cooking up a modern T5 right now as well, apparently.
https://huggingface.co/collections/EleutherAI/pile-t5-65a76a0d0022dd270b385a66
https://github.com/EleutherAI/improved-t5
They haven't said much about it other than confirming on Twitter they are working on it. I continue to be of the opinion that someone is going to get SOTA results with a T5 model and some modern techniques.
This is very interesting
I was first tipped off to it here: https://x.com/andersonbcdefg/status/1750570453532577883?s=20
Their confirmation in this thread: https://x.com/thetaytay/status/1753780417365250199?s=20
And you know, if we WERE to do a "TinyModernT5" effort, this paper also points to a way to reduce pre-training costs by 40-50% by changing the objective and learning rate schedule:
> In our experiments with encoder-decoder architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same downstream performance as standard SC pre-training, while enabling a 50% reduction in pre-training iterations and 40% reduction in total FLOPs. Alternatively, given the same amount of computing budget, we find that SpacTor results in significantly improved downstream benchmark performance.
From "SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection"
One more thing to watch: In this comment, @b-albar claims that he has a custom T5 implementation with his FlashAttention patch and a "few other tricks": https://github.com/huggingface/transformers/issues/26350#issuecomment-1864855179
He said that he is considering open sourcing it. 🤞 ❤️
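His code isn't public yet, so purely as a general illustration (not his implementation): the core of such a patch is routing T5's attention through PyTorch's fused `scaled_dot_product_attention`, with T5's additive relative-position bias passed in as the attention mask - which, depending on the backend, may fall back to the memory-efficient kernel rather than the true flash kernel:

```python
import torch
import torch.nn.functional as F

# General illustration only (not @b-albar's code): replacing the manual
# softmax(QK^T + bias)V in a T5 attention layer with PyTorch 2.x fused attention.
def t5_attention(q, k, v, position_bias, dropout_p=0.0):
    # q, k, v: (batch, heads, seq, head_dim); position_bias: (1, heads, seq, seq)
    return F.scaled_dot_product_attention(
        q, k, v,
        attn_mask=position_bias,
        dropout_p=dropout_p,
        scale=1.0,  # T5 does not scale by 1/sqrt(d); needs PyTorch >= 2.1
    )

b, h, s, d = 2, 8, 128, 64
q, k, v = (torch.randn(b, h, s, d) for _ in range(3))
bias = torch.randn(1, h, s, s)   # stand-in for T5's relative position bias
out = t5_attention(q, k, v, bias)
print(out.shape)                 # torch.Size([2, 8, 128, 64])
```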
After seeing the excitement around TinyLlama, it makes me want to pre-train some T5 models in a similar fashion. If you are able to achieve these equivalent results in a fraction of the time on C4, it seems like throwing some modern datasets and more compute at it should yield even better results... do you see any reason why this wouldn't be the case? Or does your loss curve flatten after 16 hours no matter how many more tokens you throw at it?