@Taytay Again, please accept my apologies for the late reply :)
> do you see any reason why this wouldn't be the case?
No, can't see any reason :)
> does your loss curve flatten after 16 hours no matter how many more tokens you throw at it?
No, it doesn't. I've tested training the model for longer and the loss continued to go down very nicely. It's true that Rouge-L on SNI (Super-NaturalInstructions) is the same after 20H and 24H, but the validation loss after 20H and 24H is much better than after 16H. It could be the case that Rouge-L on SNI caps at 41 for this model size, but I doubt it - SNI is still quite a small dataset, the proposed fine-tuning recipe is 2 epochs, and it's very easy to overfit.
My bet is that if you instead evaluate on the entire Flan Collection, which is way larger, you'd see 24H > 20H > 16H.
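(For anyone who wants to reproduce this kind of comparison, here's a rough sketch of scoring checkpoints by Rouge-L with the HF `evaluate` library - the checkpoint names and the `validation_examples` list are just placeholders, not my actual setup.)

```python
# Minimal sketch: compare fine-tuned checkpoints by Rouge-L on a held-out slice.
# Checkpoint paths and the example format are placeholders, not the real setup.
import evaluate
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

rouge = evaluate.load("rouge")

def rouge_l(checkpoint, examples, max_new_tokens=128):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    preds = []
    for ex in examples:  # each ex: {"input": str, "target": str}
        inputs = tokenizer(ex["input"], return_tensors="pt", truncation=True)
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        preds.append(tokenizer.decode(out[0], skip_special_tokens=True))
    scores = rouge.compute(predictions=preds,
                           references=[ex["target"] for ex in examples])
    return scores["rougeL"]

# e.g. score the 16H, 20H and 24H checkpoints on the same validation slice:
# for ckpt in ["ckpt-16h", "ckpt-20h", "ckpt-24h"]:
#     print(ckpt, rouge_l(ckpt, validation_examples))
```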
If you are interested in doing a TinyT5-like endeavour and would like some help, feel free to reach out - I'd be very interested :)
Oh cool - I'll give this some thought!
EleutherAI is cooking up a modern T5 right now as well, apparently.
https://huggingface.co/collections/EleutherAI/pile-t5-65a76a0d0022dd270b385a66
https://github.com/EleutherAI/improved-t5
They haven't said much about it other than confirming on Twitter they are working on it. I continue to be of the opinion that someone is going to get SOTA results with a T5 model and some modern techniques.
This is very interesting
I was first tipped off to it here: https://x.com/andersonbcdefg/status/1750570453532577883?s=20
Their confirmation in this thread: https://x.com/thetaytay/status/1753780417365250199?s=20
And you know, if we WERE to do a "TinyModernT5" effort, this paper also points to a way to reduce pre-training costs by 40-50% by changing the objective and learning rate schedule:
> In our experiments with encoder-decoder architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same downstream performance as standard SC pre-training, while enabling a 50% reduction in pre-training iterations and 40% reduction in total FLOPs. Alternatively, given the same amount of computing budget, we find that SpacTor results in significantly improved downstream benchmark performance.
From "SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection"
One more thing to watch: In this comment, @b-albar claims that he has a custom T5 implementation with his FlashAttention patch and a "few other tricks": https://github.com/huggingface/transformers/issues/26350#issuecomment-1864855179
He said that he is considering open sourcing it. 🤞 ❤️
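His code isn't public yet, so purely as a general illustration (not his implementation): the core of such a patch is routing T5's attention through PyTorch's fused `scaled_dot_product_attention`, with T5's additive relative-position bias passed in as the attention mask - which, depending on the backend, may fall back to the memory-efficient kernel rather than the true flash kernel:

```python
import torch
import torch.nn.functional as F

# General illustration only (not @b-albar's code): replacing the manual
# softmax(QK^T + bias)V in a T5 attention layer with PyTorch 2.x fused attention.
def t5_attention(q, k, v, position_bias, dropout_p=0.0):
    # q, k, v: (batch, heads, seq, head_dim); position_bias: (1, heads, seq, seq)
    return F.scaled_dot_product_attention(
        q, k, v,
        attn_mask=position_bias,
        dropout_p=dropout_p,
        scale=1.0,  # T5 does not scale by 1/sqrt(d); needs PyTorch >= 2.1
    )

b, h, s, d = 2, 8, 128, 64
q, k, v = (torch.randn(b, h, s, d) for _ in range(3))
bias = torch.randn(1, h, s, s)   # stand-in for T5's relative position bias
out = t5_attention(q, k, v, bias)
print(out.shape)                 # torch.Size([2, 8, 128, 64])
```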
After seeing the excitement around TinyLlama, it makes me want to pre-train some T5 models in a similar fashion. If you are able to achieve these equivalent results in a fraction of the time on C4, it seems like throwing some modern datasets and more compute at it should yield even better results... do you see any reason why this wouldn't be the case? Or does your loss curve flatten after 16 hours no matter how many more tokens you throw at it?