JonasGeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.
MIT License

Suggestion: support Maximal Update Parameterization #10

Closed: tfisher98 closed this issue 1 year ago

tfisher98 commented 1 year ago

I have been playing with this on my local hardware, which is somewhat smaller even than your paper's reference machines (the GPU is a GTX 1080, 8 GB). One thing that has become apparent is that investigating how the model scales (number of heads, depth, etc.) is difficult, because substantially different hyperparameters are required for effective training as the size is varied. The paper by Yang et al., "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer" (https://arxiv.org/abs/2203.03466), addresses exactly this issue: it proposes modifications to how hyperparameters and initializations are specified so that good hyperparameter choices become much more invariant across model size. I suggest that incorporating their parameterization (Maximal Update Parameterization, muP) would be a very useful change. Among other things, it would allow rapid initial exploration with very small crammed models, followed by much easier scaling up to test ideas in the larger model context.
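A minimal sketch of what an integration could look like, assuming the `mup` package (https://github.com/microsoft/mup) is installed. The model below is a toy stand-in rather than the cramming architecture, and the specific widths and hyperparameter values are illustrative only:

```python
# Sketch: wiring muP into a small transformer-style LM with the `mup` package.
# Assumes `pip install mup`; TinyLM and its sizes are hypothetical placeholders.
import torch.nn as nn
from mup import MuReadout, MuAdamW, set_base_shapes


class TinyLM(nn.Module):
    def __init__(self, vocab_size=32768, width=256, depth=4, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, width)
        layer = nn.TransformerEncoderLayer(
            d_model=width, nhead=heads, dim_feedforward=4 * width, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # muP replaces the output/readout layer with a width-aware version.
        self.lm_head = MuReadout(width, vocab_size)

    def forward(self, input_ids):
        return self.lm_head(self.encoder(self.embed(input_ids)))


# A "base" model at a small reference width and a "delta" model at a different
# width let mup infer which dimensions scale with width; shapes are then set on
# the target model, rescaling its parameters according to muP.
base = TinyLM(width=256)
delta = TinyLM(width=512)
model = TinyLM(width=1024)
set_base_shapes(model, base, delta=delta)

# With muP, hyperparameters tuned on the small base model should transfer to
# the wide model; MuAdamW applies the width-dependent learning-rate scaling.
optimizer = MuAdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```

Note that a full muP integration also changes the attention-logit scaling from 1/sqrt(d) to 1/d, which the stock `nn.TransformerEncoderLayer` used above does not expose; in this codebase that would mean touching the attention implementation directly.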

JonasGeiping commented 1 year ago

I'm aware of the muP line of work (and think it's great!). When I started this project, though, it was not clear how effective muP would be at the small scale of the cramming regime. The most convincing muP results, to me, are those that appear when scaling a decently sized model up to a very large one. Yang et al. do also test a few smaller models (which could be considered "cramming-sized"), so this may be a wrong impression on my part. Note also that muP covers only a subset of hyperparameters (mainly the learning rate, initialization scale, and output multipliers; regularization hyperparameters such as dropout and weight decay are not expected to transfer).

I am happy to merge a pull request with a functional muP implementation.

JonasGeiping commented 1 year ago

Feel free to re-open this if you want to discuss more!