astramind-ai / Mixture-of-depths

Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"

recalculate capacity question #4

Closed · starsholic closed this issue 4 months ago

starsholic commented 6 months ago

Hi! Thanks for your reproduction of MoD. I'm wondering why the capacity is recalculated here: https://github.com/astramind-ai/Mixture-of-depths/blob/103aa4b6c211346599cc8b853cc3108bf9cb72d0/MoD/MoD.py#L45
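For context, here is a minimal sketch of how MoD-style capacity typically gates tokens; the actual code at the linked line may differ, and the function and variable names here are illustrative, not the repo's:

```python
import torch

def route_tokens(router_logits: torch.Tensor, capacity: float) -> torch.Tensor:
    """Pick the top-k tokens per sequence, with k = capacity * seq_len.

    router_logits: (batch, seq_len) per-token router scores
    capacity:      fraction of tokens that pass through the block
    Returns the indices of the tokens the block should process.
    """
    seq_len = router_logits.shape[1]
    k = max(1, int(capacity * seq_len))  # number of tokens kept per sequence
    # Indices of the k highest-scoring tokens in each sequence
    return torch.topk(router_logits, k, dim=-1).indices
```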

taehyunzzz commented 5 months ago

I think they are trying to train the MoD model by smoothly converting the no-skipping version into an MoD model over the course of training.

mlinmg commented 4 months ago

In the paper they slowly reduce the capacity so as not to disrupt the original model too much. You can think of it as a sort of warm-up; empirically, we also saw that it gave much more stability.
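As a rough sketch of that warm-up (the linear schedule, names, and parameters here are assumptions for illustration, not the repo's exact code): the effective capacity starts at 1.0, meaning no tokens are skipped, and is annealed down to the target over a fixed number of steps, which is why it has to be recalculated each iteration:

```python
def capacity_at_step(step: int, warmup_steps: int, target_capacity: float) -> float:
    """Linearly anneal capacity from 1.0 (process every token) to target_capacity.

    Recomputing this each training step is what lets the model start as a
    plain transformer and gradually turn into an MoD model.
    Note: warmup_steps and the linear shape are hypothetical choices here.
    """
    if step >= warmup_steps:
        return target_capacity
    frac = step / warmup_steps
    return 1.0 - frac * (1.0 - target_capacity)

# Example: with warmup_steps=10_000 and target_capacity=0.125,
# capacity_at_step(0) == 1.0, capacity_at_step(5_000) == 0.5625,
# and capacity_at_step(10_000) == 0.125.
```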