Closed: starsholic closed this 4 months ago
Hi! Thanks for your reproduction of MoD. I wonder why the capacity is recalculated here? https://github.com/astramind-ai/Mixture-of-depths/blob/103aa4b6c211346599cc8b853cc3108bf9cb72d0/MoD/MoD.py#L45

I think they are training the MoD model by smoothly converting the no-skipping version into a MoD model over the course of training.

As in the paper, we slowly reduce the capacity so as not to disrupt the original model too much. You can think of it as a sort of warm-up; empirically, we saw it gave much more stability too.
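To make that concrete, here is a minimal sketch of what a capacity warm-up of this kind could look like (the names `capacity_at_step` and `select_tokens`, and the linear schedule, are my own illustration, not the repo's actual code; the schedule in MoD.py may differ):

```python
import torch

def capacity_at_step(step: int, warmup_steps: int, final_capacity: float) -> float:
    """Linearly anneal capacity from 1.0 (process every token) down to final_capacity."""
    if step >= warmup_steps:
        return final_capacity
    frac = step / warmup_steps
    return 1.0 + frac * (final_capacity - 1.0)

def select_tokens(router_logits: torch.Tensor, capacity: float) -> torch.Tensor:
    """Route the top-k tokens per sequence through the block, with k = capacity * seq_len.

    k is recomputed on every call because both the sequence length and the
    annealed capacity can change between forward passes, which is presumably
    why the capacity is recalculated rather than fixed once.
    """
    _, seq_len = router_logits.shape
    k = max(1, int(capacity * seq_len))
    return torch.topk(router_logits, k, dim=-1).indices  # (batch, k)
```

Early in training, `capacity_at_step` returns values near 1.0, so nearly all tokens pass through every block and the model behaves like the original dense network; as training progresses, the capacity decays toward the target fraction and more tokens are skipped.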