astramind-ai / Mixture-of-depths

Unofficial implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"

Gradient propagation #10

Open NickNickGo opened 2 months ago

NickNickGo commented 2 months ago

Is the MoD implementation end-to-end trainable? There are several ops like torch.topk and weights > threshold.unsqueeze(-1) which are fine at inference, but do they allow gradients to propagate during training?
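For context on the first op: torch.topk is differentiable with respect to the returned values (the integer indices carry no gradient), so gradients reach the selected entries of the input. A quick check:

```python
import torch

# torch.topk: gradients flow through the returned values,
# but the integer indices themselves carry no gradient.
scores = torch.randn(8, requires_grad=True)
values, indices = torch.topk(scores, k=3)
values.sum().backward()
print(scores.grad)  # nonzero only at the top-k positions
```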

xinlong-yang commented 1 month ago

Ops like torch.topk are only used to build the selected_mask. The parameters of the MoD router are still trained and updated, because the router's scores are multiplied into the block's output, so gradients flow back to the router through that multiplication rather than through the hard selection.
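For anyone landing here later, a minimal sketch of that mechanism (names like `MoDRouter`, `router`, and `block` are illustrative, not taken from this repo): the hard top-k selection is non-differentiable, but because the soft router weights scale the block output, the router parameters still receive gradients.

```python
import torch
import torch.nn as nn

class MoDRouter(nn.Module):
    """Illustrative MoD-style routing: hard top-k selection plus
    differentiable scaling of the block output by the router scores."""

    def __init__(self, dim: int, capacity: int):
        super().__init__()
        self.router = nn.Linear(dim, 1)   # scalar score per token
        self.capacity = capacity          # k tokens processed per sequence

    def forward(self, x: torch.Tensor, block: nn.Module) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        weights = self.router(x).squeeze(-1)               # (batch, seq_len)
        # Hard selection: no gradient flows through the topk indices.
        _, idx = torch.topk(weights, self.capacity, dim=-1)
        mask = torch.zeros_like(weights, dtype=torch.bool)
        mask.scatter_(-1, idx, True)
        # For simplicity the block runs on all tokens here; a real
        # implementation gathers only the selected tokens first.
        out = block(x)
        # Multiplying by `weights` is what lets gradients reach the
        # router parameters; the boolean mask alone would block them.
        scaled = out * weights.unsqueeze(-1)
        # Selected tokens get the weighted block output (residual);
        # unselected tokens pass through unchanged.
        return torch.where(mask.unsqueeze(-1), x + scaled, x)

# Usage: gradients reach mod.router despite the hard topk.
mod = MoDRouter(dim=64, capacity=4)
block = nn.Linear(64, 64)  # stand-in for a transformer block
y = mod(torch.randn(2, 16, 64), block)
y.sum().backward()
print(mod.router.weight.grad is not None)  # True
```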