I'm a newcomer to mixture-of-depths, and I do think this is really innovative work for the MLLM community.
However, some questions came up while I was reading your paper. Could you explain a few of the terms you use?
Routing ratio: there seems to be no description of it in the method section. What is the difference between the routing ratio and the skip ratio?
In Table 1, I can't understand why the skip ratio of the ARank-based deployment is even higher than that of all layers. As shown in Figure 2, the ARank-based deployment only replaces some Dense Transformer Layers with MoD Transformer Layers and keeps the rest unchanged, whereas the all-layers setting replaces every Dense Transformer Layer with a MoD Transformer Layer. So I'm really confused about why the ARank-based deployment has a higher skip ratio than all layers.
Hi, thanks for your attention.
Here are my answers to your questions:
The routing ratio and the skip ratio refer to the same thing: both describe how many tokens do not take part in the self-attention operations in the layers of the LLM.
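To make this concrete, here is a minimal sketch (not the paper's actual code) of a MoD-style layer with a top-k router; the names `router_weight`, `block`, and `capacity_ratio` are assumptions for illustration. The skip ratio is simply the fraction of tokens that the router does not select, which therefore bypass the layer's attention/FFN computation:

```python
import torch

def mod_layer(tokens, router_weight, block, capacity_ratio=0.5):
    """tokens: (seq_len, dim); router_weight: (dim,); block: callable mapping (k, dim) -> (k, dim)."""
    seq_len, _ = tokens.shape
    scores = tokens @ router_weight              # one routing score per token, shape (seq_len,)
    k = max(1, int(seq_len * capacity_ratio))    # number of tokens processed by the block
    keep = scores.topk(k).indices                # tokens selected to attend self-attention
    out = tokens.clone()                         # skipped tokens pass through on the residual stream
    out[keep] = block(tokens[keep])              # only selected tokens run attention/FFN
    skip_ratio = 1.0 - k / seq_len               # fraction of tokens that skip this layer
    return out, skip_ratio
```

During training, the capacity (and hence the routing/skip ratio) is set by the chosen `capacity_ratio`; the point of the next answer is that at inference time no such ratio is imposed.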
The reason the skip ratio of the ARank-based deployment is even higher than that of all layers is that the router is optimized through training. If we simply apply routing in all layers, the router learns nothing useful, which in turn gives a low skip ratio at inference time, since at inference the routing ratio is not manually set.
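As a hedged illustration of that inference-time behavior: if the learned router scores are, say, thresholded at inference (an assumed routing rule, not necessarily the paper's exact mechanism), then the observed skip ratio is whatever the trained router produces, not a preset value:

```python
import torch

@torch.no_grad()
def inference_skip_ratio(tokens, router_weight, threshold=0.0):
    scores = tokens @ router_weight                  # learned per-token routing scores
    processed = scores > threshold                   # the router, not a manual ratio, decides
    return 1.0 - processed.float().mean().item()     # observed skip ratio at inference time
```

A poorly trained router (as in the all-layers case) tends to mark most tokens as needed, which is why its measured skip ratio comes out lower.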