Yaxin9Luo / Gamma-MOD

Official Repo of γ-MOD: Mixture-of-Depth Adaptation for Multimodal Large Language Models
https://yaxin9luo.github.io/gamma-mod-webpage/

Questions about some terminology in the paper #4

Open White1973 opened 2 weeks ago

White1973 commented 2 weeks ago

I'm new to mixture-of-depths, and I think this is really innovative work for the MLLM community. However, I ran into some questions while reading your paper; could you explain some of the terminology you use?

  1. Routing ratio: there seems to be no description of it in your method section. What is the difference between the routing ratio and the skip ratio?
  2. In Table 1, I can't understand why the skip ratio of the ARank-based deployment is even higher than that of "all layers". As shown in Figure 2, the ARank-based deployment replaces only some dense Transformer layers with MoD Transformer layers and keeps the remaining dense layers unchanged, whereas "all layers" replaces every dense Transformer layer with a MoD Transformer layer. So I'm confused about why the ARank-based deployment has a higher skip ratio than all layers.
Yaxin9Luo commented 2 weeks ago

Hi, thanks for your interest. Here are my answers to your questions:

  1. Routing ratio and skip ratio refer to the same thing: both describe how many tokens do not go through the self-attention operations in the layers of the LLM.
  2. The skip ratio of the ARank-based deployment is higher than that of "all layers" because the router is optimized through training. If we simply insert routers into all layers, the router learns nothing useful, which in turn yields a low skip ratio at inference time, since at inference the routing ratio is not set manually (see the sketch below).
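To make point 2 concrete, here is a minimal sketch of a MoD-style token router. It is an illustration under assumptions, not the γ-MOD implementation: the class name `ModRouter`, the linear scorer, the fixed `train_keep_ratio`, and the 0.5 sigmoid gate at inference are all hypothetical choices made for this example.

```python
# Hypothetical MoD-style router sketch (not the γ-MOD code).
import torch
import torch.nn as nn


class ModRouter(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        # One score per token deciding whether it passes through the block or skips it.
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor, train_keep_ratio: float = 0.5, training: bool = True):
        # x: (batch, seq_len, hidden_dim)
        scores = torch.sigmoid(self.scorer(x)).squeeze(-1)  # (batch, seq_len)
        if training:
            # Training: the routing ratio is set manually, e.g. keep the top 50% of tokens
            # and route the rest around the layer.
            k = max(1, int(train_keep_ratio * x.size(1)))
            keep_idx = scores.topk(k, dim=1).indices
            keep_mask = torch.zeros_like(scores, dtype=torch.bool).scatter_(1, keep_idx, True)
        else:
            # Inference: no ratio is imposed; a token is kept only if the trained router
            # scores it above the gate, so the skip ratio is data-dependent.
            keep_mask = scores > 0.5
        skip_ratio = 1.0 - keep_mask.float().mean().item()
        return keep_mask, skip_ratio
```

Under this reading, a router that has actually learned which tokens are redundant can push most of them below the gate at inference, which is why a learned, selectively deployed router can end up with a higher measured skip ratio than naively placing routers in all layers.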