Open Ethan-Chen-plus opened 1 week ago
Thank you for your feedback. Regarding your concern, we chose to show only Layers 1 and 28 mainly due to space constraints and to keep the paper clear; these two layers represent the early and late stages of the model. I can assure you that, thanks to our shared router and masked learning design, the routed image tokens remain consistent across layers, so the most important parts of the image are preserved throughout.
As you pointed out, the router optimization process is not yet optimal, which presents an exciting research opportunity for future work: both the learning objective designed for the router and the router's architecture itself could be improved. These limitations may introduce some randomness in how text tokens are handled, particularly during inference for next-token generation. Currently, the router decides whether to skip each token by comparing its score against a threshold, but the probability distribution of the router's outputs is still suboptimal.
We hope this clarifies our approach and inspires further contributions from the community toward efficient MLLM design. Ideally, with incremental improvements, we envision a router capable of fully skipping all redundant tokens in the attention matrix while preserving all high-attention-score tokens.
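To make the "shared router" point concrete, here is a minimal sketch (not the official γ-MoD code; `SharedRouter`, `MoDBlock`, and the parameter names are our own, assumed for illustration) of one router instance being reused by every mixture-of-depths block, which is why the skip decisions for image tokens stay largely consistent across depth:

```python
import torch
import torch.nn as nn

class SharedRouter(nn.Module):
    """One linear head that scores every token; reused by all MoD layers."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 2)  # logits for [skip, keep]

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) -> (batch, seq_len, 2)
        return self.proj(hidden_states)

class MoDBlock(nn.Module):
    """Wraps an ordinary transformer block with token routing.
    The router is injected, not owned, so every block shares the same weights."""
    def __init__(self, block: nn.Module, router: SharedRouter, threshold: float = 0.5):
        super().__init__()
        self.block = block
        self.router = router          # same instance in every layer
        self.threshold = threshold

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        probs = self.router(hidden_states).softmax(dim=-1)   # (B, T, 2)
        keep = probs[..., 1] >= self.threshold                # (B, T) bool mask
        # For clarity we run the block on all tokens and select afterwards;
        # an efficient implementation would gather only the kept tokens.
        block_out = self.block(hidden_states)
        return torch.where(keep.unsqueeze(-1), block_out, hidden_states)

# Toy stack: a single router shared by every layer.
hidden = 64
router = SharedRouter(hidden)
layers = nn.ModuleList(
    MoDBlock(nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True), router)
    for _ in range(4)
)
x = torch.randn(1, 16, hidden)
for layer in layers:
    x = layer(x)
```

Because every `MoDBlock` consults the same `router`, a token that scores below the threshold at one layer tends to score below it at the others as well.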
Thank you for your detailed explanation. However, I still have a question: since routing of the image tokens stays largely unchanged across layers while routing of the text tokens varies, is this behavior reflected in the code? Thank you very much.
Sorry for the late reply. During inference, tokens are routed based on a threshold: any token whose probability (the router's output passed through a softmax) falls below that threshold is routed past the layer. It turns out that most of the routed image tokens remain consistent across layers, because we use the same router for every layer of the MLLM.
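As a concrete illustration of the reply above (a hedged sketch, not the repository's actual code; `route_by_threshold`, `router_logits`, and `threshold` are assumed names), the inference-time decision can be written as: softmax the router's per-token output and route (skip) every token whose keep-probability falls below the threshold.

```python
import torch

def route_by_threshold(router_logits: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Return a boolean mask of tokens to route (skip) at inference time.

    router_logits: (seq_len, 2) raw router outputs per token, where index 1
    is assumed to be the "keep" class. Tokens whose keep-probability after
    softmax is below `threshold` are routed past the layer.
    """
    keep_prob = router_logits.softmax(dim=-1)[:, 1]
    return keep_prob < threshold   # True = skip this token

# Because the same router scores the tokens at every layer, the set of
# skipped image tokens stays largely consistent from layer to layer.
logits = torch.randn(8, 2)
skip_mask = route_by_threshold(logits, threshold=0.6)
print(skip_mask)
```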
Thank you for your insightful work on γ-MoD. I have a question regarding Figure 4 in your paper.
Could you please explain why only the visualizations of Layer 1 and Layer 28 are shown? Additionally, I noticed that within the layers from 1 to 28, the token routing for images remains largely unchanged, while there is significant variation in the token routing for textual data. Could you kindly elaborate on the reasoning behind this difference in routing behavior between image and text tokens?
Thank you for your time and consideration. I look forward to your reply!