XueFuzhao / OpenMoE

A family of open-sourced Mixture-of-Experts (MoE) Large Language Models

tokens routing #8

Open wkml opened 8 months ago

wkml commented 8 months ago

Thanks for your work! It is very valuable! I would like to know how you reached your conclusion about token routing. Since the input is affected by attention and RoPE, it does not seem logical that each token should have a fixed routing. How should I reproduce your results for this part?

ilyalasy commented 8 months ago

Hey! Not a paper author here, but I'm currently working on reproducing the results of the OpenMoE paper, specifically on token routing. Take a look: https://github.com/Misterion777/moe-routing/blob/main/notebooks/routing_eda.ipynb Would appreciate any collaboration!

I would also be grateful for a review from paper author @XueFuzhao on whether what I'm doing makes sense.

wkml commented 8 months ago

Hello, I have received your email.

XueFuzhao commented 8 months ago

Thank you for your interest!! May I know whether your OpenMoE can generate readable sentences?

XueFuzhao commented 8 months ago

My analysis code is a bit dirty, but the core logic is in this file: https://github.com/XueFuzhao/OpenMoE/blob/main/analysis/colossalai_replace/layer.py If you compare ColossalAI's SparseMLP class with mine, you will see the difference.

I went through your code very quickly (sorry, I'm totally overwhelmed these days). My two concerns:

  1. The context-independent specialization is not that clear. I am not sure whether the output sentences are normal; if not, the model may have some bugs, e.g. the checkpoint loading may not be correct. Please check the model output first.
  2. In your hook code, it seems that you are using the argmax value directly. However, the routing decision depends on both the argmax value and the expert capacity, so a more reliable implementation is to check the actual routing decision, as in this line: https://github.com/XueFuzhao/OpenMoE/blob/ad4c65cc5828721835c4b064504e16e81444e5d2/analysis/colossalai_replace/layer.py#L189

Thanks again for your interest! Looking forward to your results on other MoE models like Mixtral and DeepSeek-MoE. That would be very interesting.
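
To make point 2 concrete, here is a minimal, self-contained toy sketch (not code from the repo; the function name and shapes are my own) of top-1 routing with a per-expert capacity limit. A hook that only records the router argmax will disagree with the actual routing decision whenever a token is dropped because its expert is already full.

```python
import torch

def top1_routing_with_capacity(router_logits: torch.Tensor, capacity: int):
    """Toy top-1 routing. Returns (argmax_expert, routed_expert) per token;
    routed_expert is -1 for tokens dropped because their expert is full."""
    num_tokens, num_experts = router_logits.shape
    argmax_expert = router_logits.argmax(dim=-1)        # what a naive hook would log
    routed_expert = torch.full((num_tokens,), -1, dtype=torch.long)
    load = torch.zeros(num_experts, dtype=torch.long)   # tokens already assigned per expert
    for tok in range(num_tokens):                       # tokens compete in sequence order
        e = int(argmax_expert[tok])
        if load[e] < capacity:
            routed_expert[tok] = e                      # actually dispatched to expert e
            load[e] += 1
        # else: token is dropped, so the real routing decision differs from the argmax
    return argmax_expert, routed_expert

# 8 tokens, 2 experts, capacity 3: at most 6 tokens can be routed, so >= 2 are dropped.
torch.manual_seed(0)
argmax_e, routed_e = top1_routing_with_capacity(torch.randn(8, 2), capacity=3)
print("argmax :", argmax_e.tolist())
print("routed :", routed_e.tolist())
```

With 8 tokens, 2 experts, and capacity 3, at least two entries of the two lists must differ, which is exactly the gap between logging the argmax and logging the routing decision.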

wkml commented 8 months ago

Thanks for your code! I have run into some tricky things recently, so I have had less energy to push this research forward. I will study your code carefully, and thank you for your efforts! Thank you all! @Misterion777 @XueFuzhao

ilyalasy commented 8 months ago

I changed the hook so that it now takes expert capacity into consideration. Also, the ColossalAI checkpoint is indeed buggy and doesn't output valid text, so I am using OrionZheng/... instead. Now the plot looks much more similar to what you reported in the paper. Thank you very much for your help!
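
For anyone adapting this, below is a rough, hypothetical sketch of what such a capacity-aware hook can look like in PyTorch. The attribute name `last_dispatch_mask` and the `"router"` module-name filter are placeholders rather than the real OpenMoE/ColossalAI names; the point is only to read the post-capacity dispatch decision (as in the layer.py line linked above) instead of re-computing an argmax over the router logits.

```python
import torch
from collections import defaultdict

routing_log = defaultdict(list)  # layer name -> list of per-token expert ids

def make_routing_hook(layer_name):
    def hook(module, inputs, output):
        # Assumption: the router/MoE module exposes the capacity-aware dispatch
        # mask of shape (num_tokens, num_experts) after its forward pass.
        # `last_dispatch_mask` is a hypothetical attribute; substitute whatever
        # the actual SparseMLP implementation stores or returns.
        dispatch_mask = getattr(module, "last_dispatch_mask", None)
        if dispatch_mask is None:
            return
        expert_ids = dispatch_mask.long().argmax(dim=-1)        # routed expert per token
        expert_ids[~dispatch_mask.bool().any(dim=-1)] = -1      # tokens dropped by capacity
        routing_log[layer_name].append(expert_ids.detach().cpu())
    return hook

# Usage sketch (module names are placeholders, not the real attribute paths):
# for name, module in model.named_modules():
#     if "router" in name:
#         module.register_forward_hook(make_routing_hook(name))
```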