cwfparsonson / AMLS_II

This code is for use within the UCL Electronic Engineering AMLS II module (ELEC0135).

In Lab1_MoE, how can the model be trained end-to-end? #4

Open c8241998 opened 3 years ago

c8241998 commented 3 years ago

In Lab1_MoE, I notice that there is a switch structure in the Gates of the MoE network, so how can this be trained end-to-end? Is this structure differentiable? Also, could you give more details about how the overall network is connected?

cwfparsonson commented 3 years ago

Hi,

Thanks for raising this issue.

1) The MoE architecture we've built is still composed of neural networks, so yes, it is still differentiable. Of course, as you increase the complexity of the model by adding more layers and networks, it becomes increasingly difficult to differentiate & track how each parameter in the model impacts the final loss. Bear in mind that by the time we implement the switch, we have already trained & frozen the parameters of the baseline, the experts, and the expert binary gate classifiers, so we are only training the expert sub-gates. Even though we are switching which expert is chosen, each sub-gate being trained is always paired with the same expert, since the chosen expert determines which expert sub-gate is used for a given instance.
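To make this staged training concrete, here is a minimal, hedged sketch (not the notebook's code; the component names and a PyTorch-style implementation are assumptions) of how the pre-trained parts can be frozen so that only the expert sub-gates receive gradient updates:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the notebook's pre-trained components.
baseline = nn.Linear(784, 10)                                    # baseline classifier
experts = nn.ModuleList([nn.Linear(784, 10) for _ in range(2)])  # expert classifiers
expert_gate = nn.Linear(784, 2)                                  # binary gate picking an expert
sub_gates = nn.ModuleList([nn.Linear(784, 10 * 2) for _ in range(2)])  # one sub-gate per expert

# Freeze everything that was trained in the earlier stages...
for net in (baseline, *experts, expert_gate):
    for p in net.parameters():
        p.requires_grad = False

# ...so the optimiser only ever updates the sub-gate parameters.
optimiser = torch.optim.Adam(sub_gates.parameters(), lr=1e-3)
```

The frozen networks still take part in the forward pass; they simply no longer have their own weights updated.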

2) I have added some additional markdown cells towards the end of the MoE notebook below the sub-heading 'Connecting the overall networks to form the mixture of experts model' to better explain what is happening and how each part of the model is being connected. Can you please have a look at these new comments and let me know if they are helpful?

In the subGate() definition, there was previously a numExperts argument being taken which I think was incorrect and confusing. This part of the code has been updated so that the output of a sub-gate is always an (orig_classes, 2) tensor (correct) rather than an (orig_classes, numExperts) tensor (incorrect).
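For intuition, here is a hedged sketch (again not the notebook's code; names, shapes, and the interpretation of the two output channels are assumptions based on the description above) of a sub-gate whose per-instance output has shape (orig_classes, 2):

```python
import torch
import torch.nn as nn

class SubGate(nn.Module):
    """Hypothetical sub-gate: one pair of weights per original class."""
    def __init__(self, in_features: int, orig_classes: int):
        super().__init__()
        self.orig_classes = orig_classes
        self.fc = nn.Linear(in_features, orig_classes * 2)

    def forward(self, x):
        logits = self.fc(x).view(-1, self.orig_classes, 2)
        # Softmax over the final dimension: per class, how much weight to give
        # one prediction source vs the other (e.g. baseline vs chosen expert).
        return torch.softmax(logits, dim=-1)

gate = SubGate(in_features=784, orig_classes=10)
weights = gate(torch.randn(4, 784))
print(weights.shape)  # torch.Size([4, 10, 2]) -> (batch, orig_classes, 2)
```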

As a useful exercise, I have also added two (commented-out) sections of code beneath '# DEBUG' comments which, if uncommented, print the dimensions of the tensors being analysed so you can see for yourself what is happening. Uncommenting these lines will raise an Exception, but they will show you the tensors you're dealing with and so might help your understanding.
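As an illustration of the kind of shape inspection being described (the variable names and shapes here are hypothetical, not the notebook's):

```python
import torch

# Hypothetical tensors with the kinds of shapes discussed above.
expert_out = torch.randn(4, 10)       # (batch, orig_classes)
sub_gate_out = torch.randn(4, 10, 2)  # (batch, orig_classes, 2)

# The '# DEBUG' sections print shapes like these and then raise an Exception,
# so execution stops and you can read the output.
print('expert output shape:', tuple(expert_out.shape))
print('sub-gate output shape:', tuple(sub_gate_out.shape))
```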

Best wishes, Chris

c8241998 commented 3 years ago

Your explanation helps me a lot! Thanks for your kind reply!

I also notice that there is a similar structure called MMoE, which was applied to recommendation systems in 2018. Is that the first paper proposing the MoE structure? If not, could you please point me to the original paper? Thank you!

cwfparsonson commented 3 years ago

Hi,

I believe the first MoE model was proposed a long time ago in 1991 by Jacobs et al. (https://ieeexplore.ieee.org/document/6797059). Things have moved on since then. The model in the lab is similar to that of Shazeer et al. (https://arxiv.org/pdf/1701.06538.pdf) and Abbas et al. (https://discovery.ucl.ac.uk/id/eprint/10107986/), although not identical.

Best wishes, Chris

c8241998 commented 3 years ago

Thank you very much!