Closed sove45 closed 6 months ago
fortunately,we didn't met the 0-token matrix during the train phase
This part of the code is inherited from this MMoE work and is fine for practical use and in theory. It may take a few more readings and then running test cases outputting intermediate results to understand this code
This part of the code is inherited from this MMoE work and is fine for practical use and in theory. It may take a few more readings and then running test cases outputting intermediate results to understand this code
"I'm sorry, this is my mistake. I noticed that the appearance of the NaN matrix was due to gradient explosion during the training, not because of differentiation leading to a None matrix."
excuse me , I noticed the code in MMOE.py in line 310 expert_outputs = [self.expertsi for i in range(self.num_experts)] this means we will compute 4 experts even if the expert_inputs contains zero-token matrix? (Theoretically, it is possible for us to get a 0-token matrix in expert_inputs)