@tankche1 Hi, I also noticed that the P(T_i, E_j) computation uses probs. I do not understand why probs is used to compute self.MI_task_gate.
I think probs.sum(0) represents how often each expert is selected by the task, so it could already serve as an approximation of P(T_i, E_j). Why do we still need to divide by tot? My guess is that this division helps training.
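To make the question concrete, here is a minimal sketch of how I read the division, not the repository's actual code. It assumes MI_task_gate is a tasks x experts matrix of accumulated probs.sum(0) values and that tot is the sum of all its entries. The helper names normalize_joint and mutual_information and the numbers in acc are hypothetical.

```python
import math

def normalize_joint(acc):
    """Divide accumulated gate mass by its total so the entries sum to 1."""
    tot = sum(sum(row) for row in acc)  # assumed meaning of `tot`
    return [[v / tot for v in row] for row in acc]

def mutual_information(joint):
    """sum_ij P(T_i, E_j) * log(P(T_i, E_j) / (P(T_i) * P(E_j)))."""
    p_task = [sum(row) for row in joint]          # marginal over experts
    p_expert = [sum(col) for col in zip(*joint)]  # marginal over tasks
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (p_task[i] * p_expert[j]))
    return mi

# Hypothetical accumulated probs.sum(0) for 2 tasks x 2 experts.
acc = [[8.0, 2.0], [2.0, 8.0]]
joint = normalize_joint(acc)
```

Under these assumptions, the raw sums in acc are not a probability distribution, so feeding them into the formula directly gives a value that depends on how many samples were accumulated. After dividing by tot the entries sum to 1 and the result no longer depends on that scale. If that reading is right, the division is what makes the matrix a valid estimate of P(T_i, E_j) and not only a training trick.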