UMass-Foundation-Model / Mod-Squad


About mutual information computation #9

Open zhuyuedlut opened 2 months ago

zhuyuedlut commented 2 months ago

@tankche1 Hi, I also noticed that the computation of P(T_i, E_j) uses probs, and I do not understand how probs is used to compute self.MI_task_gate. I think probs.sum(0) is the frequency with which each expert is selected by the task, so it could already serve as an approximation of P(T_i, E_j). Why do we still need to divide by tot? My guess is that this division helps training?


tankche1 commented 2 months ago

That is just following the definition of the loss. Since tot is constant, I suppose dividing by it doesn't have much effect on training.
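
To make the role of tot concrete, here is a minimal sketch in plain Python, not the repository's code. It assumes a hypothetical 2-task by 3-expert table `acc` of accumulated gate probabilities, and assumes tot is the sum of all its entries; the names `acc`, `normalize`, and `mi_formula` are made up for illustration. Dividing by tot turns the table into a joint distribution that sums to 1, which is what the mutual information definition expects. Applying the same formula to the raw sums gives tot * (MI - log(tot)), so with a constant tot the two differ only by a constant scale and offset.

```python
import math


def mi_formula(table):
    """Evaluate sum_ij t_ij * log(t_ij / (row_i * col_j)) on a nested list.

    When `table` sums to 1 this is the mutual information in nats.
    """
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = 0.0
    for i, row in enumerate(table):
        for j, value in enumerate(row):
            if value > 0:  # treat 0 * log(0) as 0
                total += value * math.log(value / (row_sums[i] * col_sums[j]))
    return total


def normalize(table):
    """Divide every entry by the grand total so the table sums to 1."""
    tot = sum(sum(row) for row in table)
    return [[value / tot for value in row] for row in table], tot


# Hypothetical accumulated gate probabilities (rows: tasks, columns: experts).
acc = [[6.0, 1.0, 1.0],
       [1.0, 2.0, 5.0]]

joint, tot = normalize(acc)       # approximation of P(T_i, E_j)
mi_normalized = mi_formula(joint)  # mutual information per the definition
mi_raw = mi_formula(acc)           # same formula on the unnormalized sums

# The two values are related by an affine map that depends only on tot.
print(mi_normalized, mi_raw, tot * (mi_normalized - math.log(tot)))
```

Under these assumptions the gradient direction is the same in both cases; only the normalized version can be read as a mutual information value.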