UMass-Foundation-Model / Mod-Squad

Why does the top-k gate computation not use noise? #8

Closed zhuyuedlut closed 3 months ago

zhuyuedlut commented 4 months ago

[image]

tankche1 commented 4 months ago

In practice, we don't find much difference whether we add noise or not.
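For readers following the thread, here is a minimal sketch of what "adding noise" means in top-k gating, in the commonly used form where Gaussian noise scaled by a softplus term is added to the gate logits before the top-k selection. It is plain Python for illustration only and is not the Mod-Squad code; the names `top_k_gates`, `clean_logits`, and `noise_raw` are hypothetical.

```python
import math
import random


def softplus(v):
    # Numerically stable log(1 + exp(v)).
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def top_k_gates(clean_logits, noise_raw=None, k=2, noisy=False, rng=None):
    """Return one gate weight per expert; only k entries are non-zero.

    clean_logits: per-expert gate logits for a single token (assumed input).
    noise_raw:    per-expert raw noise scales, only needed when noisy=True.
    """
    logits = list(clean_logits)
    if noisy:
        if noise_raw is None:
            raise ValueError("noise_raw is required when noisy=True")
        rng = rng or random.Random()
        # Add standard-normal noise scaled by softplus(noise_raw) per expert.
        logits = [
            logit + rng.gauss(0.0, 1.0) * softplus(scale)
            for logit, scale in zip(logits, noise_raw)
        ]
    # Indices of the k largest (possibly noisy) logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts.
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    gates = [0.0] * len(logits)
    for i in top:
        gates[i] = exps[i] / z
    return gates
```

With `noisy=False` the selection depends only on the clean logits, which is the behaviour the question is about; with `noisy=True` the same function perturbs the logits before selecting experts.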

zhuyuedlut commented 3 months ago

By the way, about the loss: I found that you do not use switchloss and zloss, even though you compute them.
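For context, the two names usually refer to auxiliary router losses from the mixture-of-experts literature: a Switch-style load-balancing loss and a router z-loss. The sketch below shows those standard definitions in plain Python, assuming top-1 dispatch for the balancing term. Whether the repository's `switchloss` and `zloss` match these definitions exactly is an assumption, and the function names here are hypothetical.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [e / z for e in exps]


def switch_load_balance_loss(router_logits):
    """N * sum_i f_i * P_i over experts, assuming top-1 dispatch.

    f_i: fraction of tokens whose highest-probability expert is i.
    P_i: mean router probability assigned to expert i.
    router_logits: list of per-token lists of expert logits.
    """
    n_tokens = len(router_logits)
    n_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]
    frac = [0.0] * n_experts
    for p in probs:
        frac[p.index(max(p))] += 1.0 / n_tokens
    mean_prob = [sum(p[i] for p in probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(f * q for f, q in zip(frac, mean_prob))


def router_z_loss(router_logits):
    """Mean over tokens of the squared log-sum-exp of the router logits."""
    total = 0.0
    for row in router_logits:
        m = max(row)
        lse = m + math.log(sum(math.exp(v - m) for v in row))
        total += lse ** 2
    return total / len(router_logits)
```

Computing these values has no effect on training unless they are added to the objective, typically as weighted terms alongside the task loss.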