Open 152334H opened 8 months ago
Hi! No, we don't support their selective precision. Although it would be quite easy to add if you wanted to try it!
In practice we haven't had any issues with router instability, although that paper trains models much larger (and with different systems/software) than what we have. If you're training models with FLOPs equivalent to dense models of 10B parameters or less I suspect you will be fine, based on our experience.
To my understanding -- and please correct me if I am wrong about this -- there is no mechanism to selectively compute the routing logits in fp32, as suggested in e.g. the Switch Transformers paper. Basis:
`moe_lbl_in_fp32`
Is this correct? If so, have you observed any instabilities in practice during training? Perhaps it is just not necessary...
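For reference, the selective precision asked about here can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not code from this repository: `router_probs`, `w_router`, and `selective_fp32` are hypothetical names, and float16 stands in for bf16 because NumPy has no native bf16 type. The idea, as described in Switch Transformers, is to upcast only the router's input to fp32, compute the logits and softmax there, and cast the result back to the low-precision dtype.

```python
import numpy as np


def router_probs(x, w_router, selective_fp32=True):
    """Return per-token routing probabilities.

    x:        (tokens, hidden) activations in a low-precision dtype
    w_router: (hidden, experts) router weights
    Hypothetical helper for illustration only.
    """
    # With selective precision, only the router math runs in fp32;
    # otherwise everything stays in the dtype of the activations.
    compute_dtype = np.float32 if selective_fp32 else x.dtype

    logits = x.astype(compute_dtype) @ w_router.astype(compute_dtype)

    # Numerically stable softmax over the expert dimension.
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=-1, keepdims=True)

    # Cast back so the rest of the layer keeps running in low precision.
    return probs.astype(x.dtype)
```

Calling `router_probs(x, w_router)` with float16 inputs returns float16 probabilities whose intermediate logits and softmax were computed in fp32; passing `selective_fp32=False` keeps the whole computation in the input dtype for comparison.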