databricks / megablocks

selective router precision #91

Open 152334H opened 8 months ago

152334H commented 8 months ago

To my understanding -- and please correct me if I am wrong about this -- there is no mechanism to selectively compute the routing logits in fp32, as suggested in e.g. the Switch Transformers paper. Basis:

  1. The only mention of fp32/float computation I see anywhere is for moe_lbl_in_fp32.
  2. the router is initialized with the same dtype as the MLP weights (as configured by Arguments).
  3. There does not seem to be any explicit casting or autocast deactivation in router.py, nor any attempt to do so in dMoE.
  4. Given that the router is implemented as a torch.nn.Linear, and the input to the router is pre-cast to autocast's precision, I can only presume that the computation is done in half precision under normal AMP training (see the quick check below).

Is this correct? If so, have you observed any instabilities in practice during training? Perhaps it is just not necessary...
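
To illustrate point 4, here is a quick standalone check (a minimal sketch, not MegaBlocks code; sizes and names are illustrative) showing that a plain torch.nn.Linear router emits half-precision logits under autocast:

```python
# Minimal repro: under AMP, a plain nn.Linear router produces half-precision logits.
import torch

router = torch.nn.Linear(1024, 8).cuda()   # hidden_size -> num_experts
x = torch.randn(4, 1024, device="cuda")    # a batch of token activations

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    logits = router(x)

print(logits.dtype)  # torch.bfloat16 -- no fp32 routing path is taken
```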

tgale96 commented 8 months ago

Hi! No, we don't support their selective precision. Although it would be quite easy to add if you wanted to try it!

In practice we haven't had any issues with router instability, although that paper trains much larger models (and with different systems/software) than we do. If you're training models with FLOPs equivalent to a dense model of 10B parameters or less, I suspect you will be fine, based on our experience.
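
For anyone who wants to try it, a rough sketch of what Switch-Transformer-style selective precision could look like, assuming a router that wraps torch.nn.Linear (the class and names below are hypothetical, not part of the MegaBlocks API):

```python
import torch

class FP32Router(torch.nn.Module):
    """Router that computes its logits in fp32 even inside an autocast region."""

    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        # Keep the router weights in fp32 (the default nn.Linear dtype).
        self.layer = torch.nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Exit any enclosing autocast region and upcast the input, so the
        # matmul (and any subsequent softmax over experts) runs at full precision.
        with torch.autocast(device_type=x.device.type, enabled=False):
            return self.layer(x.float())
```

The downstream expert dispatch can then cast the routing weights back to the compute dtype if needed; only the logit/softmax path stays in fp32.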