Closed gouchangjiang closed 5 months ago
By the way, should we use balancing_loss when using megablocks?
Hi! We save the data required to compute the load balancing loss, rather than the loss itself, so that we can compute the LBL for all layers at once using batched_load_balancing_loss.
And yes, we highly recommend using load balancing losses when training MoEs!
Hi, in the forward function of ParallelMLP, should we save the load_balancing_loss directly, or a tuple of tokens_per_expert and scores? In other words, should line 428,
save_load_balancing_loss((tokens_per_expert, scores))
be replaced by
save_load_balancing_loss(self.load_balancing_loss(tokens_per_expert, scores))
?