Open peterjc123 opened 5 months ago
That sounds reasonable to me! We'd be happy to take a PR if you'd like to fix this in our Megatron-LM fork!
@tgale96 I've put up a fix. Please take a look when you have time. https://github.com/stanford-futuredata/Megatron-LM/pull/6
The loss func is always `moe_loss_func`, as can be seen here. But the MoE loss is only calculated during training, which can be seen here. We should fall back to the original loss func during evaluation.
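A minimal sketch of the proposed fallback, to make the idea concrete. This is not the code from the fork or from the PR: `default_loss_func`, `select_loss_func`, the `aux_losses` list, and the simplified signatures are all hypothetical stand-ins, and plain Python lists are used instead of tensors so the snippet runs on its own.

```python
from functools import partial


def default_loss_func(loss_mask, output):
    # Hypothetical stand-in for the original loss: masked mean of per-token losses.
    total = sum(o * m for o, m in zip(output, loss_mask))
    return total / sum(loss_mask)


def moe_loss_func(loss_mask, output, aux_losses):
    # Hypothetical stand-in for the MoE loss: the original loss plus an
    # auxiliary term that is assumed to be recorded only during training.
    if not aux_losses:
        raise RuntimeError("no auxiliary MoE loss was recorded for this step")
    return default_loss_func(loss_mask, output) + sum(aux_losses)


def select_loss_func(training, loss_mask, aux_losses):
    # Use the MoE loss only when training; fall back to the original loss
    # during evaluation, where no auxiliary loss is available.
    if training:
        return partial(moe_loss_func, loss_mask, aux_losses=aux_losses)
    return partial(default_loss_func, loss_mask)
```

With this shape, the evaluation path never touches the auxiliary term, so an empty `aux_losses` at eval time is not an error. In the real code the training flag would presumably come from the model (for example `model.training` on a PyTorch module), but that is an assumption about how the fix would be wired in.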