databricks / megablocks

Apache License 2.0

Change router weight norm from in-place #70

Closed sashaDoubov closed 6 months ago

sashaDoubov commented 6 months ago

I was seeing:

```
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [8192, 2]], which is output 0 of DivBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
```

This change replaces the in-place router weight normalization with an out-of-place version, which resolves the error.
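A hypothetical minimal sketch of the failure mode (not the actual megablocks code; shapes and ops are illustrative): when a tensor produced by a division is saved by autograd for a later backward pass, mutating it in place bumps its version counter, and `backward()` raises the error above. Allocating a new tensor instead of mutating avoids it.

```python
import torch

# Reproduce the error: `scores` is output 0 of DivBackward0, and pow's
# backward saves it. The in-place mul_ bumps its version counter, so
# backward() detects the mismatch and raises RuntimeError.
x = torch.randn(8192, 2, requires_grad=True)
scores = x / x.sum(dim=-1, keepdim=True)   # output 0 of DivBackward0
loss = scores.pow(2).sum()                 # pow's backward saves `scores`
scores.mul_(0.5)                           # in-place edit; version 0 -> 1
try:
    loss.backward()
    err = None
except RuntimeError as e:
    err = str(e)                           # "...modified by an inplace operation..."

# Out-of-place fix: build a new tensor and leave the saved one untouched.
x2 = torch.randn(8192, 2, requires_grad=True)
scores2 = x2 / x2.sum(dim=-1, keepdim=True)
loss2 = scores2.pow(2).sum()
scores2 = scores2 * 0.5                    # new tensor; saved activation intact
loss2.backward()                           # succeeds
```

The same principle applies to the router weights here: any normalization applied in place to a tensor that participates in the autograd graph must instead produce a fresh tensor.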