In the notebook 03_backprop, in the `lin_grad` function that computes the gradient of the linear layer, the gradient calculation `w.g = (inp.unsqueeze(-1) * out.g.unsqueeze(1)).sum(0)` seems to be way slower than the dot-product version `w.g = inp.t() @ out.g`.
Time-complexity-wise, both seem to be the same: O(m × n × p).
Is the performance gain due to the way each is implemented?
Note: This is not an actual bug, but the huge performance difference between the two implementations confused me a lot; that's why I raised it as an issue.
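
For reference, here is a minimal standalone benchmark sketch (not from the notebook; the tensor sizes are made up) that reproduces the gap on CPU:

```python
import time
import torch

m, n, p = 1024, 784, 50  # hypothetical batch / input / output sizes
inp = torch.randn(m, n)
out_g = torch.randn(m, p)

def time_fn(f, reps=10):
    # warm up once, then average wall-clock time over several runs
    f()
    t0 = time.perf_counter()
    for _ in range(reps):
        f()
    return (time.perf_counter() - t0) / reps

# broadcast multiply then sum: materializes a full (m, n, p) intermediate tensor
t_broadcast = time_fn(lambda: (inp.unsqueeze(-1) * out_g.unsqueeze(1)).sum(0))

# matmul: same O(m*n*p) flop count, but dispatches to an optimized BLAS kernel
# and never allocates the (m, n, p) intermediate
t_matmul = time_fn(lambda: inp.t() @ out_g)

print(f"broadcast+sum: {t_broadcast:.4f}s  matmul: {t_matmul:.4f}s")
```

Both produce the same `(n, p)` gradient, so the difference comes purely from how the work is executed, not from the asymptotic flop count.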