In the notebook 03_backprop, in the `lin_grad` function that computes the gradient of the linear layer, the gradient calculation `w.g = (inp.unsqueeze(-1) * out.g.unsqueeze(1)).sum(0)` seems to be way slower than the dot-product version `w.g = inp.t() @ out.g`.
Time-complexity-wise, both seem to be the same: O(m × n × p).
Is the performance gain due to the way each is implemented?
Note: This is not an actual bug, but the huge performance difference between the two implementations confused me a lot; that's why I raised it as an issue.
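
For reference, here is a minimal standalone benchmark sketch (not from the notebook; the tensor sizes are made up) that reproduces the gap on CPU:

```python
import time
import torch

m, n, p = 1024, 784, 50  # hypothetical batch / input / output sizes
inp = torch.randn(m, n)
out_g = torch.randn(m, p)

def time_fn(f, reps=10):
    # warm up once, then average wall-clock time over several runs
    f()
    t0 = time.perf_counter()
    for _ in range(reps):
        f()
    return (time.perf_counter() - t0) / reps

# broadcast multiply then sum: materializes a full (m, n, p) intermediate tensor
t_broadcast = time_fn(lambda: (inp.unsqueeze(-1) * out_g.unsqueeze(1)).sum(0))

# matmul: same O(m*n*p) flop count, but dispatches to an optimized BLAS kernel
# and never allocates the (m, n, p) intermediate
t_matmul = time_fn(lambda: inp.t() @ out_g)

print(f"broadcast+sum: {t_broadcast:.4f}s  matmul: {t_matmul:.4f}s")
```

Both produce the same `(n, p)` gradient, so the difference comes purely from how the work is executed, not from the asymptotic flop count.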