In the original code, the loss tensor's grad_fn was None (empty), so calling backward() did not actually propagate any gradients back to the model.
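A minimal sketch of what that looks like (the model and shapes here are made up for illustration): a loss built directly from the model's output keeps a grad_fn, while a loss rebuilt from raw Python numbers is detached from the graph and cannot reach the weights.

```python
import torch

model = torch.nn.Linear(4, 1)
x = torch.randn(8, 4)

# Loss built from the model's output stays in the autograd graph.
good_loss = model(x).mean()
print(good_loss.grad_fn)   # e.g. <MeanBackward0 ...> -- backward() reaches the weights

# Loss rebuilt from a plain number (or numpy) is detached from the graph.
bad_loss = torch.tensor(good_loss.item(), requires_grad=True)
print(bad_loss.grad_fn)    # None -- backward() never touches the model's parameters
```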
On the proposed loss-changes branch, the loss now has a non-empty grad_fn. However, I don't think the backward graph behind that grad_fn reaches all of the parameters it needs to update. If the graph is complete, the model's parameters should change after an optimizer step; I haven't tested this yet.
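One way to test it, sketched below under the assumption that we have a `loss_fn(model, batch)` helper (hypothetical name) returning the scalar training loss: take one optimizer step and report which parameters received gradients and actually changed.

```python
import torch

def check_parameters_update(model, loss_fn, batch, lr=1e-3):
    """Run one optimizer step and report which parameters actually changed.

    `loss_fn(model, batch)` is an assumed helper returning the scalar loss.
    """
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    before = {n: p.detach().clone() for n, p in model.named_parameters()}

    loss = loss_fn(model, batch)
    assert loss.grad_fn is not None, "loss is detached from the autograd graph"

    opt.zero_grad()
    loss.backward()
    opt.step()

    for name, p in model.named_parameters():
        changed = not torch.equal(before[name], p.detach())
        grad_state = "set" if p.grad is not None else "None"
        print(f"{name}: grad={grad_state}, changed={changed}")
```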
Additionally, the grad_fn requirement makes the zero-prediction issue significantly harder: only tensors produced by the model's layers carry a grad_fn, and a grad_fn is needed for backpropagation, so when the model emits zero predictions there is nothing differentiable left to build the loss from. One option is to force the model to produce a minimum number of predictions during training; a sketch follows.
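A rough sketch of that idea, with hypothetical names (`raw_scores`, `threshold`, `min_keep` are all assumptions about the model's output, not the actual code): keep everything above the threshold as usual, but fall back to the top-scoring candidates when nothing passes, so the loss is always built from tensors that still carry a grad_fn.

```python
import torch

def training_scores(raw_scores: torch.Tensor, threshold: float = 0.5, min_keep: int = 1):
    """Select predictions for the loss without ever returning an empty set.

    `raw_scores` are assumed to be the model's per-candidate scores, still
    attached to the autograd graph. Indexing them (rather than rebuilding
    new tensors) keeps the selected scores differentiable.
    """
    keep = raw_scores > threshold
    if keep.sum() < min_keep:
        # Nothing (or too little) passed the threshold: keep the top-k
        # candidates instead so backward() still has a path to the model.
        _, idx = raw_scores.topk(min(min_keep, raw_scores.numel()))
        return raw_scores[idx]
    return raw_scores[keep]
```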