Closed — dxiao2003 closed this issue 5 years ago
Well... the loss is just a metric; it doesn't change the optimisation. The question is whether it's more useful to report the loss per batch or the mean loss per example. I think the latter is more useful, especially if the batch size varies during training.
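For concreteness, a minimal numpy sketch of the two reporting conventions (the variable names are illustrative, not from the spaCy source):

```python
import numpy as np

scores = np.array([[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]])  # model outputs
truths = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # gold labels

per_batch_loss = ((scores - truths) ** 2).sum()           # grows with batch size
mean_per_example_loss = per_batch_loss / scores.shape[0]  # comparable across batch sizes
```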
Hi, yes, I agree with you on both points: it does not affect optimization in the SGD step, and the per-example loss is more useful.
However, the bug still presents a problem because:

1. Code from 2.0 that uses the loss to decide when to stop training early (once the loss is sufficiently small) will now misbehave in 2.1, because the loss calculation is different.
2. The new code does not compute the loss per example; it computes the per-example loss divided by the batch size, because there's a factor of `scores.shape[0]**2` in the denominator from the definition of `d_scores` (see the sketch after this list).
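A small numpy sketch of the discrepancy (illustrative, not the spaCy source): squaring the pre-scaled `d_scores` divides the per-example loss by the batch size a second time.

```python
import numpy as np

n = 4  # batch size
scores = np.random.rand(n, 3)
truths = np.random.rand(n, 3)

d_scores = (scores - truths) / scores.shape[0]        # gradient, pre-divided by n
buggy_loss = (d_scores ** 2).sum()                    # = sum((scores - truths)**2) / n**2
per_example_loss = ((scores - truths) ** 2).sum() / n

assert np.isclose(buggy_loss, per_example_loss / n)   # off by an extra factor of n
```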
The following line is incorrect:

https://github.com/explosion/spaCy/blob/89379a7fa45f94bce4945284a7781eaaa7bc06ff/spacy/pipeline/pipes.pyx#L962

Since `d_scores = (scores - truths) / scores.shape[0]`, squaring it introduces an extra factor of `scores.shape[0]` in the denominator.
I suggest changing this back to the previous version.
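For illustration, a minimal sketch of a loss computed per example without the extra factor (hypothetical standalone code, not the actual `get_loss` signature in `pipes.pyx`):

```python
import numpy as np

def get_loss(scores: np.ndarray, truths: np.ndarray):
    # Gradient of the squared error, scaled by the batch size (as in 2.1).
    d_scores = (scores - truths) / scores.shape[0]
    # Per-example loss: divide the summed squared error by the batch size
    # once, rather than squaring the already-scaled d_scores.
    loss = ((scores - truths) ** 2).sum() / scores.shape[0]
    return float(loss), d_scores
```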