Chung-I opened this issue 5 years ago
@Alexander-H-Liu I think this is a wonderful PR, can you merge it ASAP?
@Chung-I I notice that you used cross entropy in ocd_loss rather than KL divergence (which is what the paper 'Optimal Completion Distillation for Sequence Learning' officially uses). Is this PR a correct implementation of ocd_loss? Thanks.
Should ocd_loss be something like this?
optimal_probs = F.softmax(q_val / temp, dim=-1)
loss += (optimal_probs * (torch.log(optimal_probs) - F.log_softmax(out_probs[b, :len_sample, :], dim=-1))).sum(dim=-1).mean()
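For reference, PyTorch also ships F.kl_div, which expects log-probabilities as input and probabilities as target; a minimal sketch of the same loss using it (the names q_val, out_probs, temp, b, and len_sample are just assumed to match the snippet above):

import torch.nn.functional as F

# Assumed shapes: q_val and out_probs[b, :len_sample, :] are both (len_sample, vocab_size).
optimal_probs = F.softmax(q_val / temp, dim=-1)                 # target distribution p
log_q = F.log_softmax(out_probs[b, :len_sample, :], dim=-1)     # model log-probs log q
loss += F.kl_div(log_q, optimal_probs, reduction='none').sum(dim=-1).mean()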
Yes, as the paper indicates, the loss they used is KL divergence; however, when performing backprop in this scenario, the two losses are actually equivalent in terms of gradient computation. Consider this: KL(p||q) = ∫ p(x) log [p(x)/q(x)] dx = ∫ p(x) log p(x) dx - ∫ p(x) log q(x) dx = H(p, q) - H(p).
So H(p, q) - KL(p||q) = H(p).
H(p), while it varies with the number of targets and with the temperature τ, does not contribute to the gradients: d KL(p||q) / d q = d [H(p, q) - H(p)] / d q = d H(p, q) / d q.
So the two losses are equivalent in backprop despite having different values.
But of course H(p, q) is not a divergence, since a divergence requires D(p||q) = 0 if and only if p = q; when p = q, H(p, q) = H(p) > 0 while KL(p||q) = 0.
It's true that if you really want to see how much q differs from p, KL divergence is the right loss to use. But after communicating with Alex (the owner of the repo), we decided to just ignore the H(p) term and use H(p, q).
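A small self-contained check of this argument (all names and shapes below are made up for illustration): H(p) has no dependence on the logits that produce q, so the gradients of H(p, q) and KL(p||q) with respect to those logits coincide even though the loss values differ by H(p).

import torch
import torch.nn.functional as F

torch.manual_seed(0)
p = F.softmax(torch.randn(5), dim=-1)            # fixed target distribution (no grad)
logits = torch.randn(5, requires_grad=True)      # parameters that produce q
log_q = F.log_softmax(logits, dim=-1)

ce = -(p * log_q).sum()                          # H(p, q)
kl = (p * (p.log() - log_q)).sum()               # KL(p||q) = H(p, q) - H(p)

g_ce, = torch.autograd.grad(ce, logits, retain_graph=True)
g_kl, = torch.autograd.grad(kl, logits)
print(torch.allclose(g_ce, g_kl))                # True, although ce != kl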
I see, thanks for your reply! There is a question I would like to ask you: do we need to implement the backprop ourselves when designing a new loss? Recently I was trying to reproduce CTC (which uses a dynamic programming algorithm). Existing CTC repos such as Baidu's warp-ctc not only implement the forward part but also compute the gradient by hand, yet it seems we don't need to do so for ocd_loss, so I'm confused. Should we calculate the gradient ourselves?
I think PyTorch does automatic differentiation for you.
Baidu implemented their own backward function because they wanted their own optimized version (DeepSpeech2, page 27).
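A toy illustration of that point (none of this comes from the repo): any loss written purely in terms of differentiable torch ops gets its gradient from autograd, and PyTorch even provides the CTC dynamic-programming loss as F.ctc_loss, so a hand-written backward like warp-ctc's is only needed when you want a custom, optimized kernel.

import torch
import torch.nn.functional as F

# A hand-written loss built from differentiable ops: no manual backward needed.
logits = torch.randn(4, 10, requires_grad=True)
target = F.softmax(torch.randn(4, 10), dim=-1)
loss = -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
loss.backward()                                   # autograd fills logits.grad

# The built-in CTC loss already runs the forward-backward DP internally.
log_probs = F.log_softmax(torch.randn(50, 2, 20, requires_grad=True), dim=-1)  # (T, N, C)
targets = torch.randint(1, 20, (2, 30), dtype=torch.long)
input_lengths = torch.full((2,), 50, dtype=torch.long)
target_lengths = torch.tensor([30, 25], dtype=torch.long)
ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)
ctc.backward()                                    # gradients again come from autograd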
Implement Optimal Completion Distillation. Add a new config named libri_ocd_example.yaml which enables OCD training. Not well tested; might have bugs. Temperature annealing is not yet implemented; the temperature is currently fixed at 1e-8 (sharpest).
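For completeness, one simple way the missing temperature annealing could be added is an exponential decay from an initial temperature down to the current hard value of 1e-8; the function below is only an assumed sketch (the name, schedule, and defaults are not from the repo):

def annealed_temperature(step, total_steps, temp_start=1.0, temp_end=1e-8):
    # Exponentially decay the OCD softmax temperature from temp_start to temp_end.
    frac = min(step / max(total_steps, 1), 1.0)
    return temp_start * (temp_end / temp_start) ** frac

# annealed_temperature(0, 100000) -> 1.0; annealed_temperature(100000, 100000) -> 1e-8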