Alexander-H-Liu / End-to-end-ASR-Pytorch

This is an open source project (formerly named Listen, Attend and Spell - PyTorch Implementation) for end-to-end ASR implemented with PyTorch, the well-known deep learning toolkit.
MIT License

Ocd #19

Open Chung-I opened 5 years ago

Chung-I commented 5 years ago

Implements Optimal Completion Distillation. Adds a new config named libri_ocd_example.yaml which enables OCD training. Not well tested; there might be bugs inside. Temperature annealing is not yet implemented; the temperature is currently fixed at 1e-8 (sharpest).
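For context, OCD roughly works like this: for every prefix the model has sampled, find the next tokens that keep the edit distance to the reference minimal, and soften them into a target distribution with a temperature softmax. Below is a rough, illustrative sketch of that idea, not this PR's actual code; ocd_targets and eos_id are hypothetical names, and q_val / temp only mirror the config described above:

```python
import torch
import torch.nn.functional as F

def ocd_targets(sampled, reference, vocab_size, eos_id, temp=1e-8):
    """For each prefix of `sampled` (a list of token ids), mark as optimal
    the next tokens that keep the edit distance to `reference` minimal,
    then soften them into a target distribution with a temperature softmax
    (temp -> 0 approaches uniform mass over the optimal tokens)."""
    T, R = len(sampled), len(reference)
    # dp[j] = edit distance between the current sampled prefix and reference[:j]
    dp = torch.arange(R + 1, dtype=torch.float)
    q_val = torch.full((T, vocab_size), -1.0)  # suboptimal tokens: Q = -1
    for t in range(T):
        m = dp.min()
        # extending any minimum-distance reference prefix is optimal
        opt = [reference[j] for j in range(R) if dp[j] == m]
        if dp[R] == m:            # reference fully consumed: EOS is optimal
            opt.append(eos_id)
        q_val[t, opt] = 0.0       # optimal tokens: Q = 0
        # standard one-row edit-distance update for the next sampled token
        new = torch.empty_like(dp)
        new[0] = dp[0] + 1
        for j in range(1, R + 1):
            sub = dp[j - 1] + (sampled[t] != reference[j - 1])
            new[j] = min(dp[j] + 1, new[j - 1] + 1, sub)
        dp = new
    return F.softmax(q_val / temp, dim=-1)
```

With temp = 1e-8 as described above, the softmax effectively puts uniform mass on the optimal tokens, which matches the "sharpest" setting.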

Liangtaiwan commented 5 years ago

@Alexander-H-Liu I think this is a wonderful PR; can you merge it ASAP?

xingchensong commented 5 years ago

@Chung-I I notice that you used cross-entropy in ocd_loss rather than KL divergence (which is what the paper 'Optimal Completion Distillation for Sequence Learning' uses). Is this PR a correct implementation of ocd_loss? Thanks.

xingchensong commented 5 years ago

Should ocd_loss be something like this?

optimal_probs = F.softmax(q_val / temp, dim=-1)

loss += (optimal_probs * (torch.log(optimal_probs) - F.log_softmax(out_probs[b, :len_sample, :], dim=-1))).sum(dim=-1).mean()
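For reference, the same per-step quantity can be written with PyTorch's built-in F.kl_div. This is a sketch under assumed shapes (time x vocab logits), with logits, q_val, and temp standing in for the variables in the snippet above:

```python
import torch
import torch.nn.functional as F

def ocd_kl_loss(logits, q_val, temp=1.0):
    """KL(p || q) per step, where p = softmax(q_val / temp) is the OCD
    target and q = softmax(logits) is the model distribution."""
    log_q = F.log_softmax(logits, dim=-1)
    p = F.softmax(q_val / temp, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as
    # target, computing p * (log p - log q) elementwise.
    return F.kl_div(log_q, p, reduction='none').sum(dim=-1).mean()
```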

Chung-I commented 5 years ago

Yes, as the paper indicated, the loss they used is KL divergence; however, when performing backprop in this scenario, the two losses are actually equivalent in terms of gradient calculation. Consider this:

KL(p||q) = ∫ p(x) log [p(x)/q(x)] dx = ∫ p(x) log p(x) dx - ∫ p(x) log q(x) dx = H(p, q) - H(p).

So H(p, q) - KL(p||q) = H(p).

H(p), while varying with the number of targets and the temperature τ, doesn't contribute to the gradient, since it doesn't depend on q:

d KL(p||q) / d q = d [H(p, q) - H(p)] / d q = d H(p, q) / d q.

So the two losses are equivalent in backprop despite having different values.

But of course H(p, q) is not a divergence, since a divergence requires D(p||q) = 0 if and only if p = q; when p = q, H(p, q) = H(p), which is generally > 0, while KL(p||q) = 0.

It's true that if you really want to measure how much q differs from p, KL divergence is the right loss to use. But after communicating with Alex (the owner of the repo), we decided to just ignore the H(p) term and use H(p, q).
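Here is a quick standalone check of this claim (illustrative, not code from the repo): the two losses produce identical gradients with respect to the logits and differ only by the constant H(p):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, requires_grad=True)  # model outputs (q side)
p = F.softmax(torch.randn(5), dim=-1)        # fixed OCD-style target (p side)

ce = -(p * F.log_softmax(logits, dim=-1)).sum()             # H(p, q)
kl = (p * (p.log() - F.log_softmax(logits, dim=-1))).sum()  # KL(p || q)

(g_ce,) = torch.autograd.grad(ce, logits, retain_graph=True)
(g_kl,) = torch.autograd.grad(kl, logits)

print(torch.allclose(g_ce, g_kl))  # True: identical gradients
print((ce - kl).item())            # H(p): the constant offset
```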

xingchensong commented 5 years ago

I see, thanks for your reply! There is a question I'd like to consult you on: do we need to implement the backward pass ourselves when designing a new loss? Recently I was trying to reproduce CTC (which uses a dynamic programming algorithm). Existing CTC repos such as Baidu's warp-ctc not only implement the forward part but also compute the gradient by hand, yet it seems we don't need to do so in ocd_loss, so I'm confused. Should we compute the gradient ourselves?

Chung-I commented 5 years ago

I think PyTorch does automatic differentiation for you.

Baidu implemented their own backward function because they wanted their own optimized version. (Deep Speech 2, page 27)
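To make that concrete, here is a minimal sketch (unrelated to warp-ctc's actual kernels): a loss composed of existing torch ops gets its gradient from autograd automatically, while subclassing torch.autograd.Function is what you would do to hand-code an optimized backward:

```python
import torch

def my_loss(x):
    # plain composition of torch ops: autograd derives the gradient for you
    return (x.sigmoid() - 1).pow(2).sum()

class MyLoss(torch.autograd.Function):
    # hand-written gradient, only worth doing for speed or numerical
    # stability (which is warp-ctc's reason for CTC)
    @staticmethod
    def forward(ctx, x):
        s = x.sigmoid()
        ctx.save_for_backward(s)
        return (s - 1).pow(2).sum()

    @staticmethod
    def backward(ctx, grad_out):
        (s,) = ctx.saved_tensors
        # d/dx (s - 1)^2 = 2(s - 1) * s * (1 - s), chained with incoming grad
        return grad_out * 2 * (s - 1) * s * (1 - s)

x = torch.randn(4, requires_grad=True)
(g_auto,) = torch.autograd.grad(my_loss(x), x)
(g_hand,) = torch.autograd.grad(MyLoss.apply(x), x)
print(torch.allclose(g_auto, g_hand))  # True: both routes agree
```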