chenjiasheng / mwer

mWER loss implementation in tensorflow

Question #2

Open jtkim-kaist opened 4 years ago

jtkim-kaist commented 4 years ago

It seems that your mWER loss implementation requires a prior beam search to produce the inputs to the mwer_loss function.

We can already get 'seq_logprobs' for each hypothesis during beam search; however, your implementation seems to re-compute 'seq_logprobs' by gathering log-probabilities at the tokens found by the prior beam search. Is there any reason for this re-computation?
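To illustrate what I mean by re-computation (hypothetical tensor names and shapes, not your exact code):

```python
import tensorflow as tf

# logits:     [N, T, V] decoder outputs when teacher-forcing each hypothesis
# hyp_tokens: [N, T]    token ids of the N-best hypotheses from beam search
logprobs = tf.nn.log_softmax(logits, axis=-1)                  # [N, T, V]
tok_logprobs = tf.gather(logprobs, hyp_tokens, batch_dims=2)   # [N, T]
seq_logprobs = tf.reduce_sum(tok_logprobs, axis=-1)            # [N]
```

These are the same values the beam search already accumulates as its cumulative scores.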

jtkim-kaist commented 4 years ago

Also, your weighted_relative_edit_error uses information about the ground truth, not the n-best list only.

chenjiasheng commented 3 years ago

Sorry for the late reply.

"It seems that your mwer loss implementation needs prior beam search for inputs for mwer_loss function." --- Yes, I implemented the 2nd approch described in paper: https://arxiv.org/pdf/1712.01818.pdf

"There are two possible approximations which ensure tractability: (1) approximating the expectation with samples, or (2) restricting the summation to an N-best list, as is commonly done during sequence training for ASR."
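Concretely, the N-best form of the loss looks roughly like this (sketched from memory in the paper's notation, so double-check against the paper):

```latex
\mathcal{L}^{\text{werr}}_{\text{N-best}}(x, y^*)
  = \sum_{y_i \in \mathrm{NBest}(x, N)} \widehat{P}(y_i \mid x)\,
    \bigl(W(y_i, y^*) - \overline{W}\bigr),
\qquad
\widehat{P}(y_i \mid x)
  = \frac{P(y_i \mid x)}{\sum_{y_j \in \mathrm{NBest}(x, N)} P(y_j \mid x)}
```

where W(y_i, y*) is the number of word errors of hypothesis y_i against the ground truth y*, and the mean over the N-best list is subtracted as a baseline.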

"We can get 'seq_logprobs' during beam search for each hypothesis, however, your implementation seems to re-compute this 'seq_logprobs' using logprob with some tokens found from prior beam search. Is there any reason for this re-computation?" --- Of course you can calculate seq_logprobs outside the loss function. For me, I don't do beam search to get seq_logprobs outside the loss function, so it is NOT 'RE-compuation' for myself. Let me mention, that, there are 2 modes of pipelines to apply mWER training:.

  1. Offline Mode: 1) Run beam search over the whole training set before mWER training and save the hypotheses (without seq_probs). 2) Apply mWER fine-tuning on the saved hypotheses. We do not re-run beam search to refresh the top-N hypotheses during mWER training; the hypotheses remain unchanged, and we only adjust their relative weights, i.e., the renormalized seq_probs.
  2. On-the-fly Mode: Run beam search on the fly on each batch to get refreshed top-N hypotheses and train mWER on them.

Since beam search is very computationally expensive, what I have done in practice is the Offline Mode, sketched below.
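To make the Offline Mode concrete, here is a minimal sketch of the loss for a single utterance, assuming the hypothesis token ids and their word-error counts were saved by the prior beam search (all names are hypothetical, and this is not the exact code in this repo):

```python
import tensorflow as tf

def mwer_loss_sketch(logits, hyp_tokens, word_errors):
    """N-best mWER loss for one utterance (a sketch, not this repo's code).

    logits:      [N, T, V] current-model decoder outputs, teacher-forced
                 on the N saved hypotheses
    hyp_tokens:  [N, T]    token ids of the saved N-best hypotheses
    word_errors: [N]       word errors of each hypothesis vs. the ground
                 truth, precomputed offline
    """
    # Re-score the saved hypotheses under the current model parameters.
    # (A real implementation must also mask padded time steps.)
    logprobs = tf.nn.log_softmax(logits, axis=-1)                 # [N, T, V]
    tok_logprobs = tf.gather(logprobs, hyp_tokens, batch_dims=2)  # [N, T]
    seq_logprobs = tf.reduce_sum(tok_logprobs, axis=-1)           # [N]

    # Renormalize over the fixed N-best list:
    # softmax(log p_i) = p_i / sum_j p_j.
    renorm_probs = tf.nn.softmax(seq_logprobs)                    # [N]

    # Relative word errors: subtract the list mean as a baseline, so the
    # loss shifts probability mass from worse hypotheses to better ones.
    word_errors = tf.cast(word_errors, seq_logprobs.dtype)
    relative_errors = word_errors - tf.reduce_mean(word_errors)
    return tf.reduce_sum(renorm_probs * relative_errors)
```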

"Also, your weighted_relative_edit_error includes the information about ground truth, not n-best only." -- You can't calculate WER without ground truth. You may argue that we only need the number of word errors instead of the ground truth. As comments said, the ground truth is only used to calculate CE loss.

N: the number of candidate sequences (i.e. hypothesis sequences) plus 1. The last sequence is treated as the ground truth and is used only to compute the CE loss.
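To spell out that layout (hypothetical names; the combined objective follows the paper, which interpolates mWER with CE loss for stability):

```python
# sequences: [N, T] token ids -- rows 0..N-2 are the beam-search hypotheses,
# row N-1 is the ground truth (hypothetical layout names, for illustration).
hyp_tokens   = sequences[:-1]   # fed to the mWER term
truth_tokens = sequences[-1]    # used only for the CE term

# The paper interpolates the two losses (lambda ~ 0.01 there):
total_loss = mwer_loss + lambda_ce * ce_loss
```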