awni / transducer

A Fast Sequence Transducer Implementation with PyTorch Bindings
Apache License 2.0

always get nothing trying to use viterbi decode interface #10

Closed xiongjun19 closed 2 years ago

xiongjun19 commented 2 years ago

Hi, awni! Thanks for your great repo. I have a question about how to use the decode interface. I tried code like the following:

```python
B, T, *_ = scores.size()

logit_lengths = torch.full((B,), T, dtype=torch.int, device=scores.device)
y = torch.full([B, 1], 0, dtype=torch.int32, device=scores.device)

cur_len = 0
for i in range(T):
    old_y = y
    preds, _ = self.pred_net(old_y)
    label_lengths = torch.full((B,), cur_len, dtype=torch.int, device=scores.device)
    y = self.criterion.viterbi(scores, preds, logit_lengths, label_lengths)
    b, new_len = y.shape
    if new_len < 1:
        break
    print("shape of y is: ", y.shape)
    cur_len = new_len
```

but it always breaks out of the loop at the first step.

awni commented 2 years ago

Hmm yes I think you misunderstood the viterbi function. It's really meant to simulate teacher forcing so you would call it with the full predictions and not in a greedy fashion. Also the viterbi will remove blanks so you won't be guaranteed an output for every input frame.
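Roughly, the intended usage looks more like this (a sketch reusing the names from your snippet; `labels` and `label_lens` stand for your padded ground-truth labels and their lengths, and the exact shapes depend on your model):

```python
# Sketch, not tested: run viterbi once over the full sequences (teacher forcing)
# instead of growing the label prefix frame by frame.
# `scores`, `self.pred_net`, `self.criterion` come from the snippet above;
# `labels` (padded ground-truth labels) and `label_lens` are assumed to exist.
B, T, *_ = scores.size()
logit_lengths = torch.full((B,), T, dtype=torch.int, device=scores.device)

preds, _ = self.pred_net(labels)          # predictions for the whole label sequence
paths = self.criterion.viterbi(scores, preds, logit_lengths, label_lens)

# `paths` is the best alignment with blanks removed, so it can be shorter than T;
# there is no guarantee of one output per input frame.
```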

xiongjun19 commented 2 years ago

> Hmm yes I think you misunderstood the viterbi function. It's really meant to simulate teacher forcing so you would call it with the full predictions and not in a greedy fashion. Also the viterbi will remove blanks so you won't be guaranteed an output for every input frame.

Thank you very much, awni! I really like this project; it's fast and memory efficient. It would be even better if there were a high-performance end-to-end decoding interface, because I have implemented one in Python and it is too slow to use.

csukuangfj commented 2 years ago

By the way, there is an alternative implementation in k2, called `rnnt_loss_simple`.

There is training code and decoding code for it.

Although a joiner network that is just a simple adder can save memory, in our previous experience it leads to some degradation in WER when used for ASR training.
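For context, here is a minimal sketch (my own illustration, not the API of this repo or of k2) of the difference between a purely additive joiner and a full joiner network; the memory cost comes from the `(B, T, U, ...)` activations the full joiner has to materialize:

```python
import torch
import torch.nn as nn

class AdditiveJoiner(nn.Module):
    """Joins per-token encoder and prediction logits by summation."""
    def forward(self, enc_logits, pred_logits):
        # enc_logits: (B, T, V), pred_logits: (B, U, V).
        # This builds the (B, T, U, V) lattice explicitly for clarity; the point
        # of the additive form is that a fused loss kernel can consume the two
        # inputs directly and never materialize this tensor.
        return enc_logits.unsqueeze(2) + pred_logits.unsqueeze(1)

class FullJoiner(nn.Module):
    """Joins encoder and prediction hidden states with a small network."""
    def __init__(self, enc_dim, pred_dim, hidden_dim, vocab_size):
        super().__init__()
        self.proj = nn.Linear(enc_dim + pred_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, enc, pred):
        # enc: (B, T, H_enc), pred: (B, U, H_pred).
        B, T, _ = enc.shape
        U = pred.shape[1]
        joint = torch.cat(
            [enc.unsqueeze(2).expand(B, T, U, -1),
             pred.unsqueeze(1).expand(B, T, U, -1)], dim=-1)
        # The (B, T, U, hidden) activation must be materialized here, which is
        # where most of the memory goes.
        return self.out(torch.tanh(self.proj(joint)))
```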

awni commented 2 years ago

@csukuangfj the degradation in WER from using a simple joiner is minor in my experience. The benefits of the low memory implementation are:

  1. You can use word pieces (> 1000 tokens) easily
  2. Train with longer utterances (> 10 seconds)
  3. Train with larger batch sizes for more efficient training

Overall I think the cost of the joiner is not worth the benefit, though it would be nice to see a careful study there. However, if the above situation does not apply (small token sets, short utterances, small batches), then you won't get much gain from the memory-light and faster version.
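For a rough sense of scale (illustrative numbers, not measurements from this repo), the full joint logits alone can dominate GPU memory:

```python
# Back-of-envelope memory for the full (B, T, U, V) joint logits in float32.
# All numbers below are illustrative assumptions.
B = 32      # batch size
T = 1000    # ~10 s of audio at 100 frames/s
U = 100     # label sequence length
V = 1000    # word-piece vocabulary size
bytes_per_float = 4

gigabytes = B * T * U * V * bytes_per_float / 1e9
print(f"joint logits alone: {gigabytes:.1f} GB")  # 12.8 GB, before activations and gradients
```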