csukuangfj / transducer-loss-benchmarking


Alignment-restricted RNNT #14

Open desh2608 opened 1 year ago

desh2608 commented 1 year ago

First of all, thanks for this amazing work benchmarking the several available RNNT implementations. This is more of a "discussion" than an issue.

I am sure you are aware of this, but the FB speech group uses a kind of "pruned" RNNT where the pruning is done using external alignments (paper link). The idea is that for each token u in U, you restrict the time steps (out of T) in which it can occur, using alignments obtained from, say, a hybrid ASR system. This effectively "prunes" the lattice in a way similar to the k2 pruned RNNT. I imagine that the first-pass "trivial" joiner is approximating a similar alignment between T and U, which is then used to prune the lattice.
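To make the restriction concrete, here is a minimal sketch of how such a valid-node mask could be derived from an external alignment. This is an illustration only, not the paper's exact formulation: `ar_rnnt_mask`, its buffer parameters, and the lattice convention (node `(t, u)` = "u tokens emitted by frame t") are assumptions for this example.

```python
import numpy as np

def ar_rnnt_mask(alignment, T, U, left_buffer=0, right_buffer=0):
    """Sketch: boolean (T, U+1) mask of valid Ar-RNN-T lattice nodes.

    alignment[u] is the frame at which token u occurs according to the
    external (e.g. hybrid ASR) alignment; left_buffer/right_buffer relax
    the restriction by a few frames on each side.
    """
    mask = np.zeros((T, U + 1), dtype=bool)
    for t in range(T):
        for u in range(U + 1):
            # Node (t, u) is valid if the u-th token could already have
            # been emitted by frame t (no earlier than its aligned frame
            # minus left_buffer) and token u+1 is not yet overdue.
            lo = alignment[u - 1] - left_buffer if u > 0 else 0
            hi = alignment[u] + right_buffer if u < U else T - 1
            mask[t, u] = lo <= t <= hi
    return mask
```

With zero buffers, the mask degenerates to a narrow band that exactly follows the alignment; widening the buffers trades memory/compute for robustness to alignment errors.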

I was wondering how hard it would be to implement something like alignment-restricted RNNT in k2, given the pruned framework. From a high-level view, it would basically require using external alignments to prune the lattice, and the second pass of the loss computation could proceed as before. I am interested because I think that, if we have access to external alignments, training a model on conditions involving noise and babble in the background might be easier, since the trivial joiner may have a hard time, especially at the beginning of training.

I would be happy to hear your thoughts on the matter.

csukuangfj commented 1 year ago

Regarding http://arxiv.org/abs/2011.03072, as you commented:

The idea is that for each token u in U, you restrict the time-steps (< T) that it can occur in

In pruned RNN-T, we use:

For each time step t, we restrict the number of symbols it can emit to S, where S is a fixed parameter, e.g., 3

In Ar-RNN-T, the number of symbols that can be emitted differs at each time step t. I am afraid we cannot simply replace the trivial joiner with external alignments.
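The difference between the two schemes can be illustrated on a toy lattice. The ranges below are made up for illustration and this is not k2's actual API: pruned RNN-T keeps a fixed-width band of S consecutive u-positions per frame, while Ar-RNN-T's per-frame window width varies with the alignment.

```python
# Toy T=6, U=4 lattice comparing the two pruning schemes.
T, U, S = 6, 4, 2

# Pruned RNN-T: every frame keeps exactly S consecutive u-positions;
# in k2 the band start comes from the trivial joiner, here hard-coded.
s_begin = [0, 0, 1, 2, 2, 3]  # one start per frame, non-decreasing
pruned_nodes = {(t, u) for t in range(T)
                for u in range(s_begin[t], s_begin[t] + S)}

# Ar-RNN-T: the kept range (lo, hi) per frame comes from external
# alignments, so its width is NOT constant across frames.
ar_ranges = [(0, 1), (0, 2), (1, 4), (2, 4), (3, 4), (3, 5)]
ar_nodes = {(t, u) for t, (lo, hi) in enumerate(ar_ranges)
            for u in range(lo, hi)}

pruned_widths = [len([u for tt, u in pruned_nodes if tt == t])
                 for t in range(T)]
ar_widths = [hi - lo for lo, hi in ar_ranges]
print(pruned_widths)  # constant: S at every frame
print(ar_widths)      # variable per frame
```

This variable width is exactly why the fixed-S machinery of pruned RNN-T cannot be reused as-is.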


To implement Ar-RNN-T, I suggest using https://github.com/csukuangfj/optimized_transducer as a starting point.

I would like to help with it.

desh2608 commented 1 year ago

Thanks for your comment. Yeah, that was my main concern --- I was not sure how easy it would be to set a variable S per time step in the pruned RNNT framework.

It seems that Ar-RNN-T also uses the sequence concatenation and function merging from Microsoft's paper, so you are right that optimized_transducer would be a good starting point. I will take a look later this month and try to implement it. I don't have a lot of experience with CUDA, so your help would be much appreciated.
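For readers unfamiliar with the sequence-concatenation trick mentioned above: instead of allocating a padded `(B, T_max, U_max+1)` grid for the batch, each utterance's `(T_i, U_i+1)` grid is flattened and the grids are concatenated, so no memory is spent on padding. The sketch below is illustrative only; `concat_grids` and its return values are assumptions for this example, not optimized_transducer's real API.

```python
import numpy as np

def concat_grids(T_list, U_list):
    """Sketch: sizes and row offsets of per-utterance lattice grids
    laid out contiguously in one concatenated buffer."""
    sizes = [t * (u + 1) for t, u in zip(T_list, U_list)]
    offsets = np.cumsum([0] + sizes)  # offsets[i] = start of utt i
    return sizes, offsets

# Three utterances with different lengths T_i and transcript lengths U_i.
T_list, U_list = [4, 6, 3], [2, 5, 1]
sizes, offsets = concat_grids(T_list, U_list)

flat = int(offsets[-1])                                 # concatenated
padded = max(T_list) * (max(U_list) + 1) * len(T_list)  # padded layout
print(sizes, flat, padded)
```

The gap between `flat` and `padded` grows with length variance in the batch, which is where the memory savings come from.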