ghost opened this issue 1 year ago
I realized that the vocabulary size of 500 and the max sequence length of 680 (after subsampling by a factor of 4) may not capture some very realistic scenarios.
Actually, we have been using vocab size 500 for LibriSpeech in icefall for quite a long time.
> max sequence length of 680 (after subsampling of 4)
The value is selected so that it won't cause OOM for all of the implementations used in the benchmark.
Since torchaudio supports fp16 logits, I would suggest adding a row for that.
Are you interested in contributing that?
Thanks for your response.
> Actually, we have been using vocab size 500 for LibriSpeech in icefall for quite a long time.
I see. It would be good to have a table for larger vocabularies as well :) Does optimized_transducer being 2x slower than torchaudio at a vocab size of 4096 make sense to you?
> The value is selected so that it won't cause OOM for all of the implementations used in the benchmark.
Actually, what I meant was that you could go smaller, given that systems with 6x and 8x subsampling are pretty common these days.
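The arithmetic behind that point can be checked quickly. Assuming a 10 ms frame shift (an assumption; the benchmark's front end may differ), a max encoder length of 680 at 4x subsampling corresponds to 2720 feature frames, i.e. about 27 s of audio; the same audio at 6x or 8x subsampling yields a considerably shorter encoder output:

```python
# Back-of-the-envelope check of the max-length numbers discussed above.
# The 10 ms frame shift is an assumption, and edge/padding effects of the
# subsampling layers are ignored.
FRAME_SHIFT_MS = 10

max_encoder_len = 680               # benchmark's max length after 4x subsampling
input_frames = max_encoder_len * 4  # 2720 feature frames before subsampling
audio_seconds = input_frames * FRAME_SHIFT_MS / 1000

def encoder_len(num_input_frames: int, subsampling: int) -> int:
    """Approximate encoder output length for a given subsampling factor."""
    return num_input_frames // subsampling

print(audio_seconds)            # seconds of audio covered by 2720 frames
print(encoder_len(2720, 6))     # encoder length at 6x subsampling
print(encoder_len(2720, 8))     # encoder length at 8x subsampling
```

So for the same ~27 s utterances, a 6x or 8x system only needs a max encoder length of roughly 340-455 frames, which leaves memory headroom for larger vocabularies.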
> Are you interested in contributing that?
Sure. Did you run the benchmark on a 32G V100?
> Does optimized_transducer being 2x slower than torchaudio at 4096 vocab size make sense to you?
I don't expect optimized_transducer to be that much slower than torchaudio. We were using torchaudio as the reference implementation, https://github.com/csukuangfj/optimized_transducer/blob/master/optimized_transducer/csrc/kernels.cu#L95, but later we switched to using fast_rnnt.
> Sure. Did you run the benchmark on a 32G V100?
Thanks!
Yes, you are right.
> but later we switched to use fast_rnnt
Is that the k2 row?
Yes.
Hi, thanks for working on this and sharing it on GitHub. The setup is very easy to follow and helps draw useful conclusions. I would like to suggest some improvements for the future: