csukuangfj / transducer-loss-benchmarking


Benchmark on more realistic settings #17


ghost commented 10 months ago

Hi, thanks for working on this and sharing it on GitHub. The setup is very easy to follow and helps draw useful conclusions. I would like to suggest some improvements for the future:

  1. As I was playing with your code, I realized that the vocabulary size of 500 and the max sequence length of 680 (after 4x subsampling) may not capture some very realistic scenarios, and your tables may lead people to draw wrong conclusions. Specifically, it is common these days to use a vocabulary size of 4096 combined with 6x-8x subsampling. When I ran your benchmarks on a T4 machine with a lower max-tokens and a smaller sequence length, the picture changed quite a bit: torchaudio gave similar speed to warprnnt_numba and speechbrain, and came out 2x faster than optimized_transducer.
  2. Since torchaudio supports fp16 logits, I would suggest adding a row for that; a minimal sketch follows below.
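A minimal sketch of what such an fp16 row could time, assuming torchaudio.functional.rnnt_loss accepts half-precision logits as stated above. The shapes, sizes, and blank index below are illustrative, not the benchmark's actual configuration:

```python
import torch
import torchaudio.functional as F

# Illustrative shapes: batch, frames (after heavy subsampling), target tokens, vocab.
B, T, U, V = 8, 100, 25, 4096

# fp16 logits of shape (B, T, U + 1, V), as a joiner network would produce.
logits = torch.randn(B, T, U + 1, V, device="cuda",
                     dtype=torch.float16, requires_grad=True)
targets = torch.randint(1, V, (B, U), device="cuda", dtype=torch.int32)
logit_lengths = torch.full((B,), T, device="cuda", dtype=torch.int32)
target_lengths = torch.full((B,), U, device="cuda", dtype=torch.int32)

# blank=0 here; targets are drawn from 1..V-1 so they never collide with it.
loss = F.rnnt_loss(logits, targets, logit_lengths, target_lengths, blank=0)
loss.backward()
```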
csukuangfj commented 10 months ago

I realized that the vocabulary size of 500 and max sequence length of 680 (after subsampling of 4) may not capture some very realistic scenarios

Actually, we have been using vocab size 500 for LibriSpeech in icefall for quite a long time.

max sequence length of 680 (after subsampling of 4)

The value was selected so that it wouldn't cause OOM for any of the implementations used in the benchmark.
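For context, a back-of-the-envelope estimate of why the logits tensor dominates memory at these settings; the batch size and target length here are assumed for illustration, and only T and V come from the benchmark:

```python
# The joiner output alone holds B x T x (U + 1) x V floats,
# and its gradient roughly doubles the footprint.
B, T, U, V = 30, 680, 80, 500  # B and U assumed; T = 680, V = 500 as above
logit_bytes_fp32 = B * T * (U + 1) * V * 4
print(f"{logit_bytes_fp32 / 2**30:.1f} GiB")  # ~3.1 GiB for the logits alone
```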


Since torchaudio supports fp16 logits, I would suggest adding a row for that.

Are you interested in contributing that?

ghost commented 10 months ago

Thanks for your response.

Actually, we have been using vocab size 500 for LibriSpeech in icefall for quite a long time.

I see. It would be good to have a table for larger vocabularies as well :) Does optimized_transducer being 2x slower than torchaudio at a vocab size of 4096 make sense to you?

The value is selected so that it won't cause OOM for all of the implementations used in the benchmark.

Actually, what I meant was that you could go smaller, given that systems with 6x and 8x subsampling are pretty common these days.

Are you interested in contributing that?

Sure. Did you run the benchmark on a 32G V100?

csukuangfj commented 10 months ago

Does optimized_transducer being 2x slower than torchaudio at 4096 vocab size make sense to you?

I don't expect optimized_transducer to be that much slower than torchaudio.

We were using torchaudio as a reference implementation (https://github.com/csukuangfj/optimized_transducer/blob/master/optimized_transducer/csrc/kernels.cu#L95), but later we switched to fast_rnnt.
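For reference, a minimal sketch of the unpruned fast_rnnt call, following its README as I recall it; the shapes are illustrative and the exact API should be checked against the fast_rnnt documentation:

```python
import torch
import fast_rnnt

B, T, U, V = 8, 100, 25, 500
logits = torch.randn(B, T, U + 1, V, device="cuda", requires_grad=True)
symbols = torch.randint(1, V, (B, U), device="cuda", dtype=torch.int64)

# Each boundary row is [begin_symbol, begin_frame, end_symbol, end_frame].
boundary = torch.zeros((B, 4), dtype=torch.int64, device="cuda")
boundary[:, 2] = U
boundary[:, 3] = T

loss = fast_rnnt.rnnt_loss(
    logits=logits,
    symbols=symbols,
    termination_symbol=0,  # the blank symbol
    boundary=boundary,
    reduction="sum",
)
loss.backward()
```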


Sure. Did you run the benchmark on a 32G V100?

Thanks!

Yes, you are right.

ghost commented 10 months ago

but later we switched to use fast_rnnt

Is that the k2 row?

csukuangfj commented 10 months ago

but later we switched to use fast_rnnt

Is that the k2 row?

Yes.