csukuangfj / optimized_transducer

Memory efficient transducer loss computation

Loss value and decoding library? #30

Open xiongjun19 opened 2 years ago

xiongjun19 commented 2 years ago

Thanks very much for your great project! I have two questions:

  1. How large is the transducer loss for a well-performing (converged) model?
  2. Is there any fast decoding solution? The beam search decoding modules I have found in many projects are extremely slow.
csukuangfj commented 2 years ago

Please have a look at https://github.com/k2-fsa/icefall

You can find tensorboard training logs in https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md

  1. How large is the transducer loss for a well-performing (converged) model?

The average loss per frame is about 0.02 or below.

Is there any fast decoding solution?

Yes, please see modified beam search in https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/transducer_stateless/beam_search.py#L363

There is only one loop over the time axis.
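In pseudocode, the idea looks roughly like this. It is only a sketch for batch size 1; decoder, joiner, and blank_id below are illustrative stand-ins, not icefall's exact API. Each hypothesis emits at most one symbol per frame, so there is no inner loop per emitted symbol:

    import torch

    def modified_beam_search_sketch(decoder, joiner, encoder_out, blank_id=0, beam=4):
        # encoder_out: (T, encoder_dim) for a single utterance
        hyps = [([blank_id], 0.0)]  # (token sequence, log probability)
        for t in range(encoder_out.size(0)):  # the only loop, over time
            candidates = []
            for tokens, score in hyps:
                dec_out = decoder(torch.tensor([tokens[-1]]))
                log_probs = joiner(encoder_out[t], dec_out).log_softmax(dim=-1)
                values, indices = log_probs.topk(beam)
                for p, tok in zip(values.tolist(), indices.tolist()):
                    # blank advances in time without emitting a symbol
                    new_tokens = tokens if tok == blank_id else tokens + [tok]
                    candidates.append((new_tokens, score + p))
            # keep only the best `beam` hypotheses for the next frame
            hyps = sorted(candidates, key=lambda h: h[1], reverse=True)[:beam]
        return hyps[0][0][1:]  # best hypothesis, initial blank dropped

    # toy stand-ins just to make the sketch executable
    V = 10
    emb = torch.nn.Embedding(V, V)
    decoder = lambda y: emb(y).squeeze(0)
    joiner = lambda enc, dec: enc + dec
    print(modified_beam_search_sketch(decoder, joiner, torch.randn(20, V)))

A real implementation also merges hypotheses that end up with identical token sequences; the sketch omits that for brevity.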

We have documentation for how to use it with a pre-trained model. Please see https://icefall.readthedocs.io/en/latest/recipes/aishell/stateless_transducer.html

There is also a Colab notebook for it https://colab.research.google.com/drive/12jpTxJB44vzwtcmJl2DTdznW0OawPb9H?usp=sharing

csukuangfj commented 2 years ago

Note: The above beam search is implemented in Python, and it decodes only one utterance at a time.

We are implementing it in C++ with CUDA, which can decode multiple utterances in parallel. Please see https://github.com/k2-fsa/k2/pull/926

It will be wrapped for Python soon.

xiongjun19 commented 2 years ago


Wow, your answer is really helpful. Thank you very much!

xiongjun19 commented 2 years ago

Dear csukuangfj! I have studied the code in https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/transducer_stateless/beam_search.py#L363 carefully, adapted it to my trained model and code structure, and compared it with the decoding method from SpeechBrain. I am applying it to a basecalling task. With batch_size: 8 and time_steps: 720, I got the following results:

  1. speech_brain_dec: acc: 94.00%; speed: 11.70 s/it;
  2. icefall_dec: acc: 93.7%; speed: 6.10 s/it;

The speed is much better; thanks for your work. Is there any documentation for the C++ decoding interface (https://github.com/k2-fsa/k2/pull/926) you mentioned before?

csukuangfj commented 2 years ago

If you try the k2 pruned RNN-T loss, https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless/model.py#L160 , it is even faster; you may get 4.0 s/it. [EDIT]: I thought you were asking about training time.

There is a Python interface for it. See https://github.com/k2-fsa/icefall/pull/250

We will add a C++ interface for it later, i.e., provide only a header file and some pre-compiled libraries.

https://github.com/k2-fsa/icefall/pull/250 is even faster if you use it for decoding.
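For reference, here is a minimal, self-contained sketch of the two-pass pruned RNN-T loss itself, following the structure of the model.py file linked above. All shapes and the s_range value are made-up placeholders, and a real model would apply its joiner network instead of the plain addition below:

    import k2
    import torch

    B, T, S, C = 2, 50, 10, 500     # batch, frames, max symbols, vocab size
    blank_id, s_range = 0, 5

    am = torch.randn(B, T, C, requires_grad=True)      # encoder out, vocab dim
    lm = torch.randn(B, S + 1, C, requires_grad=True)  # decoder out, vocab dim
    symbols = torch.randint(1, C, (B, S))
    boundary = torch.zeros(B, 4, dtype=torch.int64)
    boundary[:, 2] = S              # number of symbols per utterance
    boundary[:, 3] = T              # number of frames per utterance

    # Pass 1: a cheap "simple" loss whose gradients select a pruning window.
    simple_loss, (px_grad, py_grad) = k2.rnnt_loss_smoothed(
        lm=lm, am=am, symbols=symbols,
        termination_symbol=blank_id, boundary=boundary,
        return_grad=True, reduction="sum",
    )
    ranges = k2.get_rnnt_prune_ranges(
        px_grad=px_grad, py_grad=py_grad, boundary=boundary, s_range=s_range,
    )
    am_pruned, lm_pruned = k2.do_rnnt_pruning(am=am, lm=lm, ranges=ranges)

    # Pass 2: evaluate only the pruned (T, s_range) region of the lattice.
    logits = am_pruned + lm_pruned  # placeholder for the joiner network
    pruned_loss = k2.rnnt_loss_pruned(
        logits=logits, symbols=symbols, ranges=ranges,
        termination_symbol=blank_id, boundary=boundary, reduction="sum",
    )
    print(simple_loss.item(), pruned_loss.item())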

xiongjun19 commented 2 years ago


Dear csukuangfj! I have tried the pruned RNN-T loss from https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless/model.py#L160. I have two things to update you on. First, to my surprise, the loss is quite large, and I'm not sure whether there is a problem. The loss and metrics in my first epoch are as follows:

 loss:  9192.901783988205
 metric: accuracy: 93.09%

Second, I have modified the modified beam search method to support batched decoding, so the decoding speeds are now as follows (batch_size: 8, time_steps: 720):

    speech_brain_dec: acc: 94.00%; speed: 11.70 s/it;
    icefall_dec: acc: 93.7%; speed: 6.10 s/it;
    icefall_dec_batch: acc: 93.7%; speed: 1.73 s/it;

Thanks very much for the information; I will try the interface https://github.com/k2-fsa/icefall/pull/250 you mentioned some time later.

csukuangfj commented 2 years ago

to my surprise, the loss is quite large

Please clarify whether the loss is

  • the sum of the loss over all frames in the batch,
  • the average loss over utterances in the batch,
  • or the average loss over all frames in the batch?
By the way, how do you measure the decoding time? Do you have any RTF available?
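By RTF I mean wall-clock decoding time divided by the total duration of the input decoded. A sketch using the 1.73 s/it figure from above, with a made-up per-sample duration since I do not know the real one:

    # Sketch: turning a per-batch decoding time into a real-time factor (RTF).
    batch_time_s = 1.73          # measured seconds per batch (reported above)
    batch_size = 8
    sample_duration_s = 7.2      # placeholder: true duration of one input
    rtf = batch_time_s / (batch_size * sample_duration_s)
    print(f"RTF = {rtf:.3f}")    # values < 1.0 mean faster than real time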

xiongjun19 commented 2 years ago


The loss code is as follows:

(screenshot of the loss computation code)

So I guess the loss is the sum of the loss over all frames in the batch.

Decoding time: I am using it in a batched way, so RTF is not available in this setting. My measurement is very simple: how much time it takes to complete inference on one batch of data. I found that decoding is the bottleneck, as it takes about 99% of the time.

csukuangfj commented 2 years ago

So I guess the loss is the sum of the loss over all frames in the batch.

Yes, you can divide it by the number of acoustic frames after subsampling in the model. Please see https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless/train.py#L495

    info["frames"] = (feature_lens // params.subsampling_factor).sum().item()
xiongjun19 commented 2 years ago


OK.