Open xiongjun19 opened 2 years ago
Please have a look at https://github.com/k2-fsa/icefall
You can find tensorboard training logs in https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md
- How big is the transducer loss for a well-performing (converged) model?
The average loss per frame is about 0.02 or below.
- Is there any fast decoding solution?
Yes, please see modified beam search in https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/transducer_stateless/beam_search.py#L363
There is only one loop in the time axis.
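As a rough illustration of that single time loop, here is a minimal, self-contained sketch of a merging-style beam search over a hypothetical `joiner(t, tokens)` callable (not the actual icefall API): each hypothesis emits at most one symbol per frame, and hypotheses with identical token sequences are merged with log-add.

```python
import math
from collections import defaultdict

def modified_beam_search(joiner, num_frames, blank=0, beam=4):
    # Hypotheses: token tuple -> log probability.
    hyps = {(): 0.0}
    for t in range(num_frames):  # the only loop over the time axis
        new_hyps = defaultdict(lambda: -math.inf)
        for tokens, logp in hyps.items():
            log_probs = joiner(t, tokens)  # log-probs over the vocabulary
            for tok, lp in enumerate(log_probs):
                # Blank keeps the sequence; a non-blank symbol extends it.
                cand = tokens if tok == blank else tokens + (tok,)
                a, b = new_hyps[cand], logp + lp
                # Merge identical sequences with log-add-exp.
                if a == -math.inf:
                    new_hyps[cand] = b
                else:
                    new_hyps[cand] = max(a, b) + math.log1p(math.exp(-abs(a - b)))
        # Prune to the `beam` best hypotheses.
        hyps = dict(sorted(new_hyps.items(), key=lambda kv: -kv[1])[:beam])
    return max(hyps.items(), key=lambda kv: kv[1])[0]

# Toy joiner (hypothetical): prefers token 1 on even frames, blank (0) on odd ones.
def toy_joiner(t, tokens):
    preferred = 1 if t % 2 == 0 else 0
    return [math.log(0.8) if i == preferred else math.log(0.1) for i in range(3)]

print(modified_beam_search(toy_joiner, num_frames=4))  # -> (1, 1)
```

In a real transducer the joiner would combine the encoder output at frame `t` with the decoder state for `tokens`; the toy joiner above only stands in for that interface.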
We have documentation for how to use it with a pre-trained model. Please see https://icefall.readthedocs.io/en/latest/recipes/aishell/stateless_transducer.html
There is also a Colab notebook for it https://colab.research.google.com/drive/12jpTxJB44vzwtcmJl2DTdznW0OawPb9H?usp=sharing
Note: The above beam search is implemented in Python and decodes only one utterance at a time.
We are implementing it in C++ with CUDA, which can decode multiple utterances in parallel. Please see https://github.com/k2-fsa/k2/pull/926
It will be wrapped for Python soon.
Wow, your answer is really helpful. Thank you very much!
Dear csukuangfj! I have studied the code in https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/transducer_stateless/beam_search.py#L363 carefully, adapted it to my trained model and code structure, and compared it with the decoding method from SpeechBrain. I am applying it to a basecalling task with batch_size: 8 and time_steps: 720.
The speed is much better; thanks for your work. May I ask whether there is any documentation for the C++ decoding interface (https://github.com/k2-fsa/k2/pull/926) you mentioned before?
If you try the k2 pruned RNN-T loss, https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless/model.py#L160, it is even faster; you may get 4.0/it. [EDIT]: I thought it was training time.
There is a Python interface for it. See https://github.com/k2-fsa/icefall/pull/250
We will add a C++ interface for it later, i.e., provide only a header file and some pre-compiled libraries.
https://github.com/k2-fsa/icefall/pull/250 is even faster if you use it for decoding.
Dear csukuangfj! I have tried the RNN-T loss from https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless/model.py#L160. I have two things to share with you. First, to my surprise, the loss is quite large, and I'm not sure whether something is wrong. The loss and metrics in my first epoch are as follows:
loss: 9192.901783988205
metric: accuracy: 93.09%
Second, I have modified the modified beam search to support batched decoding. The decoding speeds are as follows (batch_size: 8, time_steps: 720):
speech_brain_dec: acc: 94.00%; speed: 11.70 s/it
icefall_dec: acc: 93.70%; speed: 6.10 s/it
icefall_dec_batch: acc: 93.70%; speed: 1.73 s/it
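For reference, the batched single-time-loop structure can be sketched as follows. This toy version uses greedy search over precomputed per-frame scores, which is simpler than the modified beam search actually benchmarked above; the `scores` array stands in for real joiner outputs (a real transducer's joiner would also depend on the previously emitted token).

```python
def batched_greedy_search(scores, blank=0):
    """Toy batched greedy decoding: one loop over the time axis.
    `scores` is a (batch, time, vocab) nested list of per-frame scores;
    each utterance independently emits its argmax token per frame,
    skipping blanks."""
    hyps = [[] for _ in scores]
    num_frames = len(scores[0])
    for t in range(num_frames):          # single loop over time
        for b, utt in enumerate(scores):
            frame = utt[t]
            tok = max(range(len(frame)), key=frame.__getitem__)
            if tok != blank:
                hyps[b].append(tok)
    return hyps

scores = [
    [[0.1, 0.8, 0.1], [0.9, 0.05, 0.05], [0.1, 0.1, 0.8]],
    [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1]],
]
print(batched_greedy_search(scores))  # -> [[1, 2], [1]]
```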
Thanks very much for the information. I will try the interface https://github.com/k2-fsa/icefall/pull/250 you mentioned some time later.
to my surprise, the loss is quite large
Please clarify whether the loss is
- the sum of the loss over all frames in the batch
- or the average loss over utterances in the batch
- or the average loss over all frames in the batch?
By the way, how do you measure the decoding time? Do you have any RTF available?
The loss code is as follows:
So I guess the loss is the sum of the loss over all frames in the batch.
Decoding time: I'm decoding in a batched way, so RTF is not directly available in this setting. My measurement is very simple: how much time it takes to complete inference on one batch of data. I found that decoding is the bottleneck, taking about 99% of the time.
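That per-batch measurement can be turned into an RTF with one extra number, the total audio duration of the batch. A minimal sketch (`decode_fn`, the batch contents, and the durations below are all placeholders; on GPU you would also call `torch.cuda.synchronize()` before reading the clock):

```python
import time

def benchmark_decode(decode_fn, batch, audio_seconds):
    """Time one batch and report the real-time factor
    (RTF = processing time / total audio duration)."""
    start = time.perf_counter()
    result = decode_fn(batch)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed / audio_seconds

# Placeholder decoder and durations: 8 utterances, e.g. 7.2 s each
# if 720 frames correspond to a (assumed) 10 ms frame shift.
hyps, elapsed, rtf = benchmark_decode(lambda batch: [[] for _ in batch],
                                      batch=[None] * 8,
                                      audio_seconds=8 * 7.2)
```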
So I guess the loss is the sum of the loss over all frames in the batch.
Yes, you can divide it by the number of acoustic frames after subsampling in the model. Please see https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless/train.py#L495
info["frames"] = (feature_lens // params.subsampling_factor).sum().item()
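Concretely, the normalization looks like this; it is only a sketch, and `subsampling_factor=4` is an illustrative value, not necessarily the factor of your model.

```python
def per_frame_loss(total_loss, feature_lens, subsampling_factor=4):
    """Convert a loss summed over all frames in the batch to an
    average per-frame loss, mirroring the bookkeeping quoted above."""
    # Number of acoustic frames after the model's subsampling.
    num_frames = sum(n // subsampling_factor for n in feature_lens)
    return total_loss / num_frames

# 8 utterances of 720 feature frames each -> 8 * 180 frames after subsampling.
print(per_frame_loss(9000.0, [720] * 8))  # -> 6.25
```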
ok
Thanks very much for your great project! I have two questions to ask: