flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Getting better accuracy with restricted/limited vocabulary #933

Open tumusudheer opened 3 years ago

tumusudheer commented 3 years ago

I'm working on command detection. My vocabulary is a bunch of names (~10,000) and a bunch of commands (~20); a sample command is CALL followed by a name. Before training, I already know both the names and the commands, so I can include them in the lexicon and language model.

I tried the following couple of approaches, but the results don't seem good. I'm using Streaming ConvNets.

Approach 1: I wanted to fine-tune from the base pre-trained model provided here, so I added all my names and commands to an existing lexicon (which I call the extended lexicon) and trained by forking from the base model.

While testing, I prepared another lexicon with only my names and commands (restricted lexicon) to restrict the outputs to my vocabulary (commands + names), and the WER is ~54%. For decoding, I prepared an LM from my training data, and the WER then became ~33%.

Approach 2: I used the lexicon prepared from only my names and commands (restricted lexicon) while training with fork (fine-tuning), and it gave very bad results. It seems this is not a good approach?

Approach 3 (which I haven't tried yet): train the model from scratch instead of fine-tuning. I wanted to avoid this because my dataset is very small, on the order of 100 hours of audio. But if this is the best approach to get better accuracy, I'll try it.

Questions:

  1. Is it possible to improve the accuracy using approach 1? Am I doing something wrong?
  2. Should I choose a different recipe model (other than streaming_convnets) for my use case?
  3. Is there a different approach I can try to improve accuracy?

Thank you,

tlikhomanenko commented 3 years ago

The model you use is a word-piece model. Could you check whether your commands and names in the lexicon end up represented as single-letter tokens? There could be a mismatch between the lexicon we used to build the word pieces and your data. One thing to try is the recent robust ASR (RASR) model, which is letter-based (you need the latest flashlight to work with it); the tutorial on fine-tuning on your data is here: https://github.com/facebookresearch/flashlight/tree/master/flashlight/app/asr/tutorial (check the colab on fine-tuning).
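If it helps, here is a minimal sketch of that check using the sentencepiece C++ API (not from this thread): it encodes each name/command with the pretrained word-piece model and counts how many pieces fall back to single letters, which would hint at a mismatch with the word-piece inventory. The model path and word list are placeholders.

#include <sentencepiece_processor.h>

#include <iostream>
#include <string>
#include <vector>

int main() {
  sentencepiece::SentencePieceProcessor sp;
  if (!sp.Load("librispeech-train-all-unigram-10000.model").ok()) {
    std::cerr << "failed to load the word-piece model\n";
    return 1;
  }
  // Placeholder vocabulary; in practice read your names/commands from a file.
  std::vector<std::string> words = {"susana", "suzette", "call"};
  const std::string sep = "\xe2\x96\x81"; // UTF-8 bytes of the word-boundary marker
  for (const auto& word : words) {
    std::vector<std::string> pieces = sp.EncodeAsPieces(word);
    int singleLetterPieces = 0;
    for (auto piece : pieces) {
      if (piece.rfind(sep, 0) == 0) {
        piece = piece.substr(sep.size()); // strip the boundary marker
      }
      if (piece.size() == 1) {
        ++singleLetterPieces;
      }
    }
    std::cout << word << ": " << pieces.size() << " pieces, "
              << singleLetterPieces << " single-letter pieces\n";
  }
  return 0;
}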

In approach 2, do you have all the words of your data in the lexicon or only the names and commands (or do the names and commands cover all words in your transcriptions)?

tumusudheer commented 3 years ago

Hi @tlikhomanenko ,

Thank you very much. Commands + names cover all the words in my transcriptions (train + dev + test sets), at least for now. Since I want to fork from the existing pre-trained base model, I used librispeech-train-all-unigram-10000.model generated from here to prepare my restricted lexicon (letter tokens and word entries for my names and commands).

A sample of my lexicon is as follows:

susana  _susan a
susana  _su s ana
susana  _s us ana
susana  _su sa na
susana  _su s an a
susana  _su s a na
susana  _s u s ana
susana  _ s us ana
susana  _s us an a
susana  _s us a na
susannah    _susan n ah
susannah    _susan na h
susannah    _susan n a h
susannah    _su s an n ah
susannah    _su s an na h
susannah    _s us an n ah
susannah    _s us an na h
susannah    _su sa n n ah
susannah    _su sa n na h
susannah    _s u s an n ah
susanne _susan ne
susanne _susan n e
susanne _su s an ne
susanne _s us an ne
susanne _su sa n ne
susanne _su s an n e
susanne _s u s an ne
susanne _s us an n e
susanne _ s us an ne
susanne _su s a n ne
susie   _su s ie
susie   _s us ie
susie   _su si e
susie   _s u s ie
susie   _su s i e
susie   _ s us ie
susie   _s us i e
susie   _s u si e
susie   _ s u s ie
susie   _s u s i e
suter   _su ter
suter   _s ut er
suter   _su t er
suter   _s u ter
suter   _su te r
suter   _ s ut er
suter   _ s u ter
suter   _s u t er
suter   _s ut e r
suter   _su t e r
sutter  _s ut ter
sutter  _su t ter
sutter  _su tte r
sutter  _ s ut ter
sutter  _s ut t er
sutter  _su t t er
sutter  _s u t ter
sutter  _s ut te r
sutter  _su t te r
sutter  _s u tte r
sutton  _s ut ton
sutton  _su t ton
sutton  _ s ut ton
sutton  _s u t ton
sutton  _s ut t on
sutton  _su t t on
sutton  _s ut to n
sutton  _su t to n
sutton  _ s u t ton
sutton  _ s ut t on
suzanne _su z an ne
suzanne _su za n ne
suzanne _su z an n e
suzanne _su za n n e
suzanne _s u z an ne
suzanne _s u za n ne
suzanne _su z a n ne
suzanne _s u z an n e
suzanne _ s u z an ne
suzanne _s u za n n e
suze    _su ze
suze    _su z e
suze    _s u ze
suze    _s u z e
suze    _ s u ze
suze    _ s u z e
suzette _su z ette
suzette _su ze tte
suzette _su z ett e
suzette _s u z ette
suzette _su z e tte
suzette _su ze t te
suzette _su z et te
suzette _s u ze tte
suzette _ s u z ette
suzette _su ze t t e

The above lexicon is prepared using librispeech-train-all-unigram-10000.model and the following function:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('/facebook/work/streaming_convnets/librispeech/model_dst/am/librispeech-train-all-unigram-10000.model')

def prepare_additional_lexicon_words(lexicon_words):
    # For each word, collect its n-best word-piece segmentations and format them
    # as lexicon spellings ("\u2581" is rewritten to "_" as in the lexicon file);
    # the word itself is prepended to each line when the lexicon file is written.
    to_file = []
    for nbest in [10]:
        for word in lexicon_words:
            wps = sp.NBestEncodeAsPieces(word, nbest)
            for wp in wps:  # the order matters for our training
                to_file.append(" ".join([w.replace("\u2581", "_") for w in wp]))
    return to_file

In approach 2, do you have all the words of your data in the lexicon or only the names and commands (or do the names and commands cover all words in your transcriptions)?

I made a mistake while working on approach 2 earlier. Now, using the restricted lexicon that contains only names + commands for both AM training (forking from the base model) and decoding, I'm getting better results with this approach (current WER ~35%) compared to the other approaches. I need to figure out how to improve the results further with this approach; I may try data augmentation with mixed-in noise, etc.

Questions regarding approach 2 (streaming_convnets + restricted lexicon with only names + commands):

Question 1:

This is how I prepare the LM for decoding: I took all transcriptions from the train + dev sets and used the following command: /facebook/kenlm/bin/lmplz --text combined_for_lm_train_dev.lst.pruned --arpa self_3-gram.arpa.lower -o 3 --prune 0 0 3 --discount_fallback I then use self_3-gram.arpa.lower for decoding.

./Decoder --flagsfile /facebook/work/streaming_convnets/decoder/decode_500ms_right_future_ngram_other.cfg --lm /facebook/work/streaming_convnets/for_lm/self_3-gram.arpa.lower --lmweight=0.55 --wordscore=0 --uselexicon=true --decodertype=wrd --lmtype=kenlm --silscore=0 --beamsize=500 --beamsizetoken=100 --beamthreshold=100 --nthread_decoder=8
--smearing=max --show --showletters

Please let me know if there are any steps I'm missing in preparing the LM, or how I can improve the LM preparation to get a better LM.

Question 2: Since I'm restricting the lexicon to only commands + names, will this model (streaming_convnets in approach 2) predict out-of-lexicon words? Say some test audio contains 'hey how are you doing', but these words are not present in my lexicon; will the model be able to decode them correctly, even if I use --uselexicon=false while decoding?

Question 3: Whenever new names or new commands are added, do I need to retrain the AM as well? Or would just adding them to the lexicon and retraining the LM be sufficient?

One thing to try is the recent robust ASR (RASR) model, which is letter-based (you need the latest flashlight to work with it); the tutorial on fine-tuning on your data is here: https://github.com/facebookresearch/flashlight/tree/master/flashlight/app/asr/tutorial (check the colab on fine-tuning).

I installed the new flashlight a couple of days ago. Sure, I'll try this one as well this week.

tumusudheer commented 3 years ago

Hi @tlikhomanenko ,

Thank you very much. I tried the first RASR model in the tutorial (Transformer with 70M parameters) and it worked really well. My accuracy is in the mid 80s, but that is because the test set contains a lot of noise that is not well represented in the training set; I'm working on this. I'll also try the other RASR models.

For LM preparation, I'm currently only creating an ARPA file from the train + dev transcriptions: /facebook/kenlm/bin/lmplz --text combined_for_lm_train_dev.lst.pruned --arpa self_3-gram.arpa.lower -o 3 --prune 0 0 3 --discount_fallback Using the above command, I'm preparing a 3-gram ARPA file. After this I can use build_binary to convert the ARPA file to binary. How do I optimize the LM, make use of the test set to improve it, or change some hyperparameters? Are there any other commands I need to use after lmplz and build_binary?

Q2: I would like to deploy ASR as a service or standalone code. For this, I need to change Decoder.cpp so that it can take the audio data from memory rather than from a .wav file as input. I guess I need to change this part so that the ds param can be prepared from audio_data that is in memory?

The flashlight documentation says the decoder supports online decoding here: "Decoders, except Seq2Seq decoder, are now supporting online decoding. It consumes small chunks of emissions of audio as input. At the time we want to have a look at the transcript so far, we may get the best transcript and prune the hypothesis space and keep decoding further."

Does it mean that Decoder.cpp in flashlight supports online decoding, or is there a separate online decoder file? Is there any example code available for doing inference with online decoding, or what needs to change in the decoder code to support online decoding?

Thank you

zxpan commented 3 years ago

Hi @tlikhomanenko, w.r.t. fl_asr_tutorial_finetune_ctc, are there any concerns with continuing/resuming the fine-tuning from the last run? The reason I am asking is that one fine-tuning run takes days (due to resource limitations), and I would like not to restart the fine-tuning from scratch with a different learning rate. I have not tried or configured the lr-decay parameter yet.

Thanks in advance.

... One thing to try is the recent robust ASR (RASR) model, which is letter-based (you need the latest flashlight to work with it); the tutorial on fine-tuning on your data is here: https://github.com/facebookresearch/flashlight/tree/master/flashlight/app/asr/tutorial (check the colab on fine-tuning).

tumusudheer commented 3 years ago

Hi @tlikhomanenko ,

Even for me, fl_asr_tutorial_finetune_ctc is taking a long time to run. My total list contains ~102K records (each 3 seconds long), and with a batch size of 8, each epoch takes about 1 hour to run, which is very decent. But when I ran another training with a training list of ~180K records (same batch size and same machine), each epoch took about 8 hours. I'm using am_transformer_ctc_stride3_letters_70Mparams. Let me know if there is something I need to change?

Thank you

tlikhomanenko commented 3 years ago

Hey, sorry for the delay! Here are my comments:

Please let me know if there are any steps I'm missing in preparing the LM, or how I can improve the LM preparation to get a better LM.

It is fine; you can improve it if you train a 4- or 5-gram LM or train without pruning (but the model size will be larger, so it depends on your memory limits). You also need to tweak the decoder parameters, like beam size and word score, because you are using another LM.

Question 2: Since I'm restricting the lexicon to only commands + names, will this model (streaming_convnets in approach 2) predict out-of-lexicon words? Say some test audio contains 'hey how are you doing', but these words are not present in my lexicon; will the model be able to decode them correctly, even if I use --uselexicon=false while decoding?

It won't, as long as you are using word-based decoding or a word-based LM. You then need to train a word-piece LM and use it with --decodertype=tkn and --uselexicon=false (when the LM is word-based, decoding always uses the lexicon).

Question 3: Whenever new names or new commands are added, do I need to retrain the AM as well? Or would just adding them to the lexicon and retraining the LM be sufficient?

Hard to say; it depends on how well your model generalizes and how strong an LM it learnt inside the acoustic model. But for a CTC model, extending only the lexicon and changing the LM should work.
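If you end up on the letter-based (CTC) setup discussed later in this thread, extending the lexicon for new names can be as simple as appending entries in the "word<TAB>l e t t e r s |" format shown further down. A minimal sketch, assuming lowercase ASCII names and "|" as the word-separator token:

#include <fstream>
#include <string>
#include <vector>

// Append letter-based lexicon entries ("word<TAB>l e t t e r s |") for new names.
void appendToLexicon(const std::string& lexiconPath,
                     const std::vector<std::string>& newWords) {
  std::ofstream out(lexiconPath, std::ios::app);
  for (const auto& word : newWords) {
    out << word << "\t";
    for (char c : word) {
      out << c << " ";
    }
    out << "|\n"; // "|" is the word-separator token of the letter-based model
  }
}

After that, the n-gram LM is rebuilt from transcripts that include the new names, as discussed above.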

Using the above command, I'm preparing a 3-gram ARPA file. After this I can use build_binary to convert the ARPA file to binary.

You can still use the ARPA file; ARPA and binary are equivalent, the binary is just about speeding up model loading. You can play with -o, the n-gram order (4, 5), and remove pruning. These are the only two params I often optimize.
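One way to compare such variants (3-gram vs. 4/5-gram, pruned vs. unpruned) is to score held-out dev transcripts with each candidate LM and keep the one with the higher log probability. A minimal sketch using KenLM's C++ query API (the ARPA path and the sentence are placeholders; KenLM returns log10 probabilities):

#include "lm/model.hh"

#include <iostream>
#include <sstream>
#include <string>

int main() {
  using namespace lm::ngram;
  Model model("self_3-gram.arpa.lower"); // candidate LM to evaluate
  const auto& vocab = model.GetVocabulary();

  std::string sentence = "call jessica borows"; // one dev-set transcript
  State state(model.BeginSentenceState()), out_state;
  std::istringstream words(sentence);
  std::string word;
  double total = 0.0;
  while (words >> word) {
    total += model.Score(state, vocab.Index(word), out_state); // log10 prob
    state = out_state;
  }
  total += model.Score(state, vocab.EndSentence(), out_state); // score </s>
  std::cout << "log10 p(sentence) = " << total << "\n";
  return 0;
}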

Q2: I would like to deploy ASR as a service or standalone code. For this, I need to change Decoder.cpp so that it can take the audio data from memory rather than from a .wav file as input. I guess I need to change this part so that

Yep, you need to change the data loading: if you have the audio in memory or load it in some other way, change this loading logic.
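For example, if the service receives 16-bit PCM in memory, only this loading step changes; the rest of the pipeline keeps consuming the same float vector that loadSound would have produced. A minimal sketch (the int16-to-float scaling is an assumption; match whatever normalization your feature pipeline expects):

#include <cstddef>
#include <cstdint>
#include <vector>

// Convert 16-bit mono PCM already held in memory into the float buffer the
// rest of the decoding pipeline expects, instead of reading a .wav file.
std::vector<float> pcmToFloat(const int16_t* pcm, size_t numSamples) {
  std::vector<float> audio(numSamples);
  for (size_t i = 0; i < numSamples; ++i) {
    audio[i] = static_cast<float>(pcm[i]) / 32768.0f; // scale to [-1, 1)
  }
  return audio;
}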

Does it mean that Decoder.cpp in flashlight supports online decoding, or is there a separate online decoder file? Is there any example code available for doing inference with online decoding, or what needs to change in the decoder code to support online decoding?

It does not call online decoding, but you can change it to do so: you need to call an additional decoder method during decoding, prune (https://github.com/facebookresearch/flashlight/blob/master/flashlight/lib/text/decoder/LexiconDecoder.h#L142), so you need to change the execution logic. An example of online decoding is in our inference code here https://github.com/facebookresearch/wav2letter/blob/a1c74f4d8c9c2ce6a127889f37f779dbfe36b937/recipes/streaming_convnets/inference/inference/examples/AudioToWords.cpp#L48 and here https://github.com/facebookresearch/wav2letter/blob/master/recipes/streaming_convnets/inference/inference/decoder/Decoder.cpp.
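For orientation, a minimal sketch of such a loop against the LexiconDecoder API, assuming the emissions are already computed per chunk (the EmissionChunk struct and the calling code are placeholders); compared to offline decoding, the only addition is the prune() call between chunks:

#include <vector>

#include <flashlight/lib/text/decoder/LexiconDecoder.h>

// One emission chunk: acoustic-model posteriors for T frames over N tokens.
struct EmissionChunk {
  std::vector<float> data;
  int T;
  int N;
};

// Feed pre-computed emission chunks to the decoder online. After each chunk we
// can read the best partial transcript and prune the hypothesis space so the
// beam does not grow without bound across chunks.
std::vector<fl::lib::text::DecodeResult> decodeOnline(
    fl::lib::text::LexiconDecoder& decoder,
    const std::vector<EmissionChunk>& chunks) {
  decoder.decodeBegin();
  for (const auto& chunk : chunks) {
    decoder.decodeStep(chunk.data.data(), chunk.T, chunk.N);
    auto partial = decoder.getBestHypothesis(); // transcript so far, if needed
    (void)partial;
    decoder.prune(); // keep only hypotheses within the beam before the next chunk
  }
  decoder.decodeEnd();
  return decoder.getAllFinalHypothesis();
}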

W.r.t. fl_asr_tutorial_finetune_ctc, are there any concerns with continuing/resuming the fine-tuning from the last run? The reason I am asking is that one fine-tuning run takes days (due to resource limitations), and I would like not to restart the fine-tuning from scratch with a different learning rate. I have not tried or configured the lr-decay parameter yet.

There should not be; the only issue is the optimizer. We used adagrad + decaying lr, so if you continue, it probably will not learn at all, as the lr is small and the accumulated momentum is huge. You can simply fork from the model and use any optimizer you want with a small lr; that should work too.

Even for me, fl_asr_tutorial_finetune_ctc is taking a long time to run. My total list contains ~102K records (each 3 seconds long), and with a batch size of 8, each epoch takes about 1 hour to run, which is very decent. But when I ran another training with a training list of ~180K records (same batch size and same machine), each epoch took about 8 hours. I'm using am_transformer_ctc_stride3_letters_70Mparams. Let me know if there is something I need to change?

Do you have a larger average duration/target size for the second list? If yes, then this is causing the slowdown. If you are trying on colab, try using the main Train.cpp binary with --batching_strategy=dynamic --batching_max_duration=??? which will use dynamic batching, so you can probably pack more. Or even check whether with 3s audio you can increase the batch size from 8 to 10-16. Also, with the main Train.cpp you can try AMP training, which can also give you a speedup.

Also, can you post your log here? I want to have a look at the timings you have.

Best.

tumusudheer commented 3 years ago

Hi @tlikhomanenko ,

Please find the logs attached here to debug the timing issue. My set1 has a max audio file length of ~6s and an average length of ~5.2 seconds, while set2's max audio file length is 7.5 seconds and the average is about 7 seconds. 001_log.txt 001_config.txt

Thank you very much. I'm currently using the RASR recipe (Transformer with the 70M-parameter model) as it gives better accuracy (~81% from the AM and ~88% with AM + language model), so I'll put aside streaming_convnets for this use case. Also, thank you for the suggestions on converting the decoder code to support online streaming. I'll start working on it and will post here if I have any questions.

As I was analyzing the cases where the model gets bad results, I found a couple of interesting patterns. These are after decoding (AM + decoding results):

Instance 1:

Ground truth: "call jessica borows"
AM Prediction: "call jessica bars"
Final AM+Decoding Prediction: "call jessica"

The data used for the LM contains "call jessica borows" three times, but the last name got dropped here. Similarly,

Instance 2:

Ground truth: "call antoinette chappel"
AM Prediction: "call antonet chappel"
Final AM+Decoding Prediction: "call chappel"

The data used for the LM contains "call antoinette chappel" four times, but the first name got dropped here.

The command I used for LM prep: ./kenlm/bin/lmplz --text combined_for_lm_train_dev.lst.pruned --arpa self_3-gram.arpa.lower -o 3 --discount_fallback And the decoding flags file is here:

--am=/data/Self/research/facebook/work/streaming_convnets/research_data/trail_training/phase_2/new_flashlight/run_0127_1/001_model_iter_003.bin
--datadir=/data/Self/research/facebook/work/streaming_convnets/research_data/trail_training/
--test=test_order.lst
--maxload=-1
--nthread_decoder=8
--tokens=/data/Self/research/facebook/work/flashlight_0118/pretrained-models/tutorial/tokens.txt
--lexicon=/data/Self/research/facebook/work/streaming_convnets/research_data/trail_training/phase_2/new_flashlight/letter_tokens_lexicon/lexicon.txt
--uselexicon=true
--lm=/data/Self/research/facebook/work/streaming_convnets/research_data/trail_training/for_lm/self_3-gram.arpa.lower
--lmtype=kenlm
--beamsize=500
--beamsizetoken=30
--beamthreshold=100
--smearing=max
--lmweight=2
--wordscore=0
--eosscore=0
--silscore=0
--unkscore=0
--show
--showletters

I got the same issues (either the first name or the last name missing) with beamsize=1000 + beamsizetoken=60 as well. Is there any way to fix this in the LM or the decoder?

If you are trying on colab, try using the main Train.cpp binary with --batching_strategy=dynamic --batching_max_duration=??? which will use dynamic batching, so you can probably pack more

Currently I'm using fl_asr_tutorial_finetune_ctc. I guess I can do the same with Train.cpp using the fork option? I'll try these options as well as AMP training to get better speed.

Thank you,

tlikhomanenko commented 3 years ago

Great to hear your progress!

I got the same issues (either the first name or the last name missing) with beamsize=1000 + beamsizetoken=60 as well. Is there any way to fix this in the LM or the decoder?

It looks like these words are missing from the lexicon. Could you recheck that they are present there? Also, you can use --showletters to check what the letter-based prediction was (before combining into words).

About the logs: could you attach both logs, the fast one (set1) and the slow one (set2)? The log you sent looks good; one thing I noticed is that 1000 updates took 3:40 min for forward/backward (which is fine), but you then run evaluation, which took ~40 min (see the timestamps, while "runtime" is your data load + fwd + bwd time for 1000 updates). So you can use, for example, --reportiters=10000 to spend less time on validation.

Currently I'm using fl_asr_tutorial_finetune_ctc. I guess I can do the same with Train.cpp using the fork option? I'll try these options as well as AMP training to get better speed.

Yep. fl_asr_tutorial_finetune_ctc is a simpler version of Train.cpp without the advanced features, so that people can more easily start their training in the tutorial.

tumusudheer commented 3 years ago

Hi @tlikhomanenko ,

Thank you very much.

It looks like these words are missing from the lexicon. Could you recheck that they are present there? Also, you can use --showletters to check what the letter-based prediction was (before combining into words).

Here is the letter-based prediction:
|T|: call jessica borows
|P|: call jessica
|t|: c a l l | j e s s i c a | b o r o w s
|p|: c a l l | j e s s i c a | b a s

And the lexicon contains the word "borows", which is part of the ground truth ("call jessica borows"):

borows  b o r o w s |
borowski        b o r o w s k i |
basile  b a s i l e |
bass    b a s s |
bascoe  b a s c o e |
bassett b a s s e t t |

Similarly, for instance 2:

|t|: c a l l | a n t o i n e t t e | c h a p p e l
|p|: c a l l | a n t o n e t | c h a p p e l

And the lexicon contains

antonia a n t o n i a |
antoinette      a n t o i n e t t e |
antonio a n t o n i o |

About the logs: could you attach both logs, the fast one (set1) and the slow one (set2)?

Please find the logs here: set1_001_log.txt set1_001_conf.txt

I noticed that for every 1000 iterations, set 1 takes ~4 to 5 minutes, whereas set 2 was taking ~45 minutes.

The log you sent looks good; one thing I noticed is that 1000 updates took 3:40 min for forward/backward (which is fine), but you then run evaluation, which took ~40 min (see the timestamps, while "runtime" is your data load + fwd + bwd time for 1000 updates).

I've noticed my dev set has a lot more records than the train set, which is due to a mistake; usually it should be ~5% of the train set. I'll correct this. Thank you very much.

Thank you,

tlikhomanenko commented 3 years ago

About decoding: you can see that it produces a word with similar pronunciation in the letter prediction, and because you are using lexicon-based decoding and you don't have "bas" and "antonet" in the lexicon, we convert these unknown words to empty output, as it will not affect the WER computation. You can tweak the output by converting the token-based output into words, or by tweaking Decode.cpp to use https://github.com/facebookresearch/flashlight/blob/master/flashlight/app/asr/Decode.cpp#L666 even when a lexicon is given.
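For reference, a minimal sketch of the "convert token-based output into words" option, reusing the same helpers that appear in the inference snippet later in this thread (tknPrediction2Ltr, tkn2Wrd); the criterion, surround, and word-separator values are assumptions that need to match your model config:

// rawTokenPrediction: token indices from the decoder's best hypothesis.
auto letterPrediction = fl::app::asr::tknPrediction2Ltr(
    rawTokenPrediction,
    tokenDict,
    fl::app::asr::kCtcCriterion,
    "" /* surround (assumed) */,
    false /* eostoken */,
    0 /* replabel */,
    false /* usewordpiece */,
    "|" /* wordseparator (assumed) */);
// Rebuild words directly from letters, so OOV outputs like "bas" or "antonet"
// are kept in the transcript instead of being mapped to empty output.
auto wordPrediction = fl::app::asr::tkn2Wrd(letterPrediction, "|");
auto wordPredictionStr = fl::lib::join(" ", wordPrediction);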

tumusudheer commented 3 years ago

Hi @tlikhomanenko ,

Thank you very much. So if I use --uselexicon=false in the decoding step, then I'll get these kinds of close-sounding words for ground-truth words that are not present in the lexicon?

Or should I use --uselexicon=true in the decoding step, but in https://github.com/facebookresearch/flashlight/blob/master/flashlight/app/asr/Decode.cpp#L661 make the if condition always false, so that I get wordPrediction = tkn2Wrd(letterPrediction, FLAGS_wordseparator); as the final output?

tlikhomanenko commented 3 years ago

--uselexicon=false will use lexicon-free decoding, so you need to have a token-level LM. For your current use case, please first try making the if condition always false so that wordPrediction = tkn2Wrd(letterPrediction, FLAGS_wordseparator); is the final output.

tumusudheer commented 3 years ago

Hi @tlikhomanenko ,

Thank you very much. Just a quick question: how is silscore used? Is it something related to silence (a silence score) at the beginning and end of the audio file? We are passing 0 at decoding time; I just wanted to check what the appropriate value is.

Also, for the streaming example, I'm using the following code, adapted from the code given in InferenceCTC.cpp.

For now, I'm reading the data in chunks (each chunk 2 seconds) and running the forward pass:

std::vector<float> audio = fl::app::asr::loadSound<float>(input_audio_file_path.c_str());
decoder->decodeBegin();
int buff_size = 1600 * 20; // 1600 samples = 100 ms at 16 kHz; 1600*20 is a 2-second chunk
size_t i = 0;
while (i < audio.size()) {
  int limit = ((audio.size() - i) > buff_size) ? buff_size : (audio.size() - i);
  std::vector<float> audio_chunk_data; // buffer holding the current chunk
  for (int j = 0; j < limit; j++) {
    audio_chunk_data.push_back(audio[i + j]);
  }
  i += limit;

  { // Block to work around the minimum chunk size, I guess?
    if (limit <= 1600) {
      std::cout << "Overriding chunk size " << limit << "\n";
      int min_limit = 1600;
      for (int k = limit; k < min_limit; k++) {
        audio_chunk_data.push_back(0.0f); // zero-pad up to the minimum size
      }
      limit = min_limit;
      std::cout << "New limit  " << limit << "\n";
    }
  }

  af::array input = inputTransform(
      static_cast<void*>(audio_chunk_data.data()), af::dim4(1, limit), af::dtype::f32);
  auto inputLen = af::constant(input.dims(0), af::dim4(1));
  auto rawEmission =
      fl::ext::forwardSequentialModuleWithPadMask(fl::input(input), network, inputLen);
  auto emission = fl::ext::afToVector<float>(rawEmission);

  decoder->decodeStep(emission.data(), rawEmission.dims(1), rawEmission.dims(0));

  fl::lib::text::DecodeResult rawResult = decoder->getBestHypothesis(true);

  // Take the top hypothesis so far and clean up predictions
  auto rawWordPrediction = rawResult.words;
  auto rawTokenPrediction = rawResult.tokens;
}

decoder->decodeEnd();
const auto& result = decoder->getAllFinalHypothesis();

// Take the top hypothesis and clean up predictions
auto rawWordPrediction = result[0].words;
auto rawTokenPrediction = result[0].tokens;

#if 1

    auto letterPrediction = fl::app::asr::tknPrediction2Ltr(
            rawTokenPrediction,
            tokenDict,
            fl::app::asr::kCtcCriterion,
            networkFlags["surround"],
            false /* eostoken */,
            0 /* replabel */,
            false /* usewordpiece */,
            networkFlags["wordseparator"]);

    std::vector<std::string> wordPrediction;
    wordPrediction = fl::app::asr::tkn2Wrd(letterPrediction, networkFlags["wordseparator"]);
    auto wordPredictionStr = fl::lib::join(" ", wordPrediction);
    printf("%s --> FINAL OUTPUT {%s}\n",input_audio_file_path.c_str(), wordPredictionStr.c_str());
#endif

Please let me know if I'm doing anything wrong. If I keep the chunk size (buff_size) small (say 1600, which is 100 milliseconds, or 1600*10, which is a 1-second chunk), I get wrong results, so I'm using a 2-second chunk, which is 1600*20. Is there a way I can keep a lower chunk size and still get good results?

Also, in the last iteration, if the leftover (last) chunk is about 320 samples (maybe 20 milliseconds), I get a dim4 exception on this line: af::array input = inputTransform(static_cast<void*>(audio_chunk_data.data()), af::dim4(1, limit), af::dtype::f32);

terminate called after throwing an instance of 'af::exception'
  what():  ArrayFire Exception (Invalid input size:203):
In function af::dim4 verifyDims(unsigned int, const dim_t*)
In file src/api/c/data.cpp:39
Invalid dimension for argument 2
Expected: dims[i] >= 1

so I thought I needed a minimum chunk size of 1600 (100 milliseconds) and padded that chunk with 0s at the end:

if (limit <= 1600) {
  std::cout << "Overriding chunk size " << limit << "\n";
  int min_limit = 1600;
  for (int k = limit; k < min_limit; k++) {
    audio_chunk_data.push_back(0.0f); // zero-pad up to the minimum size
  }
  limit = min_limit;
  std::cout << "New limit  " << limit << "\n";
}

Thank you

tlikhomanenko commented 3 years ago

I had a quick look at your code; it looks fine to me. @xuqiantong, any comment on the decoder?

What model are you using? If a Transformer, then I don't really expect that small chunks will work, as you have very little context and the model is not trained in this streaming way. Why do you want less than 2 seconds; is 2 seconds too large?

For your error, it is really not clear why it occurs. I would debug exactly which ArrayFire array creation is crashing, and whether it crashes because one of the array dims is 0, so you know what to check and what to search for.
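One way to make that failure impossible, regardless of where exactly it crashes, is to never hand the featurizer a near-empty tail: buffer incoming samples and run the forward pass only once a minimum amount has accumulated, padding just the final flush. A minimal sketch, assuming 16 kHz audio and a 100 ms lower bound (both assumptions to verify against your feature and stride settings); runForwardAndDecodeStep stands for the loop body posted above:

#include <cstddef>
#include <vector>

const size_t kMinSamples = 1600; // ~100 ms at 16 kHz (assumed safe lower bound)

// Accumulate samples between reads and decode only full-sized chunks; the last
// short tail is zero-padded once, on the final flush, instead of being sent
// through on its own with a tiny (possibly zero after striding) length.
void pushSamples(std::vector<float>& pending,
                 const std::vector<float>& incoming,
                 bool isLastChunk,
                 void (*runForwardAndDecodeStep)(const std::vector<float>&)) {
  pending.insert(pending.end(), incoming.begin(), incoming.end());
  if (pending.size() >= kMinSamples || (isLastChunk && !pending.empty())) {
    if (pending.size() < kMinSamples) {
      pending.resize(kMinSamples, 0.0f); // zero-pad the final short chunk
    }
    runForwardAndDecodeStep(pending);
    pending.clear();
  }
}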