freewym / espresso

Espresso: A Fast End-to-End Neural Speech Recognition Toolkit

Problem with Long Utterances for MALACH Corpus #49

Closed · picheny-nyu closed this issue 3 years ago

picheny-nyu commented 3 years ago

I am trying to use Espresso to decode the MALACH corpus. One characteristic of MALACH is that the training utterances are all short (< 8 seconds on the whole), but the test data contains a significant number of long utterances (> 20 seconds). I am observing that on these long utterances the model produces decent output for the first 5-6 seconds, deteriorates rapidly thereafter, puts out some repeated words, and then stops decoding, resulting in many deletions. This is for a transformer model based on the wsj recipe. MALACH has about 160 hours of training data. I would welcome some suggestions/help here - it almost looks like some parameter setting would fix things.

Thanks,
Michael

freewym commented 3 years ago

Are you using external LM shallow fusion for decoding? Shallow fusion tends to cause this kind of problem. See whether the deletion errors are reduced without shallow fusion.
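
For context, a minimal sketch of what shallow fusion does at each beam-search step (illustrative only, not Espresso's actual code; the function name and the `lm_weight` value are hypothetical):

```python
import torch

def shallow_fusion_logprobs(asr_logprobs: torch.Tensor,
                            lm_logprobs: torch.Tensor,
                            lm_weight: float = 0.5) -> torch.Tensor:
    """Interpolate the E2E model's per-token log-probabilities with an
    external LM's log-probabilities, as shallow fusion does at each
    decoding step. On utterances far longer than any seen in training,
    the LM term can distort EOS/token scores enough to cause early
    stopping and deletions. `lm_weight` is a hypothetical value."""
    return asr_logprobs + lm_weight * lm_logprobs
```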

Anyway, I think the length mismatch between training and decoding is the cause. There are several works in the literature that try to mitigate this, e.g.:

https://arxiv.org/pdf/1911.02242.pdf
https://arxiv.org/pdf/1910.11455.pdf

picheny-nyu commented 3 years ago

OK, I will turn it off. Thanks for the pointers. Do you have any plans to implement these? Or, if you point me to the appropriate modules and give me some high-level instructions, maybe I will try it myself. I assume the second paper (on streaming RNN-Ts) is less relevant?

freewym commented 3 years ago

In order to implement the first paper, I think you would need to modify espresso/data/asr_dataset.py (or add a new dataset class) to chop utterances into overlapping segments, and then modify espresso/speech_recognize.py to merge the hypotheses from all the segments within a long utterance. A rough sketch of the idea is below.
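
Here is a sketch of that segment-and-merge idea (hypothetical helper names, not existing Espresso code; a real implementation would hook into Espresso's dataset and decoder classes and use the overlap-aware merging described in the first paper):

```python
import numpy as np

def segment_features(feats: np.ndarray, seg_len: int, overlap: int):
    """Split a (T, D) feature matrix into segments of at most `seg_len`
    frames, with `overlap` frames shared between consecutive segments."""
    assert 0 <= overlap < seg_len
    hop = seg_len - overlap
    segments, start = [], 0
    while True:
        segments.append(feats[start:start + seg_len])
        if start + seg_len >= feats.shape[0]:
            break
        start += hop
    return segments

def merge_hypotheses(hyps):
    """Concatenate per-segment word lists, collapsing the longest stretch
    of words duplicated across each segment boundary (a crude stand-in
    for the paper's overlap-aware merging)."""
    merged = []
    for words in hyps:
        k = 0
        for cand in range(min(len(merged), len(words)), 0, -1):
            if merged[-cand:] == words[:cand]:
                k = cand
                break
        merged.extend(words[k:])
    return merged
```

E.g., chop each long utterance with something like `segment_features(feats, seg_len=600, overlap=100)` (frame counts are illustrative), decode each segment independently, and pass the per-segment word lists to `merge_hypotheses`.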

picheny-nyu commented 3 years ago

Thanks. What about the attention aspects (forcing monotonic attention)?

freewym commented 3 years ago

Maybe you can use https://github.com/freewym/espresso/tree/master/examples/simultaneous_translation/modules as a reference.
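
The inference-time rule those monotonic attention modules enforce is simple to state; here is a minimal sketch of one hard monotonic attention step in the style of Raffel et al. (2017), not a copy of the code in that directory:

```python
import torch

def hard_monotonic_step(p_choose: torch.Tensor, prev_idx: int) -> int:
    """One decoder step of hard monotonic attention: scan encoder frames
    left-to-right from the previously attended index and attend to the
    first frame whose selection probability (a sigmoid of the monotonic
    energy) exceeds 0.5, so attention can never jump backwards.

    p_choose: 1-D tensor of per-frame selection probabilities for this
    decoder step; prev_idx: frame index attended at the previous step."""
    for j in range(prev_idx, p_choose.numel()):
        if p_choose[j].item() >= 0.5:
            return j
    return p_choose.numel() - 1  # nothing selected: stay on the last frame
```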

picheny-nyu commented 3 years ago

OK, I turned off shallow fusion, but it still stops decoding after about 8-10 seconds on the longer utterances. The WER is about 60%. Note that Kaldi decoding with a hybrid TDNN for this corpus gives about 24%. Are there any other parameters to work with before I have to resort to more extreme measures?

picheny-nyu commented 3 years ago

Here is a typical long-utterance attention plot, in case it suggests something.

00103202-0137567.pdf

freewym commented 3 years ago

OK. I think there is no obvious way to avoid this issue without specially designed algorithms.