flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

online decoder results depend on stream data length #344

Closed goodmajia closed 5 years ago

goodmajia commented 5 years ago

I use a wav file to simulate online decoding, as follows:

decoder.decodeBegin()
while (stream)
    some_emission = net->forward(someData)
    decoder.decodeStep(some_emission)
    decoder.getBestHypothesis()
    decoder.prune()
decoder.decodeEnd()

But different lengths of "someData" give different decoding results; in particular, when the length is less than 1s (16000 samples of 16kHz wav), the result is very bad.

As I understand it, the result should be independent of the chunk length. Am I wrong? Or what should I do if I want to get the same, consistently good result in online streaming decoding?

lunixbochs commented 5 years ago

You're missing a step in your pseudocode - where do you run the network forward pass?

The network has a ~1700ms receptive field, so using audio input much shorter than that will give worse results. You should use a voice activity detector and feed the audio in all at once, as I demonstrated here: https://github.com/facebookresearch/wav2letter/issues/327
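
Roughly what I mean, as a sketch (none of these helpers are wav2letter APIs; `readFrame`, `isSpeech`, and `runEmissionAndDecode` are placeholders for the audio source, a real VAD, and the forward-pass-plus-decode from your pseudocode):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Placeholder: pull the next 10 ms of audio from the capture stream.
// Returns false when the stream ends.
bool readFrame(std::vector<float>& frame) { /* fill from mic/socket */ return false; }

// Crude energy-based stand-in for a VAD; use a real VAD (e.g. WebRTC VAD) in practice.
bool isSpeech(const std::vector<float>& frame, float threshold = 1e-3f) {
  double energy = 0.0;
  for (float s : frame) energy += double(s) * s;
  return frame.empty() ? false : (energy / frame.size()) > threshold;
}

// Placeholder for net->forward(...) over the whole utterance followed by
// decodeBegin / decodeStep / decodeEnd as in the pseudocode above.
std::string runEmissionAndDecode(const std::vector<float>& utterance) { return ""; }

int main() {
  const size_t kSampleRate = 16000;
  const size_t kHangoverFrames = 30; // ~300 ms of trailing silence ends the utterance

  std::vector<float> frame(kSampleRate / 100); // 10 ms VAD frames
  std::vector<float> utterance;
  size_t silentFrames = 0;
  bool inSpeech = false;

  while (readFrame(frame)) {
    if (isSpeech(frame)) {
      inSpeech = true;
      silentFrames = 0;
    } else if (inSpeech) {
      ++silentFrames;
    }
    if (inSpeech) {
      utterance.insert(utterance.end(), frame.begin(), frame.end());
    }
    // Speaker stopped: run the network over the whole utterance at once
    // instead of decoding small chunks.
    if (inSpeech && silentFrames >= kHangoverFrames) {
      runEmissionAndDecode(utterance);
      utterance.clear();
      inSpeech = false;
      silentFrames = 0;
    }
  }
  return 0;
}
```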

vineelpratap commented 5 years ago

FWIW, we generally use a 500ms chunk size, which has been working well for us.

lunixbochs commented 5 years ago

In the case of my small model, I can run the forward pass in ~100ms on CPU, so my latency will be lower than with 500ms chunks (where the expected minimum latency would be 250ms, since on average the utterance ends halfway through a chunk, plus the network cost), unless your utterance is extremely long (network latency goes up with input length, though I don't think entirely linearly).

Don’t raw 500ms chunks also discard information at the borders of the frame the network would otherwise use to make predictions?

I think that, due to the parallel nature of convolution and the fact that the input is padded with 170ms on either side, it's also moderately faster to invoke the network once against a larger chunk of audio than to run it repeatedly on small chunks.
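As a rough illustration using the numbers above, and treating the 170ms padding as a fixed per-invocation overhead: a 4s utterance fed as eight 500ms chunks gives the network 8 × (500 + 2 × 170) = 6720ms of padded input, while a single pass over the whole thing is 4000 + 2 × 170 = 4340ms, before counting any other per-call cost.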

I’ve had much better performance and more consistent accuracy with a VAD than with any streaming chunk method I tried, even with windowing. Even if you’re using a stream to provide a prediction as the user talks, I’d still consider running a final acoustic model pass against the whole utterance audio for better accuracy.

goodmajia commented 5 years ago

@lunixbochs I have revised the pseudocode; the network forward pass is inside the while loop. Thank you for your good advice!
BTW, if I don't have a VAD, do you have any suggestions for getting good results in streaming decoding when each arriving stream chunk is less than 500ms?

lunixbochs commented 5 years ago

Buffer it until you have 500ms.
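
For example (a sketch, assuming 16 kHz mono float samples; `onAudio` and `processChunk` are placeholder names, with `processChunk` standing in for the body of the while loop in your pseudocode):

```cpp
#include <cstddef>
#include <vector>

// Placeholder for one iteration of the decode loop in the pseudocode above:
// some_emission = net->forward(chunk); decoder.decodeStep(some_emission);
// decoder.getBestHypothesis(); decoder.prune();
void processChunk(const std::vector<float>& chunk) { /* ... */ }

// Call this each time a (possibly much shorter than 500 ms) piece of audio arrives.
void onAudio(const std::vector<float>& incoming, std::vector<float>& buffer) {
  const size_t kChunkSamples = 8000; // 500 ms at 16 kHz
  buffer.insert(buffer.end(), incoming.begin(), incoming.end());
  while (buffer.size() >= kChunkSamples) {
    std::vector<float> chunk(buffer.begin(), buffer.begin() + kChunkSamples);
    processChunk(chunk);
    buffer.erase(buffer.begin(), buffer.begin() + kChunkSamples);
  }
  // At end of stream, flush whatever is left in `buffer` with one last
  // processChunk() call, then call decoder.decodeEnd().
}
```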

goodmajia commented 5 years ago

@lunixbochs thanks. Would it work if, each time, I concatenate the current stream chunk with the previous chunk to get a chunk larger than 500ms, feed that to the network, and then take only the emissions related to the current chunk?
If that can work, it would mean lower latency for users, if I understand correctly.
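
To make the question concrete, something like this is what I have in mind (just a sketch; `forward` is a placeholder for net->forward, and `kSamplesPerFrame` is an assumed fixed number of input samples per emission frame, which depends on the network's stride):

```cpp
#include <cstddef>
#include <vector>

using Emissions = std::vector<std::vector<float>>; // one score vector per output frame

// Placeholder for net->forward over a window of raw audio.
Emissions forward(const std::vector<float>& window) { return {}; }

// Run the network on [previous chunk + current chunk] so the current chunk has
// enough left context, then keep only the emission frames that line up with the
// current chunk and pass those to decoder.decodeStep().
Emissions emissionsForCurrentChunk(const std::vector<float>& prevChunk,
                                   const std::vector<float>& curChunk,
                                   size_t kSamplesPerFrame) {
  std::vector<float> window(prevChunk);
  window.insert(window.end(), curChunk.begin(), curChunk.end());

  Emissions all = forward(window);

  // Drop the frames belonging to the previous (already decoded) chunk.
  size_t skip = prevChunk.size() / kSamplesPerFrame;
  if (skip > all.size()) skip = all.size();
  return Emissions(all.begin() + skip, all.end());
}
```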