alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

Different results between WAVE and FFMPEG reading audio file in Python #1286

Open tienanh28122000 opened 1 year ago

tienanh28122000 commented 1 year ago

Hi everyone, I've found that changing the method used to read the audio file (from WAVE to FFMPEG) changes the WER dramatically. When I use WAVE to read the audio files (4000 utts), the WER is 9.64%. But when I use FFMPEG instead, the WER decreases to 4.77%. Can you explain why the difference exists? Btw, I've printed the data returned by each method when reading the audio (process.stdout.read(4000) and data = wf.readframes(4000)) and the results were different.

nshmyrev commented 1 year ago

ffmpeg modifies the audio: it applies dither. That might cause a difference in results, but not that big, only a fraction of a percent. If your WER changes that much, your model is probably not good enough.

nshmyrev commented 1 year ago

Also, readframes(4000) is equivalent to read(8000), since a frame is 2 bytes.
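
A quick way to check the frame/byte relationship is sketched below. It assumes 16-bit mono PCM (which is what a 2-byte frame implies); the helper names `make_wav` and `chunk_size_in_bytes` are made up for illustration, and the WAV is generated in memory rather than taken from the dataset discussed here.

```python
import io
import wave


def make_wav(num_frames, sample_rate=16000, channels=1, sample_width=2):
    """Build an in-memory PCM WAV filled with silence (illustrative helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)  # 2 bytes per sample = 16-bit audio
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00" * (num_frames * channels * sample_width))
    buf.seek(0)
    return buf


def chunk_size_in_bytes(wav_file, frames):
    """Return (bytes per frame, number of bytes returned by readframes(frames))."""
    with wave.open(wav_file, "rb") as wf:
        # One frame holds one sample for every channel.
        bytes_per_frame = wf.getsampwidth() * wf.getnchannels()
        data = wf.readframes(frames)
    return bytes_per_frame, len(data)


if __name__ == "__main__":
    bytes_per_frame, n = chunk_size_in_bytes(make_wav(10000), 4000)
    print(bytes_per_frame, n)
```

So to compare the two readers chunk by chunk, the pipe would need read(8000) for every readframes(4000). With stereo or 8-bit files the ratio would be different.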

tienanh28122000 commented 1 year ago

I've found the param that makes the results different. The reason is: `subprocess.Popen(["ffmpeg", "-loglevel", "quiet", "-i", path, "-ar", str(SAMPLE_RATE), "-ac", "1", "-f", "s16le", "-"], stdout=subprocess.PIPE)`. If I delete the str(SAMPLE_RATE) param, the results from the 2 methods are equal. But if I add str(SAMPLE_RATE) to that call, the WER drops dramatically. Can you explain why SAMPLE_RATE has such a big effect in this situation? Thank you very much!
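
One thing that may be worth checking here, as a sketch only: ffmpeg's `-ar` option sets the output sample rate, so ffmpeg resamples when the file's own rate is different, while wave's readframes returns the samples at whatever rate the file was stored in. Whether the files in this dataset actually differ from SAMPLE_RATE is not confirmed in the thread. The value 16000 and the helper names `ffmpeg_command` and `needs_resampling` are assumptions for illustration.

```python
import io
import wave

SAMPLE_RATE = 16000  # assumed value; use the rate your model expects


def ffmpeg_command(path, sample_rate=None):
    """Build the ffmpeg argument list from the thread, with or without -ar."""
    cmd = ["ffmpeg", "-loglevel", "quiet", "-i", path]
    if sample_rate is not None:
        # -ar sets the output sample rate; ffmpeg resamples if the input differs.
        cmd += ["-ar", str(sample_rate)]
    cmd += ["-ac", "1", "-f", "s16le", "-"]
    return cmd


def needs_resampling(wav_file, target_rate):
    """True if the WAV's stored sample rate differs from target_rate."""
    with wave.open(wav_file, "rb") as wf:
        return wf.getframerate() != target_rate


if __name__ == "__main__":
    # Demo on an in-memory 8 kHz file; replace with a real path to test your data.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(8000)
        wf.writeframes(b"\x00" * 1600)
    buf.seek(0)
    print(needs_resampling(buf, SAMPLE_RATE))
    print(ffmpeg_command("audio.wav", SAMPLE_RATE))
```

If `needs_resampling` returns True for the test files, then the WAVE path and the ffmpeg path without `-ar` would both feed the recognizer audio at the file's native rate, and only the `-ar` path would deliver audio at SAMPLE_RATE.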