Decoding now yielding duplicate words

LukeNotable commented 3 years ago

Bug Description

Starting with the recent commit b1d1f89, I'm seeing duplicated words in the inference output.

Using simple_streaming_asr_example with some of our data, for example:

Before:

#start (msec), end(msec), transcription
0,1000,
1000,2000,history of
2000,2832,of present illness

After:

#start (msec), end(msec), transcription
0,1000,
1000,2000,history of
2000,2832,of present present illness illness

Digging a little deeper, I can see that the two instances of "present" and the two instances of "illness" each refer to different start/end frames.
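For anyone who wants to check their own output for the same symptom, here is a minimal sketch that flags words repeated back-to-back within a chunk. It is not part of wav2letter: the function name find_adjacent_duplicates is made up for illustration, and it assumes each line has the "start,end,transcription" layout shown above.

```python
def find_adjacent_duplicates(lines):
    """Return (start_msec, end_msec, word) for each word that immediately repeats.

    Assumes each non-comment line looks like "start,end,transcription".
    """
    hits = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and the "#start (msec), ..." header.
        if not line or line.startswith("#"):
            continue
        # Split on the first two commas only, so the transcription stays whole.
        start, end, text = line.split(",", 2)
        words = text.split()
        # Compare every word with the one right after it.
        for prev, cur in zip(words, words[1:]):
            if prev == cur:
                hits.append((int(start), int(end), cur))
    return hits


before = """#start (msec), end(msec), transcription
0,1000,
1000,2000,history of
2000,2832,of present illness
""".splitlines()

after = """#start (msec), end(msec), transcription
0,1000,
1000,2000,history of
2000,2832,of present present illness illness
""".splitlines()

print(find_adjacent_duplicates(before))
print(find_adjacent_duplicates(after))
```

This only compares neighbouring words inside one chunk, so the "of" that appears at the end of one chunk and the start of the next (present in both outputs) is not flagged. A speaker who genuinely repeats a word would show up as a false positive, so treat the result as a hint rather than proof.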

Reproduction Steps

I noticed this when comparing the current Docker image of inference-latest (built four months ago) with a local build of the same image from v0.2. Reverting b1d1f89 fixes the problem.

I see the difference when running simple_streaming_asr_example, and also with our own fuller-featured variation of it, which reports the frames for each word. Both runs use our models and audio files, and the problem recurs across many different audio files. If it helps, I may be able to share the model and audio separately.

Platform and Hardware

Using the Docker container on macOS Big Sur, no GPU.

tlikhomanenko commented 3 years ago

Thanks for reporting this issue. The commit you pointed to fixed a problem in standard CTC model decoding for models trained with surround. We will investigate how inference should be adjusted to account for that fix, and will update here once the problem is solved.

vineelpratap commented 3 years ago

As a temporary fix, you can manually revert the changes in https://github.com/facebookresearch/flashlight/commit/ce02babd2f413643bb4ba7064827f4404ed2758e and then rebuild.

It should solve the issue.

flashlight / wav2letter