Open LukeNotable opened 3 years ago
Thanks for reporting this issue. The commit you pointed to fixed the problem for standard CTC model decoding when we train it with surround. We will investigate how inference should be adjusted in light of that fix, and will update here once the problem is solved.
As a temporary fix, you can manually revert the changes in https://github.com/facebookresearch/flashlight/commit/ce02babd2f413643bb4ba7064827f4404ed2758e and rebuild.
That should solve the issue.
Bug Description
Starting with the recent commit b1d1f89, I'm seeing duplicate words from inference.
Using the simple_streaming_asr_example with some of our data, for example:
Before:
After:
Digging a little deeper, I can see that the two instances of "present" and "illness" each refer to different start/end frames.
Reproduction Steps
I noticed this when comparing the current Docker image of inference-latest (built four months ago) with a local build of the same image from v0.2. Reverting that commit fixes the problem.
I'm seeing the difference when running simple_streaming_asr_example (along with our own fuller-featured variation of it, which shows the frames for each word) using our models and audio files, and the same problem recurs across many different audio files. If it helps, I may be able to share the model and audio separately.
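For anyone trying to confirm the same symptom across many files, a check along these lines may help. It is only a minimal sketch, assuming the decoder output has already been parsed into (word, start frame, end frame) entries; the `WordSpan` type, the `find_adjacent_duplicates` helper, and the sample data are all hypothetical and are not part of flashlight or the example binary.

```python
from typing import List, NamedTuple, Tuple


class WordSpan(NamedTuple):
    """Hypothetical record for one decoded word and its frame range."""
    word: str
    start_frame: int
    end_frame: int


def find_adjacent_duplicates(
    words: List[WordSpan],
) -> List[Tuple[WordSpan, WordSpan]]:
    """Return consecutive pairs with the same word but different frame ranges."""
    pairs = []
    for prev, cur in zip(words, words[1:]):
        same_word = prev.word == cur.word
        different_frames = (prev.start_frame, prev.end_frame) != (
            cur.start_frame,
            cur.end_frame,
        )
        if same_word and different_frames:
            pairs.append((prev, cur))
    return pairs


if __name__ == "__main__":
    # Invented data for illustration only; not real decoder output.
    sample = [
        WordSpan("alpha", 0, 10),
        WordSpan("alpha", 12, 20),
        WordSpan("beta", 21, 30),
    ]
    for first, second in find_adjacent_duplicates(sample):
        print(first, second)
```

Note that a speaker genuinely repeating a word would also be flagged, so the result is best compared between the two builds rather than read in isolation.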
Platform and Hardware
Using the Docker container on macOS Big Sur, no GPU.