flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

How to do inference on a new audio file using the trained acoustic model? #564

Closed: ethanabrooks closed this issue 3 years ago

ethanabrooks commented 4 years ago

Hi, I am trying to use these instructions (https://github.com/facebookresearch/wav2letter/wiki/Inference-Run-Examples) to perform ASR on a .wav file of my own.

I can perform successful transcription on one of the LibriSpeech files with the following command:

sudo docker run --rm -v $HOME:/root/host/ -it --ipc=host --name w2l -a stdin -a stdout -a stderr \
wav2letter/wav2letter:inference-latest sh -c \
"/root/wav2letter/build/inference/inference/examples/multithreaded_streaming_asr_example \
--input_files_base_path /root/host/model \
--input_audio_files /root/host/audio/LibriSpeech/dev-clean/777/126732/777-126732-0070.flac.wav \
--output_files_base_path /root/host/audio/LibriSpeech-dev-clean-transcribed"

Then

❯ cat LibriSpeech-dev-clean-transcribed/777-126732-0070.flac.wav.txt
#start (msec), end(msec), transcription
0,1000,
1000,2000,he was out of
2000,3000,his mind with something
3000,4000,he overheard about eating
4000,5000,people's flesh
5000,6000,and drinking blood
6000,7000,what's the good of
7000,7315,of talking like that

However, when I try to run the same command on my audio file (ThomasReedBrooks/segments/out0000000000.wav):

sudo docker run --rm -v $HOME:/root/host/ -it --ipc=host --name w2l -a stdin -a stdout -a stderr \
wav2letter/wav2letter:inference-latest sh -c \
"/root/wav2letter/build/inference/inference/examples/multithreaded_streaming_asr_example \
--input_files_base_path /root/host/model \
--input_audio_files /root/host/audio/ThomasReedBrooks/segments/out0000000000.wav \
--output_files_base_path /root/host/audio/ThomasReedBrooks-dev-clean-transcribed"

I get

❯ cat ThomasReedBrooks-dev-clean-transcribed/out0000000000.wav.txt
#start (msec), end(msec), transcription
0,1000,
1000,2000,h h h h h
2000,3000,h h h h h
3000,4000,
[... 116 further empty one-second rows elided ...]
120000,120066,

I am wondering whether the audio quality of my file is not good enough, or whether the presence of multiple voices/accents is confusing the model. You can hear the audio at this link: https://drive.google.com/file/d/1zQ1lo6vrgkT-M3lLHQkj4NaGZAaunoJi/view?usp=sharing

lunixbochs commented 4 years ago

Try this model: https://talonvoice.com/research/talon-streaming-convnets-1.tar.gz

Also make sure your audio is in the right format: 16-bit, 16 kHz, mono.
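
For example, one quick way to check and fix this (an editor's sketch using SoX, which also appears later in this thread; the filenames are placeholders):

# inspect channels, sample rate, and precision
sox --i input.wav

# convert to 16 kHz, mono, 16-bit if needed
sox input.wav -r 16000 -c 1 -b 16 input-16k.wav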

ethanabrooks commented 4 years ago

Hmm... even on the original dataset I'm getting a segfault:

❯ sudo docker run --rm -v $HOME:/root/host/ -it --ipc=host --name w2l -a stdin -a stdout -a stderr \
wav2letter/wav2letter:inference-latest sh -c \
"/root/wav2letter/build/inference/inference/examples/multithreaded_streaming_asr_example \
--input_files_base_path /root/host/talon-streaming-convnets-1 \
--input_audio_files /root/host/audio/LibriSpeech/dev-clean/777/126732/777-126732-0070.flac.wav \
--output_files_base_path /root/host/audio/LibriSpeech-dev-clean-transcribed"
Will process 1 files.
Started features model file loading ...
Completed features model file loading elapsed time=3072 microseconds

Started acoustic model file loading ...
Completed acoustic model file loading elapsed time=1372 milliseconds

Started tokens file loading ...
Completed tokens file loading elapsed time=1142 microseconds

Tokens loaded - 9998 tokens
Started decoder options file loading ...
terminate called after throwing an instance of 'std::runtime_error'
  what():  failed to open decoder options file=/root/host/talon-streaming-convnets-1/decoder_options.json for reading
Aborted (core dumped)

ethanabrooks commented 4 years ago

I notice that the decoder_options.json file and several others are missing from the talon-streaming-convnets-1/ directory.

❯ tar -xvf talon-streaming-convnets-1.tar.gz
talon-streaming-convnets-1/
talon-streaming-convnets-1/tokens.txt
talon-streaming-convnets-1/feature_extractor.bin
talon-streaming-convnets-1/acoustic_model.bin

Should I just copy them over from the one I was using previously?

lunixbochs commented 4 years ago

Actually mine probably won’t work with that decoder as the tokens won’t match. I dunno then. Try the Test binary.

lunixbochs commented 4 years ago

Are you able to share your test audio file? I can run it locally against my models and let you know if it's the audio or some other problem.

ethanabrooks commented 4 years ago

You can hear the audio with this link: https://drive.google.com/file/d/1zQ1lo6vrgkT-M3lLHQkj4NaGZAaunoJi/view?usp=sharing

Thanks!

lunixbochs commented 4 years ago

That link doesn’t appear to be public

ethanabrooks commented 4 years ago

Apologies. Let's try this link: https://drive.google.com/file/d/14sbwdVObkSMFSmF9N9FypyG9eVkHPYTl/view?usp=sharing

lunixbochs commented 4 years ago

That file is a bit noisy/quiet and has two speakers, so I'm not entirely surprised. But some model should recognize the last half. I'll take a look shortly.

ethanabrooks commented 4 years ago

Excellent. Thank you!

lunixbochs commented 4 years ago

This is your problem, as I guessed:

  Channels: 2 @ 25-bit   
Samplerate: 48000Hz      
facebook convnet:
|P|: h h

my convnet:
|P|: 

# converted the file as follows:
$ sox out0000000000.wav -r 16000 -c 1 -b 16 out0000000000.flac

facebook convnet:
|P|: thought talking not just a normal voice as when good
but this has an interview with thomas reed brooks conducted on monday august

my convnet:
|P|: talking about just normal of voice imagine you ago
but this is an interview with thomas reed brooks andd conducted on monday august

(This is just Test output, no language model or Decoder was used)

ethanabrooks commented 4 years ago

Wow thank you so much. So should I just convert the files and use the instructions I referenced? Or would you recommend a different model? It seems like a language model would be beneficial, no?

ethanabrooks commented 4 years ago

Hmmm @lunixbochs, am I doing this correctly? It looks like I'm getting the same blank output as before, even after converting the file.

❯ sudo docker run --rm -v $HOME:/root/host/ -it --ipc=host --name w2l -a stdin -a stdout -a stderr \
wav2letter/wav2letter:inference-latest sh -c \
"/root/wav2letter/build/inference/inference/examples/multithreaded_streaming_asr_example \
--input_files_base_path /root/host/model \
--input_audio_files /root/host/audio/ThomasReedBrooks/segments/out0000000000.flac \
--output_files_base_path /root/host/audio/ThomasReedBrooks-dev-clean-transcribed"

❯ cat ThomasReedBrooks-dev-clean-transcribed/out0000000000.flac.txt
#start (msec), end(msec), transcription
0,1000,
1000,2000,h h h h h h h
2000,3000,h h h h
3000,4000,
4000,5000,
5000,5687,
5687,5687,

❯ sox --i ThomasReedBrooks/segments/out0000000000.flac

Input File     : 'ThomasReedBrooks/segments/out0000000000.flac'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:10.01 = 160085 samples ~ 750.398 CDDA sectors
File Size      : 182k
Bit Rate       : 146k
Sample Encoding: 16-bit FLAC
Comment        : 'Comment=Processed by SoX'
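
(An editorial aside, not confirmed in this thread: the streaming examples may only parse 16-bit WAV input, in which case a correctly resampled FLAC could still decode to garbage. Re-converting to WAV instead of FLAC would rule that out; the output name here is arbitrary:)

sox ThomasReedBrooks/segments/out0000000000.wav -r 16000 -c 1 -b 16 out0000000000-16k.wav
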
lunixbochs commented 4 years ago

Facebook's released models are more research-y, mine are slightly more real world. You can't use the streaming frontend without a language model right now, and I don't have a language model trained on my streaming convnet model.

However if you look at their results, a language model didn't have a very big impact on the WER for streaming convnets: https://github.com/facebookresearch/wav2letter/tree/master/recipes/models/streaming_convnets/librispeech#results

lunixbochs commented 4 years ago

I don't know anything about the streaming frontend. It should work if you use the Test frontend on the whole file at once (which requires the model from the recipe and not the model from the inference wiki)
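
For reference, a hedged sketch of what a Test run could look like (an editorial example: the flag names follow the wav2letter recipe documentation of that era, the Test binary comes from the full wav2letter build rather than the inference examples, and all paths are placeholders):

# a .lst file has one line per sample: <id> <audio path> <duration in ms> <transcription>
echo "out0 /root/host/audio/out0000000000-16k.wav 10010 unknown" > /root/host/lists/test.lst

/root/wav2letter/build/Test \
  --am=/root/host/model/am_500ms_future_context_dev_other.bin \
  --tokensdir=/root/host/model \
  --tokens=librispeech-train-all-unigram-10000.tokens \
  --lexicon=/root/host/model/lexicon.txt \
  --test=/root/host/lists/test.lst \
  --maxload=-1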

ethanabrooks commented 4 years ago

Hmmm I see. That's interesting. So would I need to use your model in order to avoid the problem I'm having with no output? Was it any of the ones linked here? https://github.com/lunixbochs/wav2letter/tree/master/models

lunixbochs commented 4 years ago

I ran Test with this model (the results I posted labeled facebook convnet): https://github.com/facebookresearch/wav2letter/tree/master/recipes/models/streaming_convnets/librispeech#pre-trained-acoustic-models

ethanabrooks commented 4 years ago

I think I might be getting tripped up on the file structure, but whenever I try to run the docker command I get a segfault. I've scanned through the docs and I didn't see instructions about what file structure the program expects for the model. Right now I am doing this:

curl https://dl.fbaipublicfiles.com/wav2letter/streaming_convnets/librispeech/models/am/am_500ms_future_context_dev_other.bin > ~/model/facebook-convnet/acoustic_model.bin

curl https://dl.fbaipublicfiles.com/wav2letter/streaming_convnets/librispeech/models/lm/3-gram.pruned.3e-7.bin.qt > ~/model/facebook-convnet/language_model.bin

curl https://dl.fbaipublicfiles.com/wav2letter/streaming_convnets/librispeech/librispeech-train-all-unigram-10000.tokens > ~/model/facebook-convnet/ltokens.txt

curl https://dl.fbaipublicfiles.com/wav2letter/tds/librispeech/librispeech-train%2Bdev-unigram-10000-nbest10.lexicon > ~/model/facebook-convnet/lexicon.txt

curl https://dl.fbaipublicfiles.com/wav2letter/streaming_convnets/librispeech/am_500ms_future_context.arch > ~/model/facebook-convnet/tds_streaming.arch

then

❯ sudo docker run --rm -v $HOME:/root/host/ -it --ipc=host --name w2l -a stdin -a stdout -a stderr \
wav2letter/wav2letter:inference-latest sh -c \
"/root/wav2letter/build/inference/inference/examples/simple_streaming_asr_example \
--input_files_base_path /root/host/model/facebook-convnet \
/root/host/audio/ThomasReedBrooks/segments/out0000000000.flac"
Started features model file loading ...
terminate called after throwing an instance of 'std::runtime_error'
  what():  failed to open feature file=/root/host/model/facebook-convnet/feature_extractor.bin for reading
Aborted (core dumped)

So I ran

wget http://dl.fbaipublicfiles.com/wav2letter/inference/examples/model/feature_extractor.bin

In the hopes that it would also work for this model. Next I ran

❯ sudo docker run --rm -v $HOME:/root/host/ -it --ipc=host --name w2l -a stdin -a stdout -a stderr \
wav2letter/wav2letter:inference-latest sh -c \
"/root/wav2letter/build/inference/inference/examples/simple_streaming_asr_example \
--input_files_base_path /root/host/model/facebook-convnet \
/root/host/audio/ThomasReedBrooks/segments/out0000000000.flac"
Started features model file loading ...
Completed features model file loading elapsed time=3135 microseconds

Started acoustic model file loading ...
terminate called after throwing an instance of 'cereal::Exception'
  what():  Failed to read 4 bytes from input stream! Read 0
Aborted (core dumped)
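
For reference, piecing together the error messages in this thread with the file names from the talon archive and the renames above, the inference examples appear to look for a fixed set of file names under --input_files_base_path (a reconstruction from this thread, not official documentation):

model/
  feature_extractor.bin   # opened first ("failed to open feature file ...")
  acoustic_model.bin      # opened second
  tokens.txt              # tokens file
  lexicon.txt             # assumed, from the renames above
  language_model.bin      # assumed, from the renames above
  decoder_options.json    # opened by the streaming decoder
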
lunixbochs commented 4 years ago

Again, am_500ms_future_context_dev_other.bin does not work with the streaming frontend; it is for the Test or Decode binary.

ethanabrooks commented 4 years ago

I'm sorry, maybe I'm getting confused by the terminology. I'm not really clear on what the "streaming frontend" is. Is there documentation for running the model you are describing, @lunixbochs?

abhinavkulkarni commented 4 years ago

@ethanabrooks: Looks like your input audio file is not in the correct format.

wav2letter expects the input audio file to be mono-channel, sampled at 16 kHz.

You can convert an audio.mp3 to audio.wav as follows, where 256k is the bitrate:

ffmpeg -i audio.mp3 -ar 16000 -ac 1 -ab 256k -f wav audio.wav

The pre-trained models should work on the audio.wav file.

Here's how you would do streaming inference on microphone audio:

ffmpeg -hide_banner -loglevel error -f alsa -i default -ar 16000 -ac 1 -ab 256k -f wav - | /root/wav2letter/build/inference/inference/examples/simple_streaming_asr_example --input_files_base_path=/root/host/model/
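
Applied to the file from this thread, the same conversion plus a format check might look like this (a sketch; the output name is arbitrary):

ffmpeg -i out0000000000.wav -ar 16000 -ac 1 -ab 256k -f wav out0000000000-16k.wav
sox --i out0000000000-16k.wav   # should now report Channels: 1, Sample Rate: 16000, Precision: 16-bit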