flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

`.wav` files with multiple sentences #197

Closed isaacleeai closed 5 years ago

isaacleeai commented 5 years ago

I am planning on training / validating / testing with audio files with multiple sentences.

I have a few concerns:

  1. When preparing .tkn and .wrd files, should we put each sentence on a separate line? Example of a .wrd file:

    hello
    how are you doing
    not too well
  2. Will there be a problem with GPU RAM, since each audio file is several times larger? (None of them exceeds tens of megabytes, though.)

jacobkahn commented 5 years ago

If I understand you properly, what you're proposing isn't a good idea: "token" files are used as the emission set over which the acoustic model (AM) outputs a multinomial distribution. In your case, the AM would thus emit probabilities over the space of "hello," "how are you doing," and "not too well." Typically, these token files contain letters, subword units (word pieces, etc.), or in some cases, individual words. Final transcriptions themselves are constructed by the decoder, which uses a lexicon and a language model to score possible sequences given the AM emissions.

Input audio files are split up automatically into batches of frames, so assuming your file size isn't unreasonably large, everything should work. A few megabytes should be alright.
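
To get a rough sense of how file size maps to frames, here is a minimal back-of-the-envelope sketch. It is not wav2letter code: the function name `estimate_frames` is hypothetical, and the defaults (16 kHz, 16-bit mono PCM, a 10 ms frame stride, WAV header ignored) are assumptions that may not match your data or feature configuration.

```python
def estimate_frames(num_bytes, sample_rate=16000, bytes_per_sample=2,
                    channels=1, stride_ms=10):
    """Estimate audio duration and feature-frame count from raw PCM size.

    Assumes uncompressed PCM and ignores the small WAV header.
    All defaults are illustrative assumptions, not wav2letter settings.
    """
    seconds = num_bytes / (sample_rate * bytes_per_sample * channels)
    frames = int(seconds * 1000 / stride_ms)
    return seconds, frames


# A 10 MB file (10,000,000 bytes) under the assumptions above.
seconds, frames = estimate_frames(10 * 1000 * 1000)
print(f"{seconds:.1f} s of audio, about {frames} frames")
```

Under these assumptions, a file in the tens of megabytes is several minutes of audio in a single sample, which is worth keeping in mind when choosing a batch size.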

isaacleeai commented 5 years ago

The example I have above is a .wrd file, not a token file. A token file with multiple sentences would look like this:

h e l l o
h o w | a r e | y o u | d o i n g
n o t | t o o | w e l l

To clarify my question 1, would there be an inherent problem with having multiple sentences per .tkn, .wav, and .wrd file?
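
The mapping from a word-level line to the letter-level token line shown above can be sketched as follows. This is only an illustration of the format in this comment: `words_to_tokens` is a hypothetical helper, not part of wav2letter, and it assumes letters are the token unit and `|` is the word separator.

```python
def words_to_tokens(transcript, word_sep="|"):
    """Spell out each word letter by letter, joining words with word_sep.

    Assumes letter tokens and a "|" word boundary, as in the example
    above; other token sets (e.g. word pieces) would need different logic.
    """
    words = transcript.split()
    return f" {word_sep} ".join(" ".join(word) for word in words)


# Convert each .wrd-style line into its .tkn-style counterpart.
wrd_lines = ["hello", "how are you doing", "not too well"]
tkn_lines = [words_to_tokens(line) for line in wrd_lines]
for line in tkn_lines:
    print(line)
```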

jacobkahn commented 5 years ago

What is your goal here? The .wrd file is intended to function as a lexicon for the decoder. It defines the constrained vocabulary (set of valid words) over which the decoder will output words as part of a transcription.

There are a few potential (but likely surmountable) issues with the approach you're suggesting:

cc @xuqiantong who can provide more insight.