`.wav` files with multiple sentences

isaacleeai commented 5 years ago

I am planning on training / validating / testing with audio files with multiple sentences.

I have a few concerns:

1. When preparing the .tkn and .wrd files, should each sentence go on a separate line? An example .wrd file:
```
hello
how are you doing
not too well
```
2. Will GPU RAM be a problem, since each audio file is several times larger than a single-sentence file? (None of them exceeds tens of megabytes, though.)

jacobkahn commented 5 years ago

If I understand you properly, what you're proposing isn't a good idea: "token" files are used as the emission set over which the acoustic model (AM) outputs a multinomial distribution. In your case, the AM would thus emit probabilities over the space of "hello," "how are you doing," and "not too well." Typically, these token files contain letters, subword units (word pieces, etc.), or in some cases, individual words. Final transcriptions themselves are constructed by the decoder, which uses a lexicon and a language model to score possible sequences given the AM emissions.

Input audio files are split up automatically into batches of frames, so assuming your file size isn't unreasonably large, everything should work. A few megabytes should be alright.
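
To get a rough sense of how file size relates to the number of frames, here is a minimal sketch. It assumes uncompressed 16-bit mono PCM at 16 kHz with a 44-byte header, and a 25 ms window with a 10 ms stride. None of these values come from this thread or from any particular wav2letter configuration, and `estimate_frames` is a hypothetical helper, so substitute your own settings.

```python
def estimate_frames(num_bytes, sample_rate=16000, bytes_per_sample=2,
                    channels=1, window_ms=25, stride_ms=10, header_bytes=44):
    """Estimate duration (seconds) and feature-frame count for a PCM .wav file.

    All defaults are assumptions (16 kHz, 16-bit mono, 25 ms / 10 ms framing).
    """
    payload = max(num_bytes - header_bytes, 0)
    num_samples = payload // (bytes_per_sample * channels)
    duration_s = num_samples / sample_rate

    window = sample_rate * window_ms // 1000   # samples per frame
    stride = sample_rate * stride_ms // 1000   # samples between frame starts
    if num_samples < window:
        return duration_s, 0
    num_frames = 1 + (num_samples - window) // stride
    return duration_s, num_frames


# A hypothetical 10 MB file under the assumptions above.
duration, frames = estimate_frames(10 * 1024 * 1024)
print(f"{duration:.1f} s of audio, about {frames} frames")
```

Under these assumptions, a file of tens of megabytes holds several minutes of audio, far longer than a typical single-sentence utterance, so it is worth checking the frame count against what fits in memory at your batch size.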

isaacleeai commented 5 years ago

The example I have above is a .wrd file, not a token file. A token file with multiple sentences would look like this:

h e l l o
h o w | a r e | y o u | d o i n g
n o t | t o o | w e l l

To clarify my question 1: would there be an inherent problem with having multiple sentences per .tkn, .wav, and .wrd file?
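
For reference, the mapping between the .wrd lines and the token lines shown above can be sketched as follows. This is only an illustration of the letter-level format used in this thread (letters separated by spaces, `|` between words); `words_to_tokens` is a hypothetical helper, not part of wav2letter.

```python
def words_to_tokens(transcript, word_sep="|"):
    """Spell a word-level transcript as space-separated letters,
    with `word_sep` marking word boundaries."""
    words = transcript.split()
    return f" {word_sep} ".join(" ".join(word) for word in words)


wrd_lines = ["hello", "how are you doing", "not too well"]
tkn_lines = [words_to_tokens(line) for line in wrd_lines]

# The set of distinct tokens is what the acoustic model would emit over.
token_set = sorted({tok for line in tkn_lines for tok in line.split()})

for line in tkn_lines:
    print(line)
print(len(token_set), "distinct tokens")
```

With letter-level tokens, the set of distinct tokens stays small (the letters plus the word separator) no matter how many sentences the transcripts contain.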

jacobkahn commented 5 years ago

What is your goal here? The .wrd file is intended to function as a lexicon for the decoder. It defines the constrained vocabulary (set of valid words) over which the decoder will output words as part of a transcription.

There are a few potential (but likely surmountable) issues with the approach you're suggesting:

Leveraging the language model in decoding will be more difficult when you aren't decoding at the word level: you'll need to hack it to compute n-gram probabilities for various sequences of words, which will require some manual adjustments and may not work in general, depending on how the language model was trained and the size of your search space.
The current decoder implementation will insert each entry into the trie in its entirety. Since it isn't designed to discriminate between multiple words in the same lexicon entry, the trie will be quite large, memory intensive, and slow to search since the average size of each element in the lexicon is large (each entry consists of multiple words).
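
To illustrate the trie-size point, here is a toy sketch that compares a lexicon of whole-sentence entries with a lexicon of single-word entries. It uses a plain dict-based character trie, which is an assumption made for illustration and not the decoder's actual data structure. The sentences and the `count_trie_nodes` helper are hypothetical.

```python
def count_trie_nodes(entries):
    """Insert each entry character by character into a dict-based trie
    and return the number of nodes created (root excluded)."""
    root = {}
    nodes = 0
    for entry in entries:
        node = root
        for ch in entry:
            if ch not in node:
                node[ch] = {}
                nodes += 1
            node = node[ch]
    return nodes


sentences = ["how are you doing", "how are you today", "are you doing well"]

# One lexicon entry per sentence, with "|" standing in for word boundaries.
sentence_entries = [s.replace(" ", "|") for s in sentences]
# One lexicon entry per distinct word.
word_entries = sorted({w for s in sentences for w in s.split()})

sentence_nodes = count_trie_nodes(sentence_entries)
word_nodes = count_trie_nodes(word_entries)
print("sentence-level trie nodes:", sentence_nodes)
print("word-level trie nodes:", word_nodes)
```

Even on this tiny example, the sentence-level trie needs more nodes than the word-level one, because words repeated in different positions cannot share a path. The gap would be expected to grow as more sentences are added.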

cc @xuqiantong who can provide more insight.

flashlight / wav2letter

`.wav` files with multiple sentences #197