isaacleeai closed this issue 5 years ago
If I understand you properly, what you're proposing isn't a good idea: "token" files are used as the emission set over which the acoustic model (AM) outputs a multinomial distribution. In your case, the AM would thus emit probabilities over the space of "hello," "how are you doing," and "not to well." Typically, the entries in a token file are letters, subword units (word pieces, etc.), or in some cases, individual words. Final transcriptions themselves are constructed by the decoder, which uses a lexicon and a language model to score possible sequences given the AM emissions.
Input audio files are split up automatically into batches of frames, so assuming your file size isn't unreasonably large, everything should work. A few megabytes should be alright.
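To get a rough feel for what "a few megabytes" means in frames, here is a back-of-the-envelope sketch. Everything numeric in it is an assumption rather than something stated in this thread: uncompressed 16 kHz, 16-bit mono PCM, a 44-byte WAV header, and a 10 ms frame stride. The helper name `estimate_audio` is hypothetical.

```python
def estimate_audio(size_bytes, sample_rate=16000, bytes_per_sample=2,
                   channels=1, header_bytes=44, frame_stride_ms=10.0):
    """Estimate duration (seconds) and frame count from a WAV file size.

    All defaults are assumptions: 16 kHz, 16-bit mono PCM, a 44-byte
    header, and one feature frame every 10 ms.
    """
    payload = max(size_bytes - header_bytes, 0)
    duration_s = payload / (sample_rate * bytes_per_sample * channels)
    n_frames = int(duration_s * 1000 / frame_stride_ms)
    return duration_s, n_frames


# A 3 MiB file under the assumptions above.
duration_s, n_frames = estimate_audio(3 * 1024 * 1024)
print(f"{duration_s:.1f} s, {n_frames} frames")
```

Under these assumptions a 3 MiB file is about a minute and a half of audio, so the sequence length the model sees grows with duration rather than with file size alone.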
The example I have above is a .wrd file, not a token file.
A token file with multiple sentences would look like this:
h e l l o
h o w | a r e | y o u | d o i n g
n o t | t o o | w e l l
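Lines in that shape can be produced mechanically from plain sentences. The following is a minimal sketch, assuming letter-level tokens and `|` as the word separator exactly as in the example above; the function name `sentence_to_token_line` is hypothetical and is not part of any toolkit.

```python
def sentence_to_token_line(sentence: str, word_sep: str = "|") -> str:
    """Spell each word letter by letter and join words with the separator."""
    words = sentence.lower().split()
    spelled = [" ".join(word) for word in words]  # "how" -> "h o w"
    return f" {word_sep} ".join(spelled)


sentences = ["hello", "how are you doing", "not too well"]
token_lines = [sentence_to_token_line(s) for s in sentences]
for line in token_lines:
    print(line)
```

This prints one token line per sentence, matching the three lines shown above.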
To clarify my question 1, would there be an inherent problem with having multiple sentences per .tkn, .wav, and .wrd file?
What is your goal here? The .wrd file is intended to function as a lexicon for the decoder. It defines the constrained vocabulary (set of valid words) over which the decoder will output words as part of a transcription.
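As an illustration of what a constrained vocabulary amounts to, here is a small sketch that collects the unique words from a set of transcripts and pairs each with a letter spelling. The function name `build_lexicon` and the tab-separated printout are assumptions made for illustration; the exact file layout the decoder expects is not specified in this thread.

```python
def build_lexicon(transcripts):
    """Map each unique word to its letter-by-letter spelling."""
    lexicon = {}
    for line in transcripts:
        for word in line.lower().split():
            # setdefault keeps the first spelling seen for each word
            lexicon.setdefault(word, " ".join(word))
    return lexicon


transcripts = ["hello", "how are you doing", "not too well"]
lexicon = build_lexicon(transcripts)
for word in sorted(lexicon):
    print(f"{word}\t{lexicon[word]}")
```

Any word absent from this mapping could not appear in a decoded transcription, which is the sense in which the vocabulary is constrained.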
There are a few potential (but likely surmountable) issues with the approach you're suggesting:
cc @xuqiantong who can provide more insight.
I am planning on training / validating / testing with audio files with multiple sentences.
I have a few concerns:
1. When preparing .tkn and .wrd files, should we put each sentence on separate lines? Example of .wrd
2. Will there be a problem with GPU RAM since the size of each audio file is multiple times bigger? (None of them exceed tens of megabytes, though.)