flashlight / text

Text utilities, including beam search decoding, tokenizing, and more, built for use in Flashlight.

Does the CTC decoder in Flashlight support word piece decoding without a lexicon? #66

Closed: Squire-tomsk closed this issue 1 year ago

Squire-tomsk commented 1 year ago

I noticed that in issue #1031 in the flashlight repo, @jacobkahn mentioned that Flashlight supports word piece/sentencepiece/BPE decoding, but requires a lexicon with word spellings in terms of word pieces. However, I am curious whether the CTC decoder in Flashlight can also support lexicon-free decoding with subword tokens.

If it does support subword tokens, what are the requirements for the tokens list and the language model? Should the language model be trained on words, or on subword units as in NeMo, where subword units are mapped to special symbols and the language model is trained on those symbols? Thank you for your help.

jacobkahn commented 1 year ago

@Squire-tomsk — the answer is: yes! The lexicon-free CTC decoder absolutely supports decoding with subword tokens. It can support decoding with any set of tokens.

Your intuition as to the requirements for the tokens list is correct — the acoustic model should be trained on the same subword units that will be used for decoding. It's possible to create arbitrary mappings between special symbols and the subword units you're using to decode.

In the case of using an external language model, it should be able to produce scores over sequences of those subword units, or over symbols that the units are mapped to. It is also possible to use a word-level language model in this setting, scoring sequences of subword units whenever they complete a word; this is tricky to do without a lexicon, since the beam may contain token sequences that don't map to any word.

Regardless, the token set used for the acoustic model and the Dictionary used at decoding time should be identical, and the language model can be trained on any superset of those tokens.