flashlight / text

Text utilities, including beam search decoding, tokenizing, and more, built for use in Flashlight.

Does the CTC decoder in Flashlight support word piece decoding without a lexicon? #66

Closed: Squire-tomsk closed this issue 1 year ago

Squire-tomsk commented 1 year ago

I noticed that in issue #1031 in the flashlight repo, @jacobkahn mentioned that Flashlight supports word piece/sentencepiece/BPE decoding, but requires a lexicon with word spellings in terms of word pieces. However, I am curious whether the CTC decoder in Flashlight can also support lexicon-free decoding with subword tokens.

If it does support subword tokens, what are the requirements for the tokens list and the language model? Should the language model be trained on words, or on subword units as in NeMo, where subword units are mapped to special symbols and the language model is trained on those symbols? Thank you for your help.

jacobkahn commented 1 year ago

@Squire-tomsk — the answer is: yes! The lexicon-free CTC decoder absolutely supports decoding with subword tokens. It can support decoding with any set of tokens.

Your intuition as to the requirements for the tokens list is correct — the acoustic model should be trained on the same subword units that will be used for decoding. It's possible to create arbitrary mappings between special symbols and the subword units you're using to decode.

In the case of using an external language model, it should be able to produce scores over sequences of those subword units, or over symbols that the units are mapped to. It is also possible to use a word-level language model in this setting, scoring sequences of subword units whenever they complete a word; this is tricky to do without a lexicon, since the beam may contain token sequences that don't map to any word.

Regardless, the token set used for the acoustic model and the Dictionary used at decoding time should be identical, and the language model can be trained on any superset of those tokens.