flashlight / text

Text utilities, including beam search decoding, tokenizing, and more, built for use in Flashlight.
MIT License

Is the sil token necessary? #33

Closed: maxwellzh closed this issue 1 year ago

maxwellzh commented 1 year ago

Question

There are cases with no sil token, such as SentencePiece tokenizers and Asian languages where sequences consist only of characters (Chinese, Japanese, etc.). And for cases with a sil token, I think sil is no different from a meaningful token like a-z.

AFAIK, other implementations like pyctcdecode and NVIDIA's NeMo do not consider sil in beam search. Can you explain why it was introduced in Flashlight?

kamirdin commented 1 year ago

I have the same question about sil ("|" in the example). In NVIDIA's NeMo, they pass "unk_id" to flashlight.lib.text.decoder.Trie and LexiconDecoder instead of the word boundary "|". Is this right? When I need to decode a language like Chinese, where "|" is not modeled in the AM, what should I define as sil in LexiconDecoder?

jacobkahn commented 1 year ago

@kamirdin @maxwellzh: many acoustic models are trained with silence tokens in their vocabulary sets. The blank token exists to enable proper encoding of repeated characters. It is distinct from word separators, which are used at training time to facilitate proper tokenization.
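To illustrate why the blank token matters for repeated characters, here is a minimal sketch of the standard CTC collapsing rule (merge consecutive repeats, then drop blanks). The function name `ctc_collapse` and the use of "-" as the blank symbol are assumptions made for this example; this is not code from Flashlight.

```python
def ctc_collapse(frames, blank="-"):
    """Collapse a frame-level CTC path into an output string.

    Consecutive duplicate tokens are merged, then blanks are removed.
    `blank="-"` is an arbitrary choice for this illustration.
    """
    out = []
    prev = None
    for tok in frames:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return "".join(out)


# A blank between the two runs of "l" keeps them as separate characters.
with_blank = ctc_collapse("hhe-ll-lloo")
# Without a blank, the repeated "l" frames merge into a single "l".
without_blank = ctc_collapse("hhelllloo")
```

Here `with_blank` keeps the double "l" while `without_blank` loses it, which is the role the blank plays. A word separator such as "|" has no such effect on collapsing.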

The sil token in the case of the CTC decoder is a special token which has a dedicated score (e.g. silScore). To use the decoder without the sil token, it's easy enough to set the silScore to zero, or to set the index of the sil token to one that should never be emitted by an acoustic model, e.g. -1. In the case of NeMo, the desired behavior is for the model to not decode unknown tokens, so they're treated as silence/boundary tokens.
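As a rough illustration of the two options above, here is a toy scoring sketch. It is not Flashlight's implementation or API: `ToyDecoderConfig` and `path_score` are hypothetical names, and the only assumption modeled is that a dedicated silence score is added whenever the emitted token index equals the configured sil index.

```python
from dataclasses import dataclass


@dataclass
class ToyDecoderConfig:
    """Hypothetical stand-in for decoder options (not Flashlight's API)."""
    sil_idx: int      # token index treated as silence / word boundary
    sil_score: float  # extra score added each time silence is emitted


def path_score(token_ids, am_scores, cfg):
    """Sum acoustic scores along a path, adding sil_score per silence token."""
    total = 0.0
    for tok, am in zip(token_ids, am_scores):
        total += am
        if tok == cfg.sil_idx:
            total += cfg.sil_score
    return total


path = [3, 0, 5]        # token 0 plays the role of silence in this toy
am = [-1.0, -0.5, -2.0]  # made-up acoustic log scores

# Silence is active: the dedicated score changes the path total.
with_sil = path_score(path, am, ToyDecoderConfig(sil_idx=0, sil_score=-0.25))
# Option 1: keep the index but set the silence score to zero.
zero_score = path_score(path, am, ToyDecoderConfig(sil_idx=0, sil_score=0.0))
# Option 2: point sil_idx at an index the model never emits, e.g. -1.
unreachable = path_score(path, am, ToyDecoderConfig(sil_idx=-1, sil_score=-0.25))
```

In this toy, both options give the same total as a decoder with no silence handling at all, since the silence score never contributes.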

In the case of Chinese/Japanese/character-based languages, the sil token can be set to an unreachable token ID and its score set to zero, as detailed above, if the model doesn't need to emit silence. For some applications, silence can be helpful (such as for voice activity detection, or VAD), but its use isn't required.