Wait, the Huggingface tokenizer has such a flag!? We've been having this issue with WPE models for so long. FYI @VahidooX
We need to be a bit careful, since that tokenizer is shared with NLP and NMT, so we don't want to ruin their models. Perhaps there can be some clean way of handling this for ASR alone (because, by default, we never want spaces to be injected by a tokenizer).
@titu1994 What I was suggesting was simply to add a cleanup step before returning the result in https://github.com/NVIDIA/NeMo/blob/91e6c2552a9ca8d00bf8ce758adea68d9c49f803/nemo/collections/asr/metrics/wer.py#L188-L213 and in the corresponding BPE version https://github.com/NVIDIA/NeMo/blob/91e6c2552a9ca8d00bf8ce758adea68d9c49f803/nemo/collections/asr/metrics/wer_bpe.py#L147-L172. If my understanding of the codebase is correct, this would affect only ASR. Am I mistaken?
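Something along these lines (a rough sketch only; collapse_spaces is a hypothetical helper and its placement inside ctc_decoder_predictions_tensor is my assumption, not existing NeMo code):

```python
# Rough sketch only: collapse_spaces is a hypothetical helper, not existing NeMo code.
def collapse_spaces(hypothesis: str) -> str:
    # Collapse any run of whitespace into a single space (also trims the ends).
    return " ".join(hypothesis.split())

# Hypothetical placement, just before ctc_decoder_predictions_tensor returns:
# hypotheses = [collapse_spaces(h) for h in hypotheses]
```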
No, we won't be performing pre- and post-processing steps in ASR tokenizers. Our recommendation is to not use the Hugging Face WPE tokenizer for anything other than Librispeech.
Note that this issue is just with the WPE tokenizer; SentencePiece tokenizers don't have it.
Strange you say this is just with the WPE tokenizer, as I am using a SentencePiece unigram tokenizer.
More so, as IMHO the issue is not caused by the tokenizer itself, but by the approach with which ctc_decoder_predictions_tensor drops blank_id tokens, i.e. it does not handle the special case where a blank_id separates two space tokens. This means that as long as the tokenizer can output a special space token, the issue can happen.
If there's a blank between two spaces, we consider it a double space token, as always; there are no special cases for space vs. other tokens during CTC/RNNT decoding. Only blank is treated as a special case.
You are right! I checked the SentencePiece tokenizers; they have the space "▁" in the vocab as a single word-piece.
In SentencePiece, ▁ itself is overloaded: by itself it means subword start, and when used as ▁word, it means a space followed by "word". ▁▁ defines a double space that may exist in some incorrect text (or maybe the text actually needs a double space for some reason, but it's odd that SentencePiece would choose such a representation rather than use two separate ▁ pieces).
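As a rough illustration (a sketch using the standard join-and-replace detokenization rule, not NeMo's exact code path), both the single ▁▁ piece and two separate ▁ pieces already detokenize to a double space:

```python
# Standard SentencePiece detokenization rule: join the pieces, then map "▁" (U+2581) to a space.
def detokenize(pieces):
    return "".join(pieces).replace("▁", " ")

print(repr(detokenize(["▁to", "▁me"])))       # ' to me'
print(repr(detokenize(["▁to", "▁", "▁me"])))  # ' to  me'  (two separate ▁ pieces)
print(repr(detokenize(["▁to", "▁▁", "me"])))  # ' to  me'  (the single ▁▁ piece)
```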
That's all true, and I see no problem with this. On the other hand, I see that ctc_decoder_predictions_tensor, when decoding predictions, goes over them, compresses runs of equal tokens into a single token, and drops the blank tokens (equal tokens separated by a blank are kept apart). One example would be:
a a a b b [blank] b o t [blank] t --> a b b o t t
This procedure will lead to double or multiple spaces, for example in
t t t o o _ _ [blank] _ _ m e e --> t o _ _ m e
or
t t t o o _ _ _b _b e e --> t o _ _b e
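For concreteness, a minimal sketch of that greedy collapse (just the rule described above, i.e. merge repeats not separated by a blank and then drop blanks; not the actual NeMo implementation) reproduces both examples:

```python
BLANK = "[blank]"

def greedy_ctc_collapse(frames):
    # Merge runs of identical symbols, then drop blanks (sketch of greedy CTC decoding).
    out, prev = [], None
    for sym in frames:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return out

print(greedy_ctc_collapse(["a", "a", "a", "b", "b", BLANK, "b", "o", "t", BLANK, "t"]))
# ['a', 'b', 'b', 'o', 't', 't']
print(greedy_ctc_collapse(["t", "t", "t", "o", "o", "_", "_", BLANK, "_", "_", "m", "e", "e"]))
# ['t', 'o', '_', '_', 'm', 'e']  -> two consecutive space tokens survive
```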
I am assuming detokenization is performed as described in SentencePiece, i.e. first joining the pieces and then replacing ▁ with a space: detokenized = ''.join(pieces).replace('▁', ' ').
IMHO, the real question here is why the ASR would separate spaces (silences) with a [blank] symbol if there are no examples with double/multiple consecutive spaces in the training set. And also, in the second example above, is it still correct to leave a double space as the result? Note that I am only hypothesising here, but in this specific case one space could be the result of the ASR recognising silence and the other the result of the ASR recognising the start of a word that happens to start with a token that has _ prepended.
While analysing datasets via SpeechDataExplorer I discovered that transcripts (in our case generated with Conformer CTC BPE) occasionally contain multiple consecutive spaces. This influences the computation of CER, but is also very confusing because the text diff section in SDE does not show any visual difference. The text diff also fails to show differences for UTF-8 symbols that appear in different forms in text and pred_text (composed vs. decomposed), yet the difference is correctly taken into account when computing WER and CER, so one is left with a mismatch and no visible difference between the two transcripts.

Tracking down the source of the multiple consecutive spaces, I see that this can occur when a blank_id token separates consecutive space characters: https://github.com/NVIDIA/NeMo/blob/91e6c2552a9ca8d00bf8ce758adea68d9c49f803/nemo/collections/asr/metrics/wer.py#L162-L165

Perhaps, following Huggingface tokenizer.decode, a flag clean_up_tokenization_spaces with the default value of True could be added to decode_tokens_to_str and decode_ids_to_tokens. The sole purpose of this flag would be to clean up the resulting data of multiple consecutive space symbols. If you agree I can make a PR.
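As a sketch of what I mean (the flag name is borrowed from Huggingface; the snippet below is hypothetical and only illustrates the intended behaviour, not NeMo's actual decode_tokens_to_str implementation):

```python
import re

_MULTISPACE = re.compile(r" {2,}")

def clean_up_spaces(text: str) -> str:
    # The only thing the proposed flag would do: collapse runs of consecutive spaces.
    return _MULTISPACE.sub(" ", text)

# Hypothetical use inside decode_tokens_to_str / decode_ids_to_tokens:
#   if clean_up_tokenization_spaces:  # proposed default: True
#       text = clean_up_spaces(text)
```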