Wait, the Huggingface tokenizer has such a flag!? We've been having this issue with WPE models for so long. FYI @VahidooX
We need to be a bit careful, since that tokenizer is shared with NLP and NMT, so we don't want to ruin their models. Perhaps there can be some clean way of handling this for ASR alone (because, by default, we never want spaces to be injected by a tokenizer).
@titu1994 What I was suggesting was simply to add a cleanup step before returning the result in https://github.com/NVIDIA/NeMo/blob/91e6c2552a9ca8d00bf8ce758adea68d9c49f803/nemo/collections/asr/metrics/wer.py#L188-L213 and in the corresponding BPE version https://github.com/NVIDIA/NeMo/blob/91e6c2552a9ca8d00bf8ce758adea68d9c49f803/nemo/collections/asr/metrics/wer_bpe.py#L147-L172. If my understanding of the codebase is correct, this would affect only ASR. Am I mistaken?
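Something along these lines (a rough sketch only; collapse_spaces is a hypothetical helper and its placement inside ctc_decoder_predictions_tensor is my assumption, not existing NeMo code):

```python
# Rough sketch only: collapse_spaces is a hypothetical helper, not existing NeMo code.
def collapse_spaces(hypothesis: str) -> str:
    # Collapse any run of whitespace into a single space (also trims the ends).
    return " ".join(hypothesis.split())

# Hypothetical placement, just before ctc_decoder_predictions_tensor returns:
# hypotheses = [collapse_spaces(h) for h in hypotheses]
```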
No, we won't be performing pre- and post-processing steps in ASR tokenizers. Our recommendation is to not use the Hugging Face WPE tokenizer for anything other than Librispeech.
Note that this issue is just with the WPE tokenizer; SentencePiece tokenizers don't have it.
Strange you say this is just with the WPE tokenizer, as I am using a SentencePiece unigram tokenizer.
More so, as IMHO the issue is not caused by the tokenizer itself, but by the approach with which ctc_decoder_predictions_tensor drops blank_id tokens, i.e. it does not handle the special case where a blank_id separates two space tokens. This means that as long as the tokenizer can output a special space token, the issue can happen.
If there's a blank between two spaces, we consider it a double space token, as always; there are no special cases for space vs. other tokens during CTC/RNNT decoding. Only blank is treated as a special case.
You are right! I checked the SentencePiece tokenizers; they have the space "▁" in the vocab as a single word-piece.
In SentencePiece, ▁ itself is overloaded: by itself it means subword start, and when used as ▁word, it means a space followed by "word". ▁▁ defines a double space that may exist in some incorrect text (or maybe the text actually needs a double space for some reason, but it's odd that SentencePiece would choose such a representation rather than use two separate ▁ pieces).
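As a rough illustration (a sketch using the standard join-and-replace detokenization rule, not NeMo's exact code path), both the single ▁▁ piece and two separate ▁ pieces already detokenize to a double space:

```python
# Standard SentencePiece detokenization rule: join the pieces, then map "▁" (U+2581) to a space.
def detokenize(pieces):
    return "".join(pieces).replace("▁", " ")

print(repr(detokenize(["▁to", "▁me"])))       # ' to me'
print(repr(detokenize(["▁to", "▁", "▁me"])))  # ' to  me'  (two separate ▁ pieces)
print(repr(detokenize(["▁to", "▁▁", "me"])))  # ' to  me'  (the single ▁▁ piece)
```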
That's all true, and I see no problem with this. On the other hand, I see that ctc_decoder_predictions_tensor, when decoding predictions, goes over them, compresses runs of equal tokens into a single token, and drops the blank tokens (equal tokens separated by a blank are kept apart). One example would be:
a a a b b [blank] b o t [blank] t --> a b b o t t
This procedure will lead to double or multiple spaces, for example in
t t t o o _ _ [blank] _ _ m e e --> t o _ _ m e
or
t t t o o _ _ _b _b e e --> t o _ _b e
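For concreteness, a minimal sketch of that greedy collapse (just the rule described above, i.e. merge repeats not separated by a blank and then drop blanks; not the actual NeMo implementation) reproduces both examples:

```python
BLANK = "[blank]"

def greedy_ctc_collapse(frames):
    # Merge runs of identical symbols, then drop blanks (sketch of greedy CTC decoding).
    out, prev = [], None
    for sym in frames:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return out

print(greedy_ctc_collapse(["a", "a", "a", "b", "b", BLANK, "b", "o", "t", BLANK, "t"]))
# ['a', 'b', 'b', 'o', 't', 't']
print(greedy_ctc_collapse(["t", "t", "t", "o", "o", "_", "_", BLANK, "_", "_", "m", "e", "e"]))
# ['t', 'o', '_', '_', 'm', 'e']  -> two consecutive space tokens survive
```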
I am assuming detokenization is performed as described in SentencePiece, i.e. first joining the pieces and then replacing ▁ with a space: detokenized = ''.join(pieces).replace('▁', ' ').
IMHO, the real question here is why the ASR would separate spaces (silences) with a [blank] symbol if there are no examples with double/multiple consecutive spaces in the training set. And also, in the second example above, is it still correct to leave a double space as the result? Note that I am only hypothesising here, but in this specific case one space could be the result of the ASR recognising silence and the other the result of the ASR recognising the start of a word that happens to start with a token that has _ prepended.
While analysing datasets via SpeechDataExplorer I discovered that transcripts (in our case generated with Conformer CTC BPE) occasionally contain multiple consecutive spaces. This influences the computation of CER, but is also very confusing because the text diff section in SDE does not show any visual difference. The text diff also fails to show differences for UTF-8 symbols that appear in different forms in text and pred_text (composed vs. decomposed), yet the difference is correctly taken into account when computing WER and CER, so one is left with a mismatch and no visible difference between the two transcripts.

Tracking down the source of the multiple consecutive spaces, I see that this can occur when a blank_id token separates consecutive space characters: https://github.com/NVIDIA/NeMo/blob/91e6c2552a9ca8d00bf8ce758adea68d9c49f803/nemo/collections/asr/metrics/wer.py#L162-L165

Perhaps, following Huggingface tokenizer.decode, a flag clean_up_tokenization_spaces with the default value of True could be added to decode_tokens_to_str and decode_ids_to_tokens. The sole purpose of this flag would be to clean up the resulting data of multiple consecutive space symbols. If you agree I can make a PR.
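As a sketch of what I mean (the flag name is borrowed from Huggingface; the snippet below is hypothetical and only illustrates the intended behaviour, not NeMo's actual decode_tokens_to_str implementation):

```python
import re

_MULTISPACE = re.compile(r" {2,}")

def clean_up_spaces(text: str) -> str:
    # The only thing the proposed flag would do: collapse runs of consecutive spaces.
    return _MULTISPACE.sub(" ", text)

# Hypothetical use inside decode_tokens_to_str / decode_ids_to_tokens:
#   if clean_up_tokenization_spaces:  # proposed default: True
#       text = clean_up_spaces(text)
```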