NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

NeMo no longer supports transcripts with diacritics #3795

Closed · piraka9011 closed this issue 2 years ago

piraka9011 commented 2 years ago

Describe the bug

I am training an Arabic model with diacritics. Digitally, each diacritic is represented as a separate (unicode) character from the actual letter. Here's an example of the vocabulary/text that we are using. There are 10,878 unique words in the vocabulary (so large enough for a SPE tokenizer with a vocab_size of 1024).
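
For illustration, here is a minimal check of how the diacritics decompose into separate code points (the example word is arbitrary, not taken from our dataset):

word = "غَفَرَ"  # arbitrary diacritized example word
for ch in word:
    print(f"U+{ord(ch):04X}", ch)
# U+063A  ->  ARABIC LETTER GHAIN (base letter)
# U+064E  ->  ARABIC FATHA (combining diacritic, its own code point)
# ... and so on for the remaining letters and diacritics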

I am able to generate a SPE-based tokenizer using our diacritized vocabulary. I can confirm that the generated document.txt and vocab.txt for the tokenizer have Arabic text represented correctly. However, somewhere during training the tokenizer fails to decode the text properly.

This is an example output from the WER metric:

[NeMo I 2022-03-04 09:48:36 wer_bpe:204] reference:غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇
[NeMo I 2022-03-04 09:48:36 wer_bpe:205] predicted:جعلنا غبعوثون الض تجري غ ض غ استكبروا غ استكبرواࣲ غ وللكافرين إ وإليه الض قلتم استكبروابعوثون غ وإلى أظلم استكبروا س استكبروا غ عنهم أحسن استكبروا جعلنا الض الجاه ]
تجري الهمد للذين غ استكبروا القرآن غر جعلنا غ رؤوس أصببعوثون ضآيات شديد شديد وإلى غࣲ استكبروا غ استكبروا غ جعلنا غ أحسن الض ألفافا أنفسكم والسماء آتينا لله غ ض آتينا غ جعلنا الض غ قلتم آتينا الجاه تجري غ وأخ جعلنا ]
⁇  ألفافا جعلنا غإ غ استكبروا وأخ وإلى وأخ غ إ غ جعلنا تجري وإلى استكبروابعوثون

Notice two things here:

  1. The reference text is only two characters: ⁇ and غ.
  2. The predicted text does not have any diacritics (despite the tokenizer and vocab having diacritics). After a few epochs, the model "converges" and predicts only those two characters. Basically the loss goes to 0 in ~1k steps.
[NeMo I 2022-03-04 09:49:46 wer_bpe:204] reference:غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇
[NeMo I 2022-03-04 09:49:46 wer_bpe:205] predicted:غ ⁇  غ ⁇  جعلنا جعلنا ⁇  وإلى غ ⁇  ⁇  غ ⁇  غ ⁇  ⁇  غ ⁇  جعلنا عليها ⁇  ألفافا جعلنا ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  لله بأفواه

If I train a model without diacritics, the same behavior occurs. The last working NeMo version was 1.4.0 (I will test with 1.5.0). Previously, the model would converge at a reasonable rate and achieved good results (~4% WER).

Steps/Code to reproduce bug

There's nothing really special about my training script; it's taken from the examples. The model is fine-tuned from the stt_en_citrinet_1024 model, and cfg.model_target is nemo.collections.asr.models.EncDecCTCModelBPE.

from omegaconf import OmegaConf
import pytorch_lightning as pl
import wandb

from nemo.collections.asr.models import ASRModel
from nemo.core.config import hydra_runner
from nemo.utils import logging, model_utils
from nemo.utils.exp_manager import exp_manager

@hydra_runner(config_path="conf/citrinet/", config_name="config")
def main(cfg):
    # Setup trainer and exp. manager
    trainer = pl.Trainer(**cfg.trainer)
    log_dir = exp_manager(trainer, cfg.get("exp_manager", None))
    # Setup Model
    model_class = model_utils.import_class_by_path(cfg.model_target)  # type: ASRModel
    asr_model = model_class.from_pretrained(model_name=cfg.init_from_pretrained_model)
    asr_model.cfg = cfg.model
    asr_model.set_trainer(trainer)
    asr_model.setup_training_data(cfg.model.train_ds)
    asr_model.setup_multiple_validation_data(cfg.model.validation_ds)
    asr_model.setup_optimization(cfg.model.optim)
    # Setup Augmentation
    asr_model.spec_augmentation = asr_model.from_config_dict(cfg.model.spec_augment)
    # Change vocab
    asr_model.change_vocabulary(
        new_tokenizer_dir=cfg.model.tokenizer.dir,
        new_tokenizer_type=cfg.model.tokenizer.type
    )
    trainer.fit(asr_model)

if __name__ == '__main__':
    main()

The tokenizer is generated using process_asr_text_tokenizer.py:

python process_asr_text_tokenizer.py --manifest=<path to train manifest files, separated by commas> \
         --data_root=tokenizers \
         --vocab_size=1024 \
         --tokenizer=spe \
         --log

Expected behavior

The model should be able to converge and predict Arabic text accurately.

Environment overview (please complete the following information)

Dockerfile:

FROM nvcr.io/nvidia/pytorch:21.10-py3

ARG DEBIAN_FRONTEND=noninteractive

COPY ./requirements.txt requirements.txt 

RUN apt update && \
    apt install -y ffmpeg libsndfile1 && \
    python3 -m pip install --upgrade pip && \
    python3 -m pip install -r requirements.txt

The container is run with:

docker run --rm -it --gpus all --shm-size 64G --ipc=host --env-file .env -v /home/$USER:/home/$USER train

Additional context

GPU: 8xV100 (AWS p3dn.24xlarge)

titu1994 commented 2 years ago

Thanks for the detailed info. The fact that even the ground truth cannot be properly tokenized points to an issue with the tokenizer, so let me get some additional info - you are using a SentencePiece tokenizer with a vocab size of 1024, but your base character set - i.e., the total unique characters in the dataset - is 10878; that should throw an error in SentencePiece itself during tokenizer construction. There is only one possible way to allow vocab size < base vocab size, and it is disabled by default (https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py#L58), but this has its limits too - it can't go below 80% coverage.

You should also pass the --no_lower_case flag to prevent NFKD normalization by SentencePiece (it is enabled by default in SentencePiece, so we use that default too).

Now, the next test is to check whether the tokenizer is working properly - load the model with the tokenizer, then use model.tokenizer.text_to_ids() to get subword IDs, then use ...ids_to_text() to get back text, and assert that the ground truth and this output match. If so, then the tokenizer is working. If not, it means the tokenizer construction was incorrect.
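
Something along these lines, as a minimal sketch (the checkpoint and manifest paths are placeholders):

import json

from nemo.collections.asr.models import ASRModel

asr_model = ASRModel.restore_from("path/to/your_model.nemo")  # placeholder path

with open("path/to/train_manifest.json", "r", encoding="utf-8") as f:  # placeholder path
    for line in f:
        text = json.loads(line)["text"]
        ids = asr_model.tokenizer.text_to_ids(text)
        decoded = asr_model.tokenizer.ids_to_text(ids)
        # If this fails, tokenizer construction (or its character coverage) is the problem.
        assert decoded == text, f"round-trip mismatch: {text!r} -> {decoded!r}"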

Next, if the above works, then perhaps the data loader is not properly handling the characters (which doesn't make sense given the vocabulary is derived in Unicode by SPE itself, but we can look into it if that's the source).

piraka9011 commented 2 years ago

The total unique characters in the dataset is 10878

No, there are 46 characters (36 letters and 10 diacritics). There are 10,878 unique words.

Now, the next test is to check if the tokenizer is working properly

This makes sense, will test and report back. Will also regenerate the tokenizer with the --no_lower_case flag.

...perhaps the data loader is not properly handling the characters

Agree this doesn't make sense, but let's see once I try out the tokenizer approach.

piraka9011 commented 2 years ago

Generated a new tokenizer, explicitly passing --no_lower_case:

python process_asr_text_tokenizer.py --manifest=<path to train manifest files, separated by commas> \
         --data_root=tokenizers/TLDv1 \
         --vocab_size=1024 \
         --tokenizer=spe \
         --no_lower_case \
         --log

Tested the tokenizer on some text:

s1 = "ماذا يفعَل المُخبرُ"
ids = tokenizer.text_to_ids(s1)
text = tokenizer.ids_to_text(ids)
text == s1
>>> True
s2 = "سَيَقولُ السُفَهاءُ مِنَ النَاسِ"
ids = tokenizer.text_to_ids(s2)
text = tokenizer.ids_to_text(ids)
text == s2
>>> True

I tested out-of-alphabet letters (e.g. English) and the tokenizer returned the unknown ( ⁇ ) symbol. I also tested using a Buckwalter-transliterated manifest and the tokenizer worked fine there as well. FYI, I used the lang_trans package to generate the Buckwalter manifest.

piraka9011 commented 2 years ago

FYI, I just tried training a QuartzNet Large (15x5) model and was able to get reasonable results within the first few epochs.

[NeMo I 2022-03-06 07:51:28 wer:226] reference:مَا ضَلَّ صَاحِبُكُمْ وَمَا غَوَى
[NeMo I 2022-03-06 07:51:28 wer:227] predicted:مَاظَلَ صَاحِبُكُمْ وَمَارُوَا
...
[NeMo I 2022-03-06 07:51:54 wer:226] reference:وَيْلٌ يَوْمَئِذٍ لِلْمُكَذِّبِينَ
[NeMo I 2022-03-06 07:51:54 wer:227] predicted:وَيْلُ ل يَهُمَإِدٍلِالْمُكَذِّبِينَ

Not sure if it's noticeable but the predicted text is close to the reference text (within ~3k steps/2nd Epoch).

titu1994 commented 2 years ago

Can you try training a model with the new tokenizer?

piraka9011 commented 2 years ago

I did and I get the same issue with the unk (??) characters :/

Here's an example prediction of a model trained for 26 epochs (~50k steps)

"مَبمَ ⁇ وَ ⁇ مَ ⁇ ز ⁇ سُمَ وَيَوْمَُ ⁇ بنمَ ⁇ نَ كب وَلََس وَمَ ⁇ زبِي ⁇ َهِإمَ ⁇ نَ كبسَز ⁇ مَ ⁇ ز ⁇ نَُ"

FWIW, here's a Gradio demo of a model trained w/ the tokenizer and a previous model we trained that was working fine: https://44250.gradio.app/

Here are some sample audio files. It only accepts WAV files with a 16 kHz sampling rate.

Here are the W&B logs for the model trained with the new tokenizer.

(Screenshot of the W&B training run attached.)

titu1994 commented 2 years ago

Can you provide your tokenizer dir, plus a few audio files which show ?? inside the reference text during training? I will try to debug this.

Just to note, are you writing the manifest file with UTF-8 encoding? Your manifest file should then have several \u... escapes in the text field rather than the actual text.

Also, can you try to force the locale via LC_ALL=en_US.UTF-8 python ... and see if it changes anything?

piraka9011 commented 2 years ago

Are you writing the manifest file with encoding utf8?

I am using UTF-8 encoding, but the text appears in its "normal" form (not encoded with the unicode escape sequence \u...). This is how I write my manifest:

import json

with open(manifest_file, "w", encoding="utf8") as fd:
    for sample in samples:
        fd.write(f"{json.dumps(sample, ensure_ascii=False)}\n")

ensure_ascii=False is what prevents the text from being written with unicode escape characters.
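
For example (a quick sanity check; the sample text is just illustrative):

import json

sample = {"text": "غَفَرَ"}  # illustrative diacritized text

print(json.dumps(sample, ensure_ascii=True))
# {"text": "\u063a\u064e\u0641\u064e\u0631\u064e"}
print(json.dumps(sample, ensure_ascii=False))
# {"text": "غَفَرَ"}

# Either form round-trips back to the same Python string via json.loads:
assert json.loads(json.dumps(sample, ensure_ascii=True)) == sample
assert json.loads(json.dumps(sample, ensure_ascii=False)) == sample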

I have uploaded the tokenizer we used to the previous link with the sample files (TLDv1.zip), and will try generating a tokenizer/manifest with unicode-escaped characters.

Also, can you try to force locale via LC_ALL=en_US.UTF-8...

Sure, will give this a shot too.

piraka9011 commented 2 years ago

I still get the same issue even when setting ensure_ascii=True when creating the manifest and setting LC_ALL=en_US.UTF-8 as an env variable during training.

The training manifest has unicode-escaped text, but the process_asr_text_tokenizer.py script eventually decodes the characters back into Arabic, which can be seen in the tokenizer's tokenizer.vocab file.

Any other suggestions here?

titu1994 commented 2 years ago

Ah OK, I was about to suggest using ensure_ascii=True, but it still seems to not work? BTW, process_asr_text_tokenizer will use your cached text; it will not recompute the text from the manifest if it detects that the text file already exists, so I hope you are deleting the raw text in that dir before subsequent retries.

I'm actually out of ideas. Sorry about this, but could you retrace the steps - delete the entire tokenizer dir (with text + tokenizers), recreate the manifest with ensure_ascii=True, recreate the tokenizer using the process script, and retrain for a few epochs to see what's going on.

I tried the few audio files you sent in that drive to overfit on them and see whether the ground truth gets corrupted or not.

titu1994 commented 2 years ago

Interestingly, if I simply copy-paste the manifest file here and try to train a model based off of it, NeMo immediately crashes with a JSON decoding error, as expected.

Save the following content with utf-8 encoding -

{"audio_filepath": "2_21_wa436789.wav", "duration": 18.71, "text": "يا أيها الناس اعبدوا ربكم الذي خلقكم والذين من قبلكم لعلكم تتقون"}
{"audio_filepath": "20_1_ytkurdi.wav", "text": "طه ما أنزلنا عليك القرآن لتشقى إلا تذكرة لمن يخشى تنزيلا ممن خلق الأرض والسماوات العلا الرحمن على العرش استوى", "duration": 29.01}
{"audio_filepath": "36_1_ytnoreen.wav", "text": "يس والقرآن الحكيم إنك لمن المرسلين على صراط مستقيم ", "duration": 15.48}

then try

import json

def read_manifest(path):
    manifest = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            print(line)
            data = json.loads(line)
            manifest.append(data)
    return manifest

throws the error (both here and in the NeMo training script) -

File "/home/smajumdar/PycharmProjects/nemo-eval/finetuning/challenges/gram_vaani/notebooks/check_manifest_temp.py", line 102, in main
    manifest = read_manifest(manifest_path)
  File "/home/smajumdar/PycharmProjects/nemo-eval/finetuning/challenges/gram_vaani/notebooks/check_manifest_temp.py", line 18, in read_manifest
    data = json.loads(line, )
  File "/home/smajumdar/anaconda3/envs/NeMo/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/smajumdar/anaconda3/envs/NeMo/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/smajumdar/anaconda3/envs/NeMo/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)

So how were you able to train with such an encoding? The only way I could get it to work was when I used ensure_ascii=True so it wrote explicit \u... tokens; then NeMo can load the manifest and start training.

titu1994 commented 2 years ago

I was able to train the model on those 3 files just as always. Here's the output after 1500 updates (train loss is ~0). The model overfit (22M params, 3 audio files), but the ground truth is correct and there are no ?? marks during training or inference -

[NeMo I 2022-03-07 15:56:16 rnnt_wer_bpe:232] reference :يا أيها الناس اعبدوا ربكم الذي خلقكم والذين من قبلكم لعلكم تتقون
[NeMo I 2022-03-07 15:56:16 rnnt_wer_bpe:233] predicted :يا أيها الناس اعبدوا ربكم الذي خلقكم والذين من ق

[NeMo I 2022-03-07 15:56:16 rnnt_wer_bpe:232] reference :طه ما أنزلنا عليك القرآن لتشقى إلا تذكرة لمن يخشى تنزيلا ممن خلق الأرض والسماوات العلا الرحمن على العرش استوى
[NeMo I 2022-03-07 15:56:16 rnnt_wer_bpe:233] predicted :طه ما أنزلنا عليك القرآن لتشقى إلا تذكرة لمن يخشى تنزيلا ممن العلا الرحمن على العرش استوى

[NeMo I 2022-03-07 15:56:16 rnnt_wer_bpe:232] reference :يس والقرآن الحكيم إنك لمن المرسلين على صراط مستقيم
[NeMo I 2022-03-07 15:56:16 rnnt_wer_bpe:233] predicted :يس والقرآن الحكيم إنك لمن المرسلين على صراط مستقيم

Edit: Nvm, it converged to perfection after a few more steps. So I am wondering if you need to redo the manifest creation + tokenizer construction.

[NeMo I 2022-03-07 16:03:00 rnnt_wer_bpe:232] reference :يا أيها الناس اعبدوا ربكم الذي خلقكم والذين من قبلكم لعلكم تتقون
[NeMo I 2022-03-07 16:03:00 rnnt_wer_bpe:233] predicted :يا أيها الناس اعبدوا ربكم الذي خلقكم والذين من قبلكم لعلكم تتقون

[NeMo I 2022-03-07 16:03:00 rnnt_wer_bpe:232] reference :طه ما أنزلنا عليك القرآن لتشقى إلا تذكرة لمن يخشى تنزيلا ممن خلق الأرض والسماوات العلا الرحمن على العرش استوى
[NeMo I 2022-03-07 16:03:00 rnnt_wer_bpe:233] predicted :طه ما أنزلنا عليك القرآن لتشقى إلا تذكرة لمن يخشى تنزيلا ممن خلق الأرض والسماوات العلا الرحمن على العرش استوى

[NeMo I 2022-03-07 16:03:01 rnnt_wer_bpe:232] reference :يس والقرآن الحكيم إنك لمن المرسلين على صراط مستقيم
[NeMo I 2022-03-07 16:03:01 rnnt_wer_bpe:233] predicted :يس والقرآن الحكيم إنك لمن المرسلين على صراط مستقيم

Epoch 1899: 100%|██████████████████████████████████| 4/4 [00:02<00:00,  1.43it/s, loss=0.00301, v_num=1-38
Epoch 1899, global step 1899: val_wer reached 0.00000 (best 0.00000), ...

piraka9011 commented 2 years ago

Very interesting... This is good, but the text doesn't have diacritics 😅 To confirm, you are using the same configuration/tokenizer creation process I described earlier? Here's the same manifest, diacritized:

{"audio_filepath": "2_21_wa436789.wav", "duration": 18.71, "text": "يَا أَيُّهَا النَّاسُ اعْبُدُوا رَبَّكُمْ الَّذِي خَلَقَكُمْ وَاَلَّذِينَ مِنْ قَبْلِكُمْ لَعَلَّكُمْ تَتَّقُونَ"}
{"audio_filepath": "20_1_ytkurdi.wav", "text": "طه مَا أَنْزَلْنَا عَلَيْك الْقُرْآنَ لَتُشْقَّى إلَّا تَذْكِرَةً لِمَنْ يَخْشَى تَنْزِيلًا مِمَّنْ خَلَقَ الْأَرْضَ وَالسَّمَاوَاتِ الْعِلا الرَّحْمَنَ عَلَى الْعَرْشِ اسْتَوَى", "duration": 29.01}
{"audio_filepath": "36_1_ytnoreen.wav", "text": "يس وَالْقُرْآنُ الْحَكِيمُ إنَّك لِمِنْ الْمُرْسَلِينَ عَلَى صِرَاطٍ مُسْتَقِيمٍ", "duration": 15.48}

Let me try reproducing your results using this manifest as well.

Also, it seems like you are using Conformer RNNT and not Citrinet. Is that intended?

piraka9011 commented 2 years ago

Re: the JSONDecodeError, I had to make sure there was no extra line at the end of the manifest file.
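
A slightly more defensive version of the read_manifest sketch above that simply skips blank lines:

import json

def read_manifest(path):
    manifest = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                # skip blank/trailing lines instead of raising JSONDecodeError
                continue
            manifest.append(json.loads(line))
    return manifest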

titu1994 commented 2 years ago

This is good, but the text doesn't have diacritics

I utilized the manifest you provided - I don't know enough about the language to be able to discern diacritics. If you have audio files with text that does have diacritics (as provided in your update), I'll redo the experiment and report back tomorrow.

The encoder is not the issue; it has practically no interaction with the text - it would be at most the prediction network or the data loader / tokenizer that is causing the issue. I can redo this with a Citrinet CTC too; it would just take longer to converge, and I won't be able to cleanly separate the encoder's interaction with the text.

Thanks, I was able to get the JSONDecodeError fixed with your suggestion. Training a new model with that manifest shows nearly the same result, convergence in around 1750 update steps.

piraka9011 commented 2 years ago

Training a new model with that manifest shows nearly the same result, convergence in around 1750 update steps.

Ok let me try reproducing using this minimal manifest. I can't seem to train a CitriNet model as I get the following error:

Traceback (most recent call last):
  File "train.py", line 66, in main
    trainer.fit(asr_model)
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1311, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1375, in _run_sanity_check
    self._evaluation_loop.run()
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 122, in advance
    output = self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 217, in _evaluation_step
    output = self.trainer.accelerator.validation_step(step_kwargs)
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 239, in validation_step
    return self.training_type_plugin.validation_step(*step_kwargs.values())
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 444, in validation_step
    return self.model(*args, **kwargs)
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 92, in forward
    output = self.module.validation_step(*inputs, **kwargs)
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/nemo/collections/asr/models/ctc_models.py", line 650, in validation_step
    self._wer.update(
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/torchmetrics/metric.py", line 263, in wrapped_func
    return update(*args, **kwargs)
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/nemo/collections/asr/metrics/wer_bpe.py", line 195, in update
    reference = self.decode_tokens_to_str(target)
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/nemo/collections/asr/metrics/wer_bpe.py", line 148, in decode_tokens_to_str
    hypothesis = self.tokenizer.ids_to_text(tokens)
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/nemo/collections/common/tokenizers/sentencepiece_tokenizer.py", line 142, in ids_to_text
    return self.tokenizer.decode_ids(ids)
  File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/sentencepiece/__init__.py", line 174, in DecodeIdsWithCheck
    return _sentencepiece.SentencePieceProcessor_DecodeIdsWithCheck(self, ids)
IndexError: Out of range: piece id is out of range.

Have you seen this before/any suggestions to debug?

I tried recreating the tokenizer with a smaller vocab size (1024 -> 48) and it still didn't work:

python process_asr_text_tokenizer.py --manifest /audio-data/debug-nemo/manifest.json --data_root tokenizers/debug-nemo --vocab_size 48 --tokenizer spe --spe_type unigram --no_lower_case --log

Edit: did some debugging, and the ids appear to be similar to the behavior mentioned earlier.

[89, 0, 89, 0, 89, 0, 89, 0, 89, 0, 89, 0, 89, 0, 89, 0, 89, 0, 89, 0, 89, 0, 89, 0]

I'm starting to think this is an issue w/ Sentencepiece and/or maybe torch...

titu1994 commented 2 years ago

This is the manifest after loading it and writing it to another file with the ensure_ascii=True flag for json.dumps():

{"audio_filepath": "2_21_wa436789.wav", "duration": 18.71, "text": "\u064a\u064e\u0627 \u0623\u064e\u064a\u064f\u0651\u0647\u064e\u0627 \u0627\u0644\u0646\u064e\u0651\u0627\u0633\u064f \u0627\u0639\u0652\u0628\u064f\u062f\u064f\u0648\u0627 \u0631\u064e\u0628\u064e\u0651\u0643\u064f\u0645\u0652 \u0627\u0644\u064e\u0651\u0630\u0650\u064a \u062e\u064e\u0644\u064e\u0642\u064e\u0643\u064f\u0645\u0652 \u0648\u064e\u0627\u064e\u0644\u064e\u0651\u0630\u0650\u064a\u0646\u064e \u0645\u0650\u0646\u0652 \u0642\u064e\u0628\u0652\u0644\u0650\u0643\u064f\u0645\u0652 \u0644\u064e\u0639\u064e\u0644\u064e\u0651\u0643\u064f\u0645\u0652 \u062a\u064e\u062a\u064e\u0651\u0642\u064f\u0648\u0646\u064e"}
{"audio_filepath": "20_1_ytkurdi.wav", "text": "\u0637\u0647 \u0645\u064e\u0627 \u0623\u064e\u0646\u0652\u0632\u064e\u0644\u0652\u0646\u064e\u0627 \u0639\u064e\u0644\u064e\u064a\u0652\u0643 \u0627\u0644\u0652\u0642\u064f\u0631\u0652\u0622\u0646\u064e \u0644\u064e\u062a\u064f\u0634\u0652\u0642\u064e\u0651\u0649 \u0625\u0644\u064e\u0651\u0627 \u062a\u064e\u0630\u0652\u0643\u0650\u0631\u064e\u0629\u064b \u0644\u0650\u0645\u064e\u0646\u0652 \u064a\u064e\u062e\u0652\u0634\u064e\u0649 \u062a\u064e\u0646\u0652\u0632\u0650\u064a\u0644\u064b\u0627 \u0645\u0650\u0645\u064e\u0651\u0646\u0652 \u062e\u064e\u0644\u064e\u0642\u064e \u0627\u0644\u0652\u0623\u064e\u0631\u0652\u0636\u064e \u0648\u064e\u0627\u0644\u0633\u064e\u0651\u0645\u064e\u0627\u0648\u064e\u0627\u062a\u0650 \u0627\u0644\u0652\u0639\u0650\u0644\u0627 \u0627\u0644\u0631\u064e\u0651\u062d\u0652\u0645\u064e\u0646\u064e \u0639\u064e\u0644\u064e\u0649 \u0627\u0644\u0652\u0639\u064e\u0631\u0652\u0634\u0650 \u0627\u0633\u0652\u062a\u064e\u0648\u064e\u0649", "duration": 29.01}
{"audio_filepath": "36_1_ytnoreen.wav", "text": "\u064a\u0633 \u0648\u064e\u0627\u0644\u0652\u0642\u064f\u0631\u0652\u0622\u0646\u064f \u0627\u0644\u0652\u062d\u064e\u0643\u0650\u064a\u0645\u064f \u0625\u0646\u064e\u0651\u0643 \u0644\u0650\u0645\u0650\u0646\u0652 \u0627\u0644\u0652\u0645\u064f\u0631\u0652\u0633\u064e\u0644\u0650\u064a\u0646\u064e \u0639\u064e\u0644\u064e\u0649 \u0635\u0650\u0631\u064e\u0627\u0637\u064d \u0645\u064f\u0633\u0652\u062a\u064e\u0642\u0650\u064a\u0645\u064d", "duration": 15.48}

Training seems to progress fine, converging after 1100 steps -


[NeMo I 2022-03-07 17:54:39 rnnt_wer_bpe:232] reference :يَا أَيُّهَا النَّاسُ اعْبُدُوا رَبَّكُمْ الَّذِي خَلَقَكُمْ وَاَلَّذِينَ مِنْ قَبْلِكُمْ لَعَلَّكُمْ تَتَّقُونَ
[NeMo I 2022-03-07 17:54:39 rnnt_wer_bpe:233] predicted :يَا أَيُّهَا النَّاسُ اعْبُدُوا رَبَّكُمْ الَّذِي خَلَقَكُمْ وَاَلَّذِينَ مِنْ قَبْلِكُمْ لَعَلَّكُمْ تَتَّقُونَ

[NeMo I 2022-03-07 17:54:40 rnnt_wer_bpe:232] reference :طه مَا أَنْزَلْنَا عَلَيْك الْقُرْآنَ لَتُشْقَّى إلَّا تَذْكِرَةً لِمَنْ يَخْشَى تَنْزِيلًا مِمَّنْ خَلَقَ الْأَرْضَ وَالسَّمَاوَاتِ الْعِلا الرَّحْمَنَ عَلَى الْعَرْشِ اسْتَوَى
[NeMo I 2022-03-07 17:54:40 rnnt_wer_bpe:233] predicted :طه مَا أَنْزَلْنَا عَلَيْك الْقُرْآنَ لَتُشْقَّى إلَّا تَذْكِرَةً لِمَنْ يَخْشَى تَنْزِيلًا مِمَّنْ خَلَقَ الْأَرْضَ وَالسَّمَاوَاتِ الْعِلا الرَّحْمَنَ عَلَى الْعَرْشِ اسْتَوَى

[NeMo I 2022-03-07 17:54:40 rnnt_wer_bpe:232] reference :يس وَالْقُرْآنُ الْحَكِيمُ إنَّك لِمِنْ الْمُرْسَلِينَ عَلَى صِرَاطٍ مُسْتَقِيمٍ
[NeMo I 2022-03-07 17:54:40 rnnt_wer_bpe:233] predicted :يس وَالْقُرْآنُ الْحَكِيمُ إنَّك لِمِنْ الْمُرْسَلِينَ عَلَى صِرَاطٍ مُسْتَقِيمٍ

Epoch 1099: 100%|█████████| 4/4 [00:02<00:00,  1.44it/s, loss=0.157, v_num=6-42
Epoch 1099, global step 1099: val_wer reached 0.00000

The error above occurs when SentencePiece tries to detokenize a subword with a wrong vocabulary index - for example, using a CTC/RNNT decoder with a 1024+1 vocab size together with a SentencePiece model constructed with only 48 tokens will raise this error. I am using the tokenizer you provided, without reconstructing it from text. Simply doing that, it seems to work well; the WER is dropping correctly and quickly.
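
A rough way to check for that mismatch (a sketch; the checkpoint path is a placeholder, and attribute names may differ slightly across NeMo versions and model types):

from nemo.collections.asr.models import ASRModel

asr_model = ASRModel.restore_from("path/to/model.nemo")  # placeholder path

spe_size = asr_model.tokenizer.vocab_size          # size of the SentencePiece vocabulary
decoder_size = len(asr_model.decoder.vocabulary)   # CTC decoder output classes (blank excluded)
print(f"tokenizer vocab: {spe_size}, decoder vocab: {decoder_size}")
# If the decoder still expects a 1024-token tokenizer while the model now carries a
# 48-token one, decoding emitted IDs raises "piece id is out of range".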

Attached are the model config + train script (rename away the .txt extension): conformer_transducer_bpe.yaml.txt, speech_to_text_rnnt_bpe.py.txt

piraka9011 commented 2 years ago

I think I found the issue...

If I do not initialize the model from a pre-trained model (ex. stt_en_citrinet_1024) then I am able to see the reference/predicted text accurately and the model overfits on the 3-file manifest. 🎉

However, in order for us to get good results on our dataset, we need to finetune from a pre-trained model.

It seems something must have changed with the model hosted over at NGC, which is why we suddenly could no longer reproduce our previous results.

piraka9011 commented 2 years ago

If my observation is true, is it possible to use an old version of stt_en_citrinet_1024? If not, which language do you recommend we fine-tune from?

I am aware that the Riva team is exploring training Arabic models (and we're working on preparing some public datasets for use), but that probably won't be for a few months I'm guessing.

titu1994 commented 2 years ago

All NGC versions are listed on the NGC page of the model - you can manually download any .nemo file and load that for fine-tuning.

I don't think we have updated that model card recently. And even if we did, its vocabulary would update with the change_vocabulary() step. So why would the model's predictions be ?? during training?

titu1994 commented 2 years ago

Oh wait... I've just realized your dataset setup comes before your tokenizer setup. That would explain what's happening -

1. Your text gets tokenized by the model's old internal tokenizer, before the change. Those token IDs get preserved.
2. You change the tokenizer; the old subwords no longer map to the new IDs, and the pre-tokenized text no longer corresponds to the new tokenizer's IDs.
3. Your vocab and everything else update to the new tokenizer, but your pre-tokenized text still consists of old-tokenizer IDs, so when you decode it, it maps to random subwords.

Try moving the change_vocabulary() call to right after the trainer setup.
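
Roughly, applied to the training script from the bug report, that reordering would look like this (a sketch, not the exact script used):

import pytorch_lightning as pl

from nemo.collections.asr.models import ASRModel
from nemo.core.config import hydra_runner
from nemo.utils import model_utils
from nemo.utils.exp_manager import exp_manager

@hydra_runner(config_path="conf/citrinet/", config_name="config")
def main(cfg):
    # Setup trainer and exp. manager
    trainer = pl.Trainer(**cfg.trainer)
    exp_manager(trainer, cfg.get("exp_manager", None))
    # Setup model
    model_class = model_utils.import_class_by_path(cfg.model_target)  # type: ASRModel
    asr_model = model_class.from_pretrained(model_name=cfg.init_from_pretrained_model)
    asr_model.cfg = cfg.model
    asr_model.set_trainer(trainer)
    # Change vocab BEFORE building the data loaders, so the datasets are tokenized
    # with the new tokenizer rather than the pretrained model's old one
    asr_model.change_vocabulary(
        new_tokenizer_dir=cfg.model.tokenizer.dir,
        new_tokenizer_type=cfg.model.tokenizer.type
    )
    # Now set up data, optimization, and augmentation
    asr_model.setup_training_data(cfg.model.train_ds)
    asr_model.setup_multiple_validation_data(cfg.model.validation_ds)
    asr_model.setup_optimization(cfg.model.optim)
    asr_model.spec_augmentation = asr_model.from_config_dict(cfg.model.spec_augment)
    trainer.fit(asr_model)

if __name__ == '__main__':
    main()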

piraka9011 commented 2 years ago

Try moving the change_vocabulary() call to right after the trainer setup.

Wow... that is so nuanced! And it fixed the issue 🎉 I don't think it would have been possible to guess that this was the issue without insight into NeMo's internals.

This really needs to be documented/commented/highlighted somewhere... anywhere. At least in the cross-language fine-tuning tutorial or something.

Thank you so much for helping me debug this @titu1994! Appreciate the time spent going through this.

titu1994 commented 2 years ago

Agreed that we need to improve documentation; such issues are subtle and incredibly hard to debug. There's no way to even test this easily in order to raise an error.

titu1994 commented 2 years ago

Would you comment on whether such an execution diagram would avoid such mistakes in the future? https://github.com/NVIDIA/NeMo/pull/3812

To check the result you need to visit the branch - https://github.com/titu1994/NeMo/tree/high_level_asr_diag/examples/asr/asr_ctc

itzsimpl commented 2 years ago

@titu1994 a diagram like that is of course helpful. I keep wondering, though, about the following: say one uses just the YAML config file, provides their own train, validation, and test datasets and tokenizer, and also sets the YAML config parameter init_from_pretrained or init_from_nemo_model. What will happen in that case - will the order of execution be correct?

I'm asking this because, in the case of the Arabic model initialized from English, the mis-ordering led to UNK symbols. However, I would assume that in the case of some other Latin-character-based model, the "error", as you said, could be much more subtle and hard to spot (resulting in just a higher WER than optimal, lost/un-transcribed words, or similar).

piraka9011 commented 2 years ago

Ditto, this diagram is helpful. Will leave comments on the PR.

If possible, updating the ASR_CTC_Language_Finetuning tutorial would also be helpful. Just a sentence in the "Update the vocabulary" section saying something along the lines of "change_vocabulary() must be called before setting up your datasets, otherwise you might get decoding errors."

titu1994 commented 2 years ago

@piraka9011 In general, I'll add a note to the very top of the notebook to review the execution flow diagram before starting fine-tuning.

titu1994 commented 2 years ago

@itzsimpl The ordering given there is what happens inside the constructor of the NeMo model itself. When you use init_from_pretrained, it will first build your model with your config, data loaders, etc., and only then load the PyTorch checkpoint weights from the older model into your already-initialized model.

The older model's tokenizer and dataloaders are not used at all; only its weights are copied into your new model.