Closed piraka9011 closed 2 years ago
Thanks for the detailed info. The fact that even the ground truth cannot be tokenized properly points to an issue with the tokenizer, so let me get some additional info. You are using a Sentencepiece tokenizer with a vocab size of 1024, but your base character set (i.e., the total number of unique characters in the dataset) is 10878. That should throw an error in Sentencepiece itself during tokenizer construction. There is only one way to allow vocab size < base vocab size, and it is disabled by default (https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py#L58), but this has its limits too: it can't go below 80% coverage.
You should also pass the --no_lower_case flag to prevent NFKC normalization by Sentencepiece (it is enabled by default in Sentencepiece, so we use that default too).
Now, the next test is to check if the tokenizer is working properly: load the model with the tokenizer, use model.tokenizer.text_to_ids() to get subword IDs, then use ...ids_to_text() to get the text back, and assert that the ground truth and this output match. If so, the tokenizer is working. If not, it means tokenizer construction was incorrect.
Next if the above works, then perhaps the data loader is not properly handling the characters (which doesn't make sense given the vocabulary is derived in Unicode from SPE itself, but we can look into it if that's the source)
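The round-trip check described above could be sketched roughly as follows. CharTokenizer is a hypothetical character-level stand-in so the snippet runs without NeMo; with a real model you would pass model.tokenizer instead, assuming it exposes the same text_to_ids() / ids_to_text() methods mentioned in this thread.

```python
class CharTokenizer:
    """Toy stand-in for a subword tokenizer (hypothetical, for illustration only)."""

    def __init__(self, corpus):
        # ID 0 is reserved for characters that are not in the vocabulary.
        self.vocab = ["<unk>"] + sorted(set("".join(corpus)))
        self.char_to_id = {ch: i for i, ch in enumerate(self.vocab)}

    def text_to_ids(self, text):
        return [self.char_to_id.get(ch, 0) for ch in text]

    def ids_to_text(self, ids):
        return "".join(self.vocab[i] for i in ids)


def failed_roundtrips(tokenizer, texts):
    """Return the texts that do not survive text -> ids -> text unchanged."""
    return [t for t in texts if tokenizer.ids_to_text(tokenizer.text_to_ids(t)) != t]


corpus = ["ماذا يفعَل المُخبرُ"]
tok = CharTokenizer(corpus)

# In-vocabulary text round-trips; out-of-alphabet text collapses to <unk>.
print(failed_roundtrips(tok, corpus + ["hello"]))  # ['hello']
```

Running this over every "text" entry of the training manifest, rather than a couple of hand-picked strings, would surface any sentence the tokenizer cannot reproduce.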
The total unique characters in the dataset is 10878
No, there are 46 characters (36 letters and 10 diacritics). There are 10,878 unique words.
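Since the two numbers are easy to mix up, here is a minimal sketch of how both could be counted from a manifest. It assumes one JSON object per line with a "text" field, as in the manifests shown in this thread; the function name and the sample lines are made up for illustration.

```python
import json


def manifest_stats(lines):
    """Count unique non-space characters and unique words across manifest lines."""
    chars, words = set(), set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        text = json.loads(line)["text"]
        chars.update(ch for ch in text if not ch.isspace())
        words.update(text.split())
    return len(chars), len(words)


# Hypothetical sample entries, not taken from the real dataset.
sample = [
    '{"audio_filepath": "a.wav", "duration": 1.0, "text": "ab ba ab"}',
    '{"audio_filepath": "b.wav", "duration": 1.0, "text": "abc ca"}',
]
n_chars, n_words = manifest_stats(sample)
print(n_chars, n_words)  # 3 4
```

The character count is the number to compare against the Sentencepiece vocab size; the word count only says how much text there is to learn subwords from.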
Now, the next test is to check if the tokenizer is working properly
This makes sense, will test and report back. Will also regenerate the tokenizer with the --no_lower_case flag.
...perhaps the data loader is not properly handling the characters
Agree this doesn't make sense, but let's see once I try out the tokenizer approach.
Generated a new tokenizer explicitly passing --no_lower_case
python process_asr_text_tokenizer.py --manifest=<path to train manifest files, separated by commas> \
--data_root=tokenizers/TLDv1 \
--vocab_size=1024 \
--tokenizer=spe \
--no_lower_case \
--log
Tested the tokenizer on some text:
s1 = "ماذا يفعَل المُخبرُ"
ids = tokenizer.text_to_ids(s1)
text = tokenizer.ids_to_text(ids)
text == s1
>>> True
s2 = "سَيَقولُ السُفَهاءُ مِنَ النَاسِ"
ids = tokenizer.text_to_ids(s2)
text = tokenizer.ids_to_text(ids)
text == s2
>>> True
I tested out-of-alphabet letters (e.g. English) and the tokenizer returned the unknown symbol (⁇).
I also tested using a Buckwalter transliterated manifest and the tokenizer also worked fine.
FYI, I used the lang_trans package to generate the Buckwalter manifest.
FYI, I just tried training a QuartzNet Large (15x5) model and was able to get reasonable results within the first few epochs.
[NeMo I 2022-03-06 07:51:28 wer:226] reference:مَا ضَلَّ صَاحِبُكُمْ وَمَا غَوَى
[NeMo I 2022-03-06 07:51:28 wer:227] predicted:مَاظَلَ صَاحِبُكُمْ وَمَارُوَا
...
[NeMo I 2022-03-06 07:51:54 wer:226] reference:وَيْلٌ يَوْمَئِذٍ لِلْمُكَذِّبِينَ
[NeMo I 2022-03-06 07:51:54 wer:227] predicted:وَيْلُ ل يَهُمَإِدٍلِالْمُكَذِّبِينَ
Not sure if it's noticeable but the predicted text is close to the reference text (within ~3k steps/2nd Epoch).
And can you try with the new tokenizer to train a model ?
I did and I get the same issue with the unk (??) characters :/
Here's an example prediction of a model trained for 26 epochs (~50k steps)
"مَبمَ ⁇ وَ ⁇ مَ ⁇ ز ⁇ سُمَ وَيَوْمَُ ⁇ بنمَ ⁇ نَ كب وَلََس وَمَ ⁇ زبِي ⁇ َهِإمَ ⁇ نَ كبسَز ⁇ مَ ⁇ ز ⁇ نَُ"
FWIW, here's a Gradio demo of a model trained w/ the tokenizer and a previous model we trained that was working fine: https://44250.gradio.app/
Here's some sample audio files. It only accepts WAV files w/ 16kHz sampling rate.
Here are the W&B logs for the model trained with the new tokenizer.
Can you provide your tokenizer dir, plus a few audio files which show ?? inside of the reference text during training? I will try to debug this.
Just to note, are you writing the manifest file with encoding utf8? Your manifest file should then have several \u... sequences in the text field rather than the actual text.
Also, can you try to force the locale via LC_ALL=en_US.UTF-8 python ... and see if it changes anything?
Are you writing the manifest file with encoding utf8?
I am using utf8 encoding, but the text appears in its "normal" form (not encoded with the unicode escape symbol \u...).
This is how I write my manifest:
import json

with open(manifest_file, "w", encoding="utf8") as fd:
    for sample in samples:
        # One JSON object per line; ensure_ascii=False keeps the Arabic text unescaped.
        fd.write(f"{json.dumps(sample, ensure_ascii=False)}\n")
ensure_ascii=False is what prevents the text from being written with unicode escape characters.
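For reference, a small sketch of what the flag changes, using only the standard json module. The sample entry reuses a file name and the first word from the manifests in this thread; everything else is illustrative. Both forms decode to the same Python object, so the difference is only in the bytes written to disk.

```python
import json

sample = {"audio_filepath": "36_1_ytnoreen.wav", "duration": 15.48, "text": "يس"}

escaped = json.dumps(sample, ensure_ascii=True)   # writes \uXXXX escapes
literal = json.dumps(sample, ensure_ascii=False)  # writes the characters as-is

print(escaped)
print(literal)

# Both serializations load back to the identical dictionary.
assert json.loads(escaped) == json.loads(literal) == sample
```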
I have uploaded the tokenizer we used to the previous link with sample files (TLDv1.zip) and will try generating a tokenizer/manifest with unicode escaped characters.
Also, can you try to force locale via LC_ALL=en_US.UTF-8...
Sure, will give this a shot too.
I still get the same issue even when setting ensure_ascii=True when creating the manifest and setting LC_ALL=en_US.UTF-8 as an env variable during training.
The training manifest has unicode-escaped text, but eventually the process_asr_text_tokenizer.py script decodes the escapes into Arabic characters, which can be seen in the tokenizer's tokenizer.vocab file.
Any other suggestions here?
Ah ok, I was about to say to use ensure_ascii=True, but it still seems to not work? Btw, process_asr_text_tokenizer will use your cached text; it will not recompute the text from the manifest if it detects that the text file already exists, so I hope you are deleting the raw text in that dir before subsequent retries.
I'm actually out of ideas. Sorry about this, but could you retrace the steps: delete the entire tokenizer dir (with text + tokenizers), recreate the manifest with ensure_ascii=True, recreate the tokenizer using the process script, and retrain for a few epochs to see what's going on.
I tried the few audio files you sent in that drive to overfit it and see if ground truth gets corrupted or not..
Interestingly, if I simply copy-paste the manifest file here and try to train a model based on it, NeMo immediately crashes with a JSON decoding error as expected.
Save the following content with utf-8 encoding -
{"audio_filepath": "2_21_wa436789.wav", "duration": 18.71, "text": "يا أيها الناس اعبدوا ربكم الذي خلقكم والذين من قبلكم لعلكم تتقون"}
{"audio_filepath": "20_1_ytkurdi.wav", "text": "طه ما أنزلنا عليك القرآن لتشقى إلا تذكرة لمن يخشى تنزيلا ممن خلق الأرض والسماوات العلا الرحمن على العرش استوى", "duration": 29.01}
{"audio_filepath": "36_1_ytnoreen.wav", "text": "يس والقرآن الحكيم إنك لمن المرسلين على صراط مستقيم ", "duration": 15.48}
then try
import json

def read_manifest(path):
    manifest = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            print(line)
            # Each line is expected to hold exactly one JSON object.
            data = json.loads(line, )
            manifest.append(data)
    return manifest
throws the error (both here and in nemo training script) -
File "/home/smajumdar/PycharmProjects/nemo-eval/finetuning/challenges/gram_vaani/notebooks/check_manifest_temp.py", line 102, in main
manifest = read_manifest(manifest_path)
File "/home/smajumdar/PycharmProjects/nemo-eval/finetuning/challenges/gram_vaani/notebooks/check_manifest_temp.py", line 18, in read_manifest
data = json.loads(line, )
File "/home/smajumdar/anaconda3/envs/NeMo/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/home/smajumdar/anaconda3/envs/NeMo/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/smajumdar/anaconda3/envs/NeMo/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)
So how were you able to train with such encoding? The only way I could get it to work was when I used ensure_ascii=True, so it wrote explicit \u... tokens; then NeMo can load the manifest and start training.
I was able to train the model on those 3 files just as always. Here's the output after 1500 updates (train loss is ~0). The model overfit (22M params, 3 audio files), but the ground truth is correct and there are no ?? marks during training or inference -
[NeMo I 2022-03-07 15:56:16 rnnt_wer_bpe:232] reference :يا أيها الناس اعبدوا ربكم الذي خلقكم والذين من قبلكم لعلكم تتقون
[NeMo I 2022-03-07 15:56:16 rnnt_wer_bpe:233] predicted :يا أيها الناس اعبدوا ربكم الذي خلقكم والذين من ق
[NeMo I 2022-03-07 15:56:16 rnnt_wer_bpe:232] reference :طه ما أنزلنا عليك القرآن لتشقى إلا تذكرة لمن يخشى تنزيلا ممن خلق الأرض والسماوات العلا الرحمن على العرش استوى
[NeMo I 2022-03-07 15:56:16 rnnt_wer_bpe:233] predicted :طه ما أنزلنا عليك القرآن لتشقى إلا تذكرة لمن يخشى تنزيلا ممن العلا الرحمن على العرش استوى
[NeMo I 2022-03-07 15:56:16 rnnt_wer_bpe:232] reference :يس والقرآن الحكيم إنك لمن المرسلين على صراط مستقيم
[NeMo I 2022-03-07 15:56:16 rnnt_wer_bpe:233] predicted :يس والقرآن الحكيم إنك لمن المرسلين على صراط مستقيم
Edit: Nvm, it converged to perfection after a few more steps. So I am wondering if you need to redo the manifest creation + tokenizer construction.
[NeMo I 2022-03-07 16:03:00 rnnt_wer_bpe:232] reference :يا أيها الناس اعبدوا ربكم الذي خلقكم والذين من قبلكم لعلكم تتقون
[NeMo I 2022-03-07 16:03:00 rnnt_wer_bpe:233] predicted :يا أيها الناس اعبدوا ربكم الذي خلقكم والذين من قبلكم لعلكم تتقون
[NeMo I 2022-03-07 16:03:00 rnnt_wer_bpe:232] reference :طه ما أنزلنا عليك القرآن لتشقى إلا تذكرة لمن يخشى تنزيلا ممن خلق الأرض والسماوات العلا الرحمن على العرش استوى
[NeMo I 2022-03-07 16:03:00 rnnt_wer_bpe:233] predicted :طه ما أنزلنا عليك القرآن لتشقى إلا تذكرة لمن يخشى تنزيلا ممن خلق الأرض والسماوات العلا الرحمن على العرش استوى
[NeMo I 2022-03-07 16:03:01 rnnt_wer_bpe:232] reference :يس والقرآن الحكيم إنك لمن المرسلين على صراط مستقيم
[NeMo I 2022-03-07 16:03:01 rnnt_wer_bpe:233] predicted :يس والقرآن الحكيم إنك لمن المرسلين على صراط مستقيم
Epoch 1899: 100%|██████████████████████████████████| 4/4 [00:02<00:00, 1.43it/s, loss=0.00301, v_num=1-38
Epoch 1899, global step 1899: val_wer reached 0.00000 (best 0.00000), ...
Very interesting...This is good, but the text doesn't have diacritics 😅 To confirm, you are using the same configuration/tokenizer creation process I described earlier? Here's the same manifest diacritized:
{"audio_filepath": "2_21_wa436789.wav", "duration": 18.71, "text": "يَا أَيُّهَا النَّاسُ اعْبُدُوا رَبَّكُمْ الَّذِي خَلَقَكُمْ وَاَلَّذِينَ مِنْ قَبْلِكُمْ لَعَلَّكُمْ تَتَّقُونَ"}
{"audio_filepath": "20_1_ytkurdi.wav", "text": "طه مَا أَنْزَلْنَا عَلَيْك الْقُرْآنَ لَتُشْقَّى إلَّا تَذْكِرَةً لِمَنْ يَخْشَى تَنْزِيلًا مِمَّنْ خَلَقَ الْأَرْضَ وَالسَّمَاوَاتِ الْعِلا الرَّحْمَنَ عَلَى الْعَرْشِ اسْتَوَى", "duration": 29.01}
{"audio_filepath": "36_1_ytnoreen.wav", "text": "يس وَالْقُرْآنُ الْحَكِيمُ إنَّك لِمِنْ الْمُرْسَلِينَ عَلَى صِرَاطٍ مُسْتَقِيمٍ", "duration": 15.48}
Let me try reproducing your results using this manifest as well.
Also, it seems like you are using conformer rnnt and not citrinet. Is that intended?
Re: the JSONDecodeError, I had to make sure there was no extra line at the end of the manifest file.
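For checking a manifest locally, a reader that tolerates that trailing blank line might look roughly like this. It is a sketch for inspecting your own files, not NeMo's loader; the function name is made up, and it takes any iterable of lines so an open file handle works as well as a list.

```python
import json


def read_manifest_lines(lines):
    """Parse one JSON object per line, skipping blank lines."""
    manifest = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # a stray empty line would otherwise raise JSONDecodeError
        try:
            manifest.append(json.loads(line))
        except json.JSONDecodeError as err:
            # Report which manifest line is malformed.
            raise ValueError(f"Bad JSON on manifest line {line_no}: {err}") from err
    return manifest


# Hypothetical manifest content ending in an extra empty line.
lines = ['{"audio_filepath": "a.wav", "duration": 1.0, "text": "x"}\n', "\n"]
print(len(read_manifest_lines(lines)))  # 1
```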
This is good, but the text doesn't have diacritics
I used the manifest you provided; I don't know enough about the language to discern diacritics. If you have audio files with text that does have diacritics (as provided in your update), I'll redo the experiment and report back tomorrow.
The encoder is not the issue; it has practically no interaction with text. At most, the prediction network or the data loader / tokenizer would be causing the issue. I can redo this with a Citrinet CTC model too; it would just take longer to converge and I won't be able to cleanly separate encoder interaction with text.
Thanks, I was able to get the JSONDecodeError fixed with your suggestion. Training a new model with that manifest shows nearly the same result, convergence in around 1750 update steps.
Training a new model with that manifest shows nearly the same result, convergence in around 1750 update steps.
Ok let me try reproducing using this minimal manifest. I can't seem to train a CitriNet model as I get the following error:
Traceback (most recent call last):
File "train.py", line 66, in main
trainer.fit(asr_model)
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
self._call_and_handle_interrupt(
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1311, in _run_train
self._run_sanity_check(self.lightning_module)
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1375, in _run_sanity_check
self._evaluation_loop.run()
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 122, in advance
output = self._evaluation_step(batch, batch_idx, dataloader_idx)
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 217, in _evaluation_step
output = self.trainer.accelerator.validation_step(step_kwargs)
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 239, in validation_step
return self.training_type_plugin.validation_step(*step_kwargs.values())
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 444, in validation_step
return self.model(*args, **kwargs)
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 92, in forward
output = self.module.validation_step(*inputs, **kwargs)
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/nemo/collections/asr/models/ctc_models.py", line 650, in validation_step
self._wer.update(
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/torchmetrics/metric.py", line 263, in wrapped_func
return update(*args, **kwargs)
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/nemo/collections/asr/metrics/wer_bpe.py", line 195, in update
reference = self.decode_tokens_to_str(target)
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/nemo/collections/asr/metrics/wer_bpe.py", line 148, in decode_tokens_to_str
hypothesis = self.tokenizer.ids_to_text(tokens)
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/nemo/collections/common/tokenizers/sentencepiece_tokenizer.py", line 142, in ids_to_text
return self.tokenizer.decode_ids(ids)
File "/home/allabana/.virtualenvs/nm/lib/python3.8/site-packages/sentencepiece/__init__.py", line 174, in DecodeIdsWithCheck
return _sentencepiece.SentencePieceProcessor_DecodeIdsWithCheck(self, ids)
IndexError: Out of range: piece id is out of range.
Have you seen this before/any suggestions to debug?
I tried recreating the tokenizer with a smaller vocab size (1024 -> 48) and it still didn't work
python process_asr_text_tokenizer.py --manifest /audio-data/debug-nemo/manifest.json --data_root tokenizers/debug-nemo --vocab_size 48 --tokenizer spe --spe_type unigram --no_lower_case --log
Edit: did some debugging, and the ids appear to be similar to the behavior mentioned earlier.
[89, 0, 89, 0, 89, 0, 89, 0, 89, 0, 89, 0, 89, 0, 89, 0, 89, 0, 89, 0, 89, 0, 89, 0]
I'm starting to think this is an issue w/ Sentencepiece and/or maybe torch
...
This is the manifest after loading and writing to another file with the ensure_ascii=True flag for json.dumps()
{"audio_filepath": "2_21_wa436789.wav", "duration": 18.71, "text": "\u064a\u064e\u0627 \u0623\u064e\u064a\u064f\u0651\u0647\u064e\u0627 \u0627\u0644\u0646\u064e\u0651\u0627\u0633\u064f \u0627\u0639\u0652\u0628\u064f\u062f\u064f\u0648\u0627 \u0631\u064e\u0628\u064e\u0651\u0643\u064f\u0645\u0652 \u0627\u0644\u064e\u0651\u0630\u0650\u064a \u062e\u064e\u0644\u064e\u0642\u064e\u0643\u064f\u0645\u0652 \u0648\u064e\u0627\u064e\u0644\u064e\u0651\u0630\u0650\u064a\u0646\u064e \u0645\u0650\u0646\u0652 \u0642\u064e\u0628\u0652\u0644\u0650\u0643\u064f\u0645\u0652 \u0644\u064e\u0639\u064e\u0644\u064e\u0651\u0643\u064f\u0645\u0652 \u062a\u064e\u062a\u064e\u0651\u0642\u064f\u0648\u0646\u064e"}
{"audio_filepath": "20_1_ytkurdi.wav", "text": "\u0637\u0647 \u0645\u064e\u0627 \u0623\u064e\u0646\u0652\u0632\u064e\u0644\u0652\u0646\u064e\u0627 \u0639\u064e\u0644\u064e\u064a\u0652\u0643 \u0627\u0644\u0652\u0642\u064f\u0631\u0652\u0622\u0646\u064e \u0644\u064e\u062a\u064f\u0634\u0652\u0642\u064e\u0651\u0649 \u0625\u0644\u064e\u0651\u0627 \u062a\u064e\u0630\u0652\u0643\u0650\u0631\u064e\u0629\u064b \u0644\u0650\u0645\u064e\u0646\u0652 \u064a\u064e\u062e\u0652\u0634\u064e\u0649 \u062a\u064e\u0646\u0652\u0632\u0650\u064a\u0644\u064b\u0627 \u0645\u0650\u0645\u064e\u0651\u0646\u0652 \u062e\u064e\u0644\u064e\u0642\u064e \u0627\u0644\u0652\u0623\u064e\u0631\u0652\u0636\u064e \u0648\u064e\u0627\u0644\u0633\u064e\u0651\u0645\u064e\u0627\u0648\u064e\u0627\u062a\u0650 \u0627\u0644\u0652\u0639\u0650\u0644\u0627 \u0627\u0644\u0631\u064e\u0651\u062d\u0652\u0645\u064e\u0646\u064e \u0639\u064e\u0644\u064e\u0649 \u0627\u0644\u0652\u0639\u064e\u0631\u0652\u0634\u0650 \u0627\u0633\u0652\u062a\u064e\u0648\u064e\u0649", "duration": 29.01}
{"audio_filepath": "36_1_ytnoreen.wav", "text": "\u064a\u0633 \u0648\u064e\u0627\u0644\u0652\u0642\u064f\u0631\u0652\u0622\u0646\u064f \u0627\u0644\u0652\u062d\u064e\u0643\u0650\u064a\u0645\u064f \u0625\u0646\u064e\u0651\u0643 \u0644\u0650\u0645\u0650\u0646\u0652 \u0627\u0644\u0652\u0645\u064f\u0631\u0652\u0633\u064e\u0644\u0650\u064a\u0646\u064e \u0639\u064e\u0644\u064e\u0649 \u0635\u0650\u0631\u064e\u0627\u0637\u064d \u0645\u064f\u0633\u0652\u062a\u064e\u0642\u0650\u064a\u0645\u064d", "duration": 15.48}
Training seems to progress fine, converging after 1100 steps -
[NeMo I 2022-03-07 17:54:39 rnnt_wer_bpe:232] reference :يَا أَيُّهَا النَّاسُ اعْبُدُوا رَبَّكُمْ الَّذِي خَلَقَكُمْ وَاَلَّذِينَ مِنْ قَبْلِكُمْ لَعَلَّكُمْ تَتَّقُونَ
[NeMo I 2022-03-07 17:54:39 rnnt_wer_bpe:233] predicted :يَا أَيُّهَا النَّاسُ اعْبُدُوا رَبَّكُمْ الَّذِي خَلَقَكُمْ وَاَلَّذِينَ مِنْ قَبْلِكُمْ لَعَلَّكُمْ تَتَّقُونَ
[NeMo I 2022-03-07 17:54:40 rnnt_wer_bpe:232] reference :طه مَا أَنْزَلْنَا عَلَيْك الْقُرْآنَ لَتُشْقَّى إلَّا تَذْكِرَةً لِمَنْ يَخْشَى تَنْزِيلًا مِمَّنْ خَلَقَ الْأَرْضَ وَالسَّمَاوَاتِ الْعِلا الرَّحْمَنَ عَلَى الْعَرْشِ اسْتَوَى
[NeMo I 2022-03-07 17:54:40 rnnt_wer_bpe:233] predicted :طه مَا أَنْزَلْنَا عَلَيْك الْقُرْآنَ لَتُشْقَّى إلَّا تَذْكِرَةً لِمَنْ يَخْشَى تَنْزِيلًا مِمَّنْ خَلَقَ الْأَرْضَ وَالسَّمَاوَاتِ الْعِلا الرَّحْمَنَ عَلَى الْعَرْشِ اسْتَوَى
[NeMo I 2022-03-07 17:54:40 rnnt_wer_bpe:232] reference :يس وَالْقُرْآنُ الْحَكِيمُ إنَّك لِمِنْ الْمُرْسَلِينَ عَلَى صِرَاطٍ مُسْتَقِيمٍ
[NeMo I 2022-03-07 17:54:40 rnnt_wer_bpe:233] predicted :يس وَالْقُرْآنُ الْحَكِيمُ إنَّك لِمِنْ الْمُرْسَلِينَ عَلَى صِرَاطٍ مُسْتَقِيمٍ
Epoch 1099: 100%|█████████| 4/4 [00:02<00:00, 1.44it/s, loss=0.157, v_num=6-42
Epoch 1099, global step 1099: val_wer reached 0.00000
The error above occurs when sentencepiece tries to detokenize a subword but with wrong vocabulary index - for example using a CTC/RNNT decoder with 1024+1 vocab size but a sentencepiece constructed with only 48 tokens will raise this error. I am using the tokenizer provided by you, without reconstruction via text. Simply doing that it seems to work well, the WER is dropping correctly and quickly.
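A quick way to confirm that kind of mismatch before decoding is to compare the target IDs with the tokenizer's size. The helper below is a hypothetical sketch; the IDs and the vocab size of 48 are the ones reported earlier in this thread, and with a real Sentencepiece model the size would come from the tokenizer itself (for example its piece count).

```python
def out_of_range_ids(ids, vocab_size):
    """Return the distinct IDs that the tokenizer could not decode."""
    return sorted({i for i in ids if not 0 <= i < vocab_size})


ids = [89, 0, 89, 0]

# ID 89 does not exist in a 48-token vocabulary, but fits in a 1024-token one.
print(out_of_range_ids(ids, 48))    # [89]
print(out_of_range_ids(ids, 1024))  # []
```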
Attached the model config + train script (rename the .txt out of them). conformer_transducer_bpe.yaml.txt speech_to_text_rnnt_bpe.py.txt
I think I found the issue...
If I do not initialize the model from a pre-trained model (e.g. stt_en_citrinet_1024), then I am able to see the reference/predicted text accurately and the model overfits on the 3-file manifest. 🎉
However, in order for us to get good results on our dataset, we need to finetune from a pre-trained model.
It seems something must have changed with the model hosted over at NGC, which is why we suddenly could no longer reproduce our previous results.
If my observation is true, is it possible to use an old version of stt_en_citrinet_1024?
If not, what language do you recommend we fine tune from?
I am aware that the Riva team is exploring training Arabic models (and we're working on preparing some public datasets for use), but that probably won't be for a few months I'm guessing.
All NGC versions are pointed to on the NGC page of the model - you can manually download any Nemo file and load that for finetuning.
I don't think we have updated that model card anytime recently. And even if we did, its vocabulary would update with the change_vocabulary() step. So why would the model's predictions be ?? during training?
Oh wait.. I've just realized your dataset setup is before your tokenizer setup. That would explain what's happening -
1) Your text gets tokenized by the old internal tokenizer of the model before the change. This gets preserved.
2) You change the tokenizer; old subwords no longer map to new IDs, and the pretokenized text no longer corresponds to the new tokenizer's IDs.
3) Your vocab and everything else updates to the new tokenizer, but your pretokenized text still holds old tokenizer IDs, so when you decode it, it maps to random subwords.
Try pushing the change vocabulary right after setup trainer.
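To make the ordering problem concrete, here is a toy model of the mechanism described above. ToyTokenizer and ToyDataset are hypothetical stand-ins, not NeMo classes; the only assumption is that the dataset keeps using whichever tokenizer it was given when it was set up.

```python
class ToyTokenizer:
    """Word-level stand-in for a subword tokenizer (hypothetical)."""

    def __init__(self, vocab):
        self.vocab = ["<unk>"] + list(vocab)
        self.ids = {w: i for i, w in enumerate(self.vocab)}

    def text_to_ids(self, text):
        return [self.ids.get(w, 0) for w in text.split()]

    def ids_to_text(self, ids):
        return " ".join(self.vocab[i] if i < len(self.vocab) else "<unk>" for i in ids)


class ToyDataset:
    """Keeps whichever tokenizer it was handed at setup time."""

    def __init__(self, texts, tokenizer):
        self.texts = texts
        self.tokenizer = tokenizer

    def targets(self):
        return [self.tokenizer.text_to_ids(t) for t in self.texts]


old_tok = ToyTokenizer(["the", "cat", "sat"])  # tokenizer of the pretrained model
new_tok = ToyTokenizer(["طه", "يس"])           # tokenizer built for the new language
texts = ["طه يس"]

# Wrong order: data is set up first, so targets come from the old tokenizer,
# but they are decoded with the new one after the vocabulary change.
stale = ToyDataset(texts, old_tok)
wrong = [new_tok.ids_to_text(ids) for ids in stale.targets()]

# Right order: vocabulary is changed first, then the data is set up.
fresh = ToyDataset(texts, new_tok)
right = [new_tok.ids_to_text(ids) for ids in fresh.targets()]

print(wrong)  # ['<unk> <unk>']
print(right)
```

In the wrong order the reference text itself decodes to unknown symbols, which matches the ?? seen in the WER logs; in the right order it round-trips unchanged.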
Try pushing the change vocabulary right after setup trainer.
Wow... that is so nuanced! And it fixed the issue 🎉 I don't think it would have been possible to think that was the issue if someone didn't have insight into NeMo's internals.
This really needs to be documented/commented/highlighted somewhere... anywhere. At least in the cross-language fine-tuning tutorial or something.
Thank you so much for helping me debug this @titu1994! Appreciate the time spent going through this.
Agreed that we need to improve documentation; such issues are subtle and incredibly hard to debug. There's no way to even test this easily to raise an error.
Would you comment on whether such an execution diagram would help avoid such mistakes in the future? https://github.com/NVIDIA/NeMo/pull/3812
To check the result you need to visit the branch - https://github.com/titu1994/NeMo/tree/high_level_asr_diag/examples/asr/asr_ctc
@titu1994 a diagram like that is of course helpful. I keep wondering though about the following: say one uses just the yaml config file, provides their own train, validation, and test datasets and tokenizer, and also sets the yaml config parameter init_from_pretrained or init_from_nemo_model. What will happen in that case? Will the order of execution be correct?
I'm asking this because, in the case of the Arabic model initialized from English, the misordering led to UNK symbols. However, I would assume that in the case of some other Latin-character-based model, the "error", as you said, could be much more subtle and hard to spot (resulting just in a higher WER than optimal, lost/un-transcribed words, or something similar).
Ditto, this diagram is helpful. Will leave comments on the PR.
If possible, updating the ASR_CTC_Language_Finetuning tutorial would also be helpful.
Just a sentence in the "Update the vocabulary" section saying something along the lines of "The change_vocabulary() must be performed before setting up your datasets otherwise you might get decoding errors."
@piraka9011 In general, I'll add a note to the very top of the notebook to review the execution flow diagram before starting finetuning.
@itzsimpl The ordering given there is what happens inside the constructor of the nemo model itself. When you use init from pretrained - it will first load your model with your config and data loaders etc and only load the pytorch checkpoint weights from the older model into your already initialized model.
The older model's tokenizer or dataloaders are not used at all, only its weights are copied into your new model.
Describe the bug
I am training an Arabic model with diacritics. Digitally, each diacritic is represented as a separate (unicode) character from the actual letter. Here's an example of the vocabulary/text that we are using. There are 10,878 unique words in the vocabulary (so large enough for a SPE tokenizer with a vocab_size of 1024).
I am able to generate a SPE-based tokenizer using our diacritized vocabulary. I can confirm that the generated document.txt and vocab.txt for the tokenizer have Arabic text represented correctly. However, somewhere during training the tokenizer fails to decode the text properly.
This is an example output from the WER metric:
Notice two things here:
reference text is only two characters: ⁇ and غ.
If I train a model without diacritics, the same behavior occurs. The last working NeMo version was 1.4.0 (will test with 1.5.0). Previously, the model would converge at a reasonable rate and achieve good results (~4% WER).
Steps/Code to reproduce bug
There's nothing really special about my training script, it's taken from the examples. The model is finetuned from the stt_en_citrinet_1024 model.
cfg.model_target is nemo.collections.asr.models.EncDecCTCModelBPE
The tokenizer is generated using process_asr_text_tokenizer.py:
Expected behavior
The model should be able to converge and predict Arabic text accurately.
Environment overview (please complete the following information)
Docker image: nvcr.io/nvidia/pytorch:21.10-py3
NeMo install: pip install nemo_toolkit[all]==1.7.0
docker pull & docker run commands used: Dockerfile:
Additional context
GPU: 8xV100 (AWS p3dn.24xlarge)