Open JeromeNi opened 3 years ago
I have the same problems with TIMIT and the LibriSpeech-10h corpus. But even worse, I got nearly nothing from the decode results with w2vu_generate.py: there is only one phone per line in the resulting valid.txt for LibriSpeech-10h. Do you know how to fix this?
I have not experienced this issue. For me, the output all follows a certain pattern (probably of the LM) but seems to have completely ignored the acoustic features:
sil hh iy z sil b eh sil s iy ah l ah ng ah sil m oy sil t f ih n dh ah z ih z sil d aa sil m aa r sil sil hh iy t m sil p er sil k er ih w ih sil g eh v m ih dx ae n sil z ih z ah sil k ih sil ch eh v iy s ih m sil k y uw n hh eh r w eh sil s sil sil hh iy ih s sil d l uw y uw l sil ch ay n ih hh ae sil r ih k l sil sil hh iy m aa r ah sil d ih z t uw n ah sil k s sil eh dx ah l z w ah l iy sil t l ih sil k n l t ih l ah sil k er sil sil hh iy t aa r l ae n sil g r l ng sil ah sil p iy n sil g r iy ng sil z ih s sil b ih dh ah sil b uw s t aa l hh eh s sil sil hh iy ih n sil iy er t s ey sil k uw sil jh aa sil t ah sil k l ay er z aa er s sil k aa r s eh m iy s ng sil sil s iy t ow l sil t er s sil k ah m l ay l f iy uw iy r iy v w aa n dh ah t aa l er sil m ih dx ah n ih sil sil hh iy sil t r uw ng z iy aa l ah n d ih f er sil b l er sil d er l iy t ae n sil ah sil k jh aa s sil sil hh ih n g hh aw n sil g eh m er ih n sil t er m y uw ng sil k sil t ay ih sil k w eh ih n z ih sil k ah sil b l aa sil d eh dx er v er r ih dx ah sil k ih s sil k m sil p eh sil
Hi @JeromeNi @JINGZIjingzi, I got through data preparation with "zsh scripts/prepare_text.sh language /path/to/text/file /output/dir". Later I put the code under data, models, tasks into fairseq's corresponding directories and changed __init__.py under those directories. But I'm not able to solve this error. Please give any suggestions and tell me what extra changes are needed to run this code.
File "/fairseq/fairseq/models/wav2vec_u.py", line 19, in
I haven't encountered this before. Perhaps the file path is not correct. I copied the files in {fairseq_root}/fairseq/examples/wav2vec/unsupervised/{models, tasks, data} to {fairseq_root}/fairseq/{models/wav2vec, tasks, data}.
Thank you @JINGZIjingzi,
I tried to run the code by copying files from {fairseq_root}/fairseq/examples/wav2vec/unsupervised/{models, tasks, data} to {fairseq_root}/fairseq/{models/wav2vec, tasks, data}, but it still didn't work.
Let me explain the exact changes I made:
1. fairseq/models/wav2vec folder: i) copied {fairseq_root}/fairseq/examples/wav2vec/unsupervised/models/wav2vec_u.py to {fairseq_root}/fairseq/models/wav2vec/wav2vec_u.py; ii) added one more line, from .wav2vec_u import Wav2vec_U, in {fairseq_root}/fairseq/models/wav2vec/__init__.py
2. fairseq/tasks folder: i) copied {fairseq_root}/fairseq/examples/wav2vec/unsupervised/tasks/unpaired_audio_text.py to {fairseq_root}/fairseq/tasks/unpaired_audio_text.py
3. fairseq/data folder: i) copied {fairseq_root}/fairseq/examples/wav2vec/unsupervised/data/extracted_features_dataset.py and random_input_dataset.py to {fairseq_root}/fairseq/data/; ii) added four lines in {fairseq_root}/fairseq/data/__init__.py: from .extracted_features_dataset import ExtractedFeaturesDataset, from .random_input_dataset import RandomInputDataset, and appended "ExtractedFeaturesDataset" and "RandomInputDataset" to the existing names in __all__
I've been trying preprocessing and training for a week and I'm stuck at this point. Could you please give any suggestions? Thank you.
Are the log messages still the same? I cannot figure out what is wrong with your code. I made the same changes as you described above.
Hi @JINGZIjingzi,
Before, I kept the code in {fairseq_root}/fairseq/{models, tasks, data} and got the previously posted error.
Following your suggestion, I changed it to {fairseq_root}/fairseq/{models/wav2vec, tasks, data}.
Then I'm getting the following error:

File "fairseq/fairseq_cli/hydra_train.py", line 45, in hydra_main
    distributed_utils.call_main(cfg, pre_main)
File "fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
File "fairseq/fairseq_cli/train.py", line 88, in main
    task = tasks.setup_task(cfg.task)
File "fairseq/fairseq/tasks/__init__.py", line 45, in setup_task
    assert task is not None, f"Could not infer task type from {cfg}. Available tasks: {TASK_REGISTRY.keys()}"
AssertionError: Could not infer task type from {'_name': 'unpaired_audio_text', 'data': 'fairseq/exp/feats/precompute_pca512_cls128_mean_pooled', 'text_data': 'fairseq/exp/lm/phones', 'labels': 'phn', 'sort_by_length': False, 'unfiltered': False, 'max_length': None, 'append_eos': False, 'kenlm_path': 'fairseq/exp/lm/phones/lm.phones.filtered.04.bin'}. Available tasks: dict_keys(['hubert_pretraining', 'speech_recognition', 'gan_audio_pretraining_feats', 'translation', 'multilingual_translation', 'semisupervised_translation', 'translation_multi_simple_epoch', 'translation_from_pretrained_bart', 'legacy_masked_lm', 'cross_lingual_lm', 'denoising', 'multilingual_denoising', 'language_modeling', 'audio_pretraining', 'multilingual_masked_lm', 'sentence_ranking', 'speech_to_text', 'simul_speech_to_text', 'simul_text_to_text', 'online_backtranslation', 'masked_lm', 'translation_lev', 'translation_from_pretrained_xlm', 'sentence_prediction', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt'])
I got it. You can change @register_task("gan_audio_pretraining_feats", dataclass=UnpairedAudioTextConfig) in tasks/unpaired_audio_text.py to @register_task("unpaired_audio_text", dataclass=UnpairedAudioTextConfig).
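To see why the registered name matters: fairseq's setup_task looks the config's _name up in a task registry dict, so a mismatch between the decorator name and the config produces the "Could not infer task type" assertion above. A toy, self-contained illustration of that registry pattern (this is NOT fairseq's actual code, just a sketch of the mechanism):

```python
# Minimal sketch of a task registry, mimicking how fairseq resolves
# a task name from the config. Names and classes here are illustrative.
TASK_REGISTRY = {}

def register_task(name):
    """Decorator that records a task class under `name`."""
    def wrapper(cls):
        TASK_REGISTRY[name] = cls
        return cls
    return wrapper

def setup_task(cfg):
    """Look up the task class by the config's '_name' field."""
    task_cls = TASK_REGISTRY.get(cfg["_name"])
    assert task_cls is not None, (
        f"Could not infer task type from {cfg}. "
        f"Available tasks: {TASK_REGISTRY.keys()}"
    )
    return task_cls()

@register_task("unpaired_audio_text")  # name now matches the config below
class UnpairedAudioText:
    pass

task = setup_task({"_name": "unpaired_audio_text"})
print(type(task).__name__)  # -> UnpairedAudioText
```

If the decorator still said "gan_audio_pretraining_feats", the lookup for "unpaired_audio_text" would fail exactly as in the traceback above.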
Thank you @JINGZIjingzi, now I got it. If I get proper results I will let you know.
Update: the LibriSpeech-100h result seems fine. The phone error rate from w2vu_generate.py is around 28% from one of the runs. I had just messed up the order of transcriptions in the *.phn files for evaluation earlier.
That's great! I am still stuck on LibriSpeech-10h. My dict.phn.txt is like this (the first number in each line is not included in the file; it just marks the line). But I don't know how to generate the corresponding *.phn files. There are only *.wrd and *.ltr files, so I did not use any labels when training the GAN model. The results from w2vu_generate.py are nearly empty. Could you give some advice? Thanks a lot!
1 s 4399
2 n 4396
3 ɪ 4256
4 t 3656
5 d 3250
6 ɹ 2975
7 l 2951
8 ə 2812
9 k 2779
10 ɛ 2047
11 z 2012
12 p 1993
13 m 1986
14 æ 1537
15 ɚ 1464
16 ᵻ 1294
17 ŋ 1223
18 b 1206
19 eɪ 1206
20 f 1180
21 i 1175
22 v 920
23 ʌ 900
24 ʃ 823
25 iː 803
26 ɑː 779
27 oʊ 775
28 aɪ 755
29 ɡ 755
30 ɾ 750
31 uː 699
32 əl 678
33 w 632
34 dʒ 529
35 h 501
...
57 aɪə 40
58 n̩ 35
59 ʔ 34
60 <SIL> 0
I used the package referenced in a footnote of the paper (https://github.com/Kyubyong/g2p) for converting English text to phoneme transcriptions. I removed all numerical stress from the vowels plus an additional (') symbol to get the following 39-phoneme set as described in the original paper:
AH 336212
N 238016
S 209100
T 194878
L 188633
IH 182116
R 172703
K 154411
IY 138375
Z 128619
D 124602
M 113743
ER 101165
EH 100869
AA 98322
AE 84627
B 81689
P 80531
OW 69927
G 55230
F 53820
EY 47962
UW 43357
V 42622
AO 42569
W 38987
AY 35320
HH 34826
NG 32503
SH 30457
JH 25340
Y 21735
CH 20112
TH 17105
AW 13323
UH 8315
OY 5219
DH 2440
ZH 1404
<SIL> 0
You can use the following code to generate *.phn files (I modified it from one of the provided scripts just to remove the additional lexical stress): https://gist.github.com/JeromeNi/2d3118d9685a9ea4cdcc66d5bc8659c8
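For the stress removal itself, here is a minimal stdlib sketch of the cleanup described above (the function name and exact regex are my own; the linked gist is the authoritative version):

```python
import re

def strip_stress(phones):
    """Remove numeric stress markers (e.g. AH0 -> AH) and apostrophes
    from a list of ARPAbet phones, collapsing to the 39-phone set."""
    cleaned = []
    for p in phones:
        p = re.sub(r"[0-9']", "", p)  # drop stress digits and the (') symbol
        if p:  # skip tokens that were only an apostrophe
            cleaned.append(p)
    return cleaned

print(strip_stress(["HH", "AH0", "L", "OW1", "'"]))  # -> ['HH', 'AH', 'L', 'OW']
```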
To train the GAN, you may not need the parallel text from LibriSpeech, but you do need some sort of non-parallel text. I simply used the LibriSpeech LM corpus (https://www.openslr.org/11/; librispeech-lm-norm.txt.gz), which I believe was also used in the authors' LibriSpeech experiments. The readme file there says that "the texts were selected so as to avoid using texts that have even partial overlap with the books on which the LibriSpeech training and development sets are based". The corpus is fairly large and can cause some computational burden when generating the FSTs (though those are likely not required for GAN training), but I just used the whole corpus anyway.
I have yet to figure out why the model fails on TIMIT though.
Thank you so much! I will try again by your steps.
I have installed the wav2letter python bindings, which are a bit easier to install.
@JeromeNi I tried the new pipeline, and the results seem a bit better. The PER on my valid set is 64.4% with the Viterbi decoder, but for my test set (dev_clean in LibriSpeech) the PER is 89.3%. To train faster, I only use 1% of the librispeech-lm-norm.txt text data and the 10h subset of LibriSpeech for audio data. I also changed max_update in GAN training from 150000 to 50000. I am not sure whether the amount of data is the main reason for the bad results. Have you tried a smaller data size, and how was the performance?
Besides, I ran into another problem when generating the results with w2vu_generate.py. The outputs of the Generator in wav2vec_u.py include some NaN values. The Viterbi decoder seems fine with the NaNs, while the kenlm decoder outputs nothing for the emissions containing NaN. So I cannot use w2vu_generate.py to obtain results with the kenlm decoder. Have you met this before?
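For anyone wanting to reproduce the text subsetting (e.g. the 1% of librispeech-lm-norm.txt mentioned above), a simple deterministic sketch is to keep every N-th line; the actual sampling method used in these experiments wasn't stated, so this is just one reasonable choice:

```python
def subset_lines(in_path, out_path, keep_every=100):
    """Write every `keep_every`-th line of a large text corpus to out_path.
    keep_every=100 keeps ~1% of the lines (e.g. of librispeech-lm-norm.txt)."""
    kept = 0
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for i, line in enumerate(fin):
            if i % keep_every == 0:
                fout.write(line)
                kept += 1
    return kept
```

Streaming line by line keeps memory flat even for the multi-gigabyte LM corpus.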
The problems are fixed. I hadn't removed silence from the test set before. The results look right after I prepared the test data again: the PER on test (dev_clean in LibriSpeech) is 24%.
That's great! Is that the PER reported by w2vu_generate.py directly? This is also around the PER I get when using 100h of LibriSpeech clean. How many different hyperparameter settings have you tried so far?
Yes, the 24% PER is reported by w2vu_generate.py directly. I use 10h of audio from LibriSpeech and 10% of the text in librispeech-lm-norm.txt. The hyperparameter settings are the defaults except for best_checkpoint_metric in the config: I changed it from weighted_lm_ppl to uer, so labels for the valid set are needed. Meanwhile, the experiment with weighted_lm_ppl shows poor performance: the PER on test is 86%. Do you have any idea about this?
I did not use UER because I wanted the experiment to be completely unsupervised. How many seeds have you tried? I remember that for one of the random seeds in the range 0-4, I got a much higher LM perplexity score with a much higher UER as well.
There was an update to w2v-u yesterday, so I am still updating my local repo before I can try again...
The seed is set to 1 for all the experiments. I will try other seeds. Thanks!
Hi @JINGZIjingzi, @JeromeNi. After training on 100h of LibriSpeech speech data and the librispeech-lm-norm.txt text data, I got the following results:
ref : I HAVE NOT THE PLEASURE OF UNDERSTANDING YOU SAID HE
ref : AY HH AE V N AA T DH AH P L EH ZH ER AH V AH N D ER S T AE N D IH NG Y UW S EH D HH IY
hyp : M AY D DH AE T AH N DH AH HH AA R TH NG IH NG K AH DH ER S AH N K AH N L IY K AO T M AE IH NG

ref: SAID SYLVIA SHIVERING ALL OVER WITH PASSION
ref: S EH D S IH L V IY AH SH IH V ER IH NG AO L OW V ER W IH TH P AE SH AH N
hyp: AO T ER T AH SH AH K IY M R OW P AH EH R AH N D HH EH N R OW IH NG
I'm getting wrong results, but the number of phonemes in each hypothesis approximately matches the length of the original phoneme sequence. Where do we need to change the seed? I didn't see that during training.
How many iterations have you run? From the log, it seems that I did not get lower than 80% UER during training until past 30k updates, but from there it only took an additional 10k updates to achieve lower than 30% UER.
Or maybe check whether your reference transcripts in *.wrd/*.phn match the order of the entries in *.tsv? Running wav2vec_manifest.py on a new directory of silence-removed audio files may rearrange the utterance order in the new tsv files.
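As a quick sanity check for that ordering issue, something like the following can compare the manifest and label files (it assumes the usual fairseq manifest layout: root directory on the first line, then one path<TAB>frames entry per line; the helper name is mine):

```python
import os

def check_manifest_labels(tsv_path, phn_path):
    """Return (utterance ids from the .tsv, number of label lines).
    The two counts should match; spot-check a few ids against the
    transcripts to confirm the ordering as well."""
    with open(tsv_path, encoding="utf-8") as f:
        lines = f.read().strip().split("\n")
    # first line of a fairseq manifest is the audio root directory
    utt_ids = [os.path.basename(l.split("\t")[0]) for l in lines[1:]]
    with open(phn_path, encoding="utf-8") as f:
        labels = [l for l in f.read().split("\n") if l.strip()]
    if len(utt_ids) != len(labels):
        print(f"Mismatch: {len(utt_ids)} utterances vs {len(labels)} label lines")
    return utt_ids, len(labels)
```

A count mismatch (or shuffled ids) would explain hypotheses that look unrelated to their references while still having plausible lengths.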
Hi, the log below is from the final epoch (834, at 150000 updates). Train log: {"epoch": 834, "train_loss": "3.973", "train_ntokens": "157.383", "train_nsentences": "157.383", "train_temp": "0.1", "train_code_ppl": "16.575", "train_loss_code_pen": "0.644", "train_loss_smoothness": "2.901", "train_loss_dense_g": "4.421", "train_lm_score_sum": 0.0, "train_num_pred_chars": 0.0, "train_loss_grad_pen": "0.06", "train_loss_dense_d": "0.026", "train_loss_token_d": "0.024", "train_wps": "672", "train_ups": "4.27", "train_wpb": "157.4", "train_bsz": "157.4", "train_num_updates": "150000", "train_lr_discriminator": "0.0005", "train_lr_generator": "0.0004", "train_gnorm": "34.501", "train_clip": "53.3", "train_train_wall": "11", "train_gb_free": "22.3", "train_wall": "29399"}.
Valid: I'm not using any validation set; just to get past an error, I kept a few utterances with random text. Because it is unsupervised, it shouldn't depend on the valid set, right? After training completed, I compared the generated phonemes with the original phoneme sequences. Did I make a mistake anywhere here?
I'm using the wav2vec_vox_new.pt model to extract wav2vec 2.0 features from the 14th layer, and librispeech-lm-norm.txt as the text. My dict.phn.txt:

AH 336212
N 238016
S 209100
T 194878
L 188633
IH 182116
R 172703
K 154411
IY 138376
Z 128619
D 124602
M 113743
ER 101165
EH 100869
AA 98322
AE 84627
B 81689
P 80531
OW 69927
G 55230
F 53820
EY 47962
UW 43357
V 42622
AO 42569
W 38987
AY 35319
HH 34826
NG 32503
SH 30457
JH 25340
Y 21735
CH 20112
TH 17105
AW 13323
UH 8315
OY 5219
DH 2440
' 2339
ZH 1404
<SIL> 0

From the discussion above, I will remove the (') symbol later. Finally, were you able to map phonemes to words with the WER given in the paper?
I think a validation set is needed even though the training is unsupervised.
Finally, were you able to map phonemes to words with the WER given in the paper?
I have some problems generating the HLG.fst file, which is needed for the kaldi decoder, so I can't get the WER so far.
Thank you @JINGZIjingzi, @JeromeNi. I prepared the features again (maybe I missed something before), used seed 0 for training, ran about 1000 epochs on around 80 hours of LibriSpeech clean data, tested on the same dataset (the 80-hour LibriSpeech subset), and got 21% PER.
@JeromeNi @JINGZIjingzi @shiva1393 Can you please explain a bit why we should copy the code files under {fairseq_root}/fairseq/? I don't understand.
Thanks in advance for your help!
What is your question?
I am trying to replicate the results for the new wav2vec-u (https://ai.facebook.com/research/publications/unsupervised-speech-recognition) model (currently working on TIMIT). However, it seems that using the default code and scripts gave me something along the lines of 80% UER under the "matched" setting for the 400-utterance core-dev set, before applying self-training.
(Edit 06/01/2021: I changed the 'mean_pool' flag for the join segmentor to 'True' and the UER improved to 71.66%, but still far away from the reported results)
I have listed my procedures below and some minor modifications to get the code running.
Code
N/A; see below for the modifications.
What have you tried?
Below are my questions and procedures:
sil w iy l ay sil b l uw sil ch iy z sil b ah sil t v ih sil t er sil p er f er s sil w ih s sil ch iy s sil
I ran prepare_audio.sh without issues, using the Large (LV-60k) checkpoint. However, it seems that I do not need most of the code in prepare_text.sh. Below are the lines I've kept:

python $FAIRSEQ_ROOT/fairseq_cli/preprocess.py --dataset-impl mmap --trainpref $target_dir/train.phn --workers 16 --only-source --destdir $target_dir --srcdict $target_dir/dict.phn.txt
lmplz -o 4 -S 10% < $target_dir/train.phn --discount_fallback >! $target_dir/lm.phones.filtered.04.arpa
build_binary $target_dir/lm.phones.filtered.04.arpa $target_dir/lm.phones.filtered.04.bin
lmplz -o 6 -S 10% < $target_dir/train.phn --discount_fallback >! $target_dir/lm.phones.filtered.06.arpa
build_binary $target_dir/lm.phones.filtered.06.arpa $target_dir/lm.phones.filtered.06.bin
lg=$lg python $FAIRSEQ_ROOT/examples/speech_recognition/kaldi/kaldi_initializer.py fst_dir=$target_dir/fst/phn_to_phn_sil lm_arpa=$target_dir/lm.phones.filtered.06.arpa data_dir=$target_dir "blank_symbol='sil'"
I had to add -S 10% because for some reason kenlm threw a malloc OOM error. I also could not get the line invoking kaldi_initializer.py to run, as it threw an error. As I understand it, the kaldi_initializer would not be used for model training, so I moved onwards.
(Edit 06/03/2021: I have been able to get kaldi_initializer.py to run by passing in extra arguments: kaldi_root=/path/to/kaldi and in_labels=phn)
I copied the code under data, models, tasks into fairseq's corresponding directories and changed __init__.py under those directories where necessary. I modified a few things to launch GAN training on top of the preprocessed features.

First, some of the class members are not defined correctly in wav2vec_u.py: the last three class members are not defined in the code, so I had to add them. I'm not sure if I got those correctly.
I have also modified wav2vec_u.py and unpaired_audio_text.py so that all relevant hardcodes of '<SIL>' are changed to 'sil'. (I probably should have replaced TIMIT's sil with <SIL> beforehand, but either way should work.)

(Edit 06/03/2021: I read the code in wav2vec_u.py and it seems that in the function valid_step, silences are removed with the line x = x[x != self.sil_id], but in prepare_text.sh the phone LM is built with silences. What is the rationale behind this?)

I used the default hyper-parameters provided in config/gan/w2vu.yaml for training the model, but it seems that the script only logged checkpoint_best.pt and checkpoint_last.pt (because no_epoch_checkpoints is set to true in the config file) based on weighted_lm_ppl, which seems to be the "vocabulary-usage adjusted entropy" mentioned on page 14 of the paper, except for a vocab_usage_power=2 hardcoded in unpaired_audio_text.py. I only used checkpoint_best.pt for the later steps, and did not train/validate other model configurations.

I then ran w2vu_generate.py as follows:
as follows:python w2vu_generate.py --config-dir config/generate --config-name viterbi \ fairseq.task.data=$FAIRSEQ_ROOT/examples/wav2vec/unsupervised/files/wav2vec_out/precompute_pca512_cls128_mean_pooled/ \ fairseq.common_eval.path=$FAIRSEQ_ROOT/examples/wav2vec/unsupervised/multirun/2021-05-27/04-18-34/0/checkpoint_last.pt \ lm_model=$FAIRSEQ_ROOT/examples/wav2vec/unsupervised/files/wav2vec_out/precompute_pca512_cls128_mean_pooled/lm.phones.filtered.04.bin \ fairseq.dataset.gen_subset=valid results_path=$FAIRSEQ_ROOT/examples/wav2vec/unsupervised/file/wav2vec_transcriptions/
python scripts/wer.py -s $FAIRSEQ_ROOT/examples/wav2vec/unsupervised/files/wav2vec_transcriptions/valid.txt -r $FAIRSEQ_ROOT/examples/wav2vec/unsupervised/files/wav2vec_transcriptions/valid_ref.txt
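For intuition, the PER/UER that scripts/wer.py reports is token-level edit distance divided by reference length. A minimal sketch of that computation (not the actual script):

```python
def edit_distance(ref, hyp):
    """Token-level Levenshtein distance using a single rolling DP row."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i  # prev holds the diagonal (previous row, j-1)
        for j, h in enumerate(hyp, 1):
            cur = min(dp[j] + 1,        # deletion
                      dp[j - 1] + 1,    # insertion
                      prev + (r != h))  # substitution (0 cost if tokens match)
            prev, dp[j] = dp[j], cur
    return dp[-1]

def per(ref, hyp):
    """Phone error rate in percent: edits / reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)

# one substitution over 3 reference phones ~= 33.3%
print(per("s eh d".split(), "s ah d".split()))
```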
It seems that the W2lViterbiDecoder selected by the default config/generate/viterbi.yaml requires an additional argument for criterion. Therefore, I hardcoded it to be ctc:

criterion: Optional[str] = field(default="ctc", metadata={"help": "VITERBI criterion?"},)

However, the wer.py script then reports the aforementioned 71.66% UER. Any idea what needs to be changed to get close to the PER reported?
(Edit 06/01/2021)

Why does the log tell me that "train_ntokens": "616", "train_nsentences": "616", and that a single epoch finishes in 6 updates, even though the TIMIT train set has 3696 examples and the batch size in the config file has been set to 160?

For self-training, I tried train.sh within the kaldi_self_train directory. For the w2v features, which set should the script use? Should I use the segment-level, mean-pooled features as used by the GAN? If so, Kaldi throws the error that the utterance has too few frames to align. I could only start Kaldi training with the features prepared in precompute_pca512, instead of those in precompute_pca512_cls128_mean_pooled.

(Edit 06/03/2021: While I could get Kaldi started with the features in precompute_pca512, the script got stuck at steps/decode.sh --nj 20 --cmd "$decode_cmd" $exp_root/mono/graph $data/$valid $exp_root/mono/decode_$valid & within train_subset_lgbeam.sh.)

Thanks very much for the help!
What's your environment?

How you installed fairseq (pip, source): source