facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Help with replicating the results for wav2vec-u TIMIT #3581

Open JeromeNi opened 3 years ago

JeromeNi commented 3 years ago

What is your question?

I am trying to replicate the results for the new wav2vec-u (https://ai.facebook.com/research/publications/unsupervised-speech-recognition) model (currently working on TIMIT). However, it seems that using the default code and scripts gave me something along the lines of 80% UER under the "matched" setting for the 400-utterance core-dev set, before applying self-training.

(Edit 06/01/2021: I changed the 'mean_pool' flag for the join segmenter to 'True' and the UER improved to 71.66%, but that is still far from the reported results.)

I have listed my procedure below, along with the minor modifications I made to get the code running.

Code

N/A; see below for the modifications.

What have you tried?

Below are my questions and procedures:

  1. For getting TIMIT results, is {train,valid,test}.phn the only set of transcriptions needed? I followed the discussions here (https://github.com/pytorch/fairseq/issues/3425) for data generation, where each line in *.phn matches the order of the corresponding tsv file and is formatted as follows:

```
sil w iy l ay sil b l uw sil ch iy z sil b ah sil t v ih sil t er sil p er f er s sil w ih s sil ch iy s sil
```
  2. Once I installed faiss, I could run prepare_audio.sh without issues, using the Large (LV-60k) checkpoint. However, it seems that I do not need most of the code in prepare_text.sh. Below are the lines I've kept:

```
python $FAIRSEQ_ROOT/fairseq_cli/preprocess.py --dataset-impl mmap --trainpref $target_dir/train.phn --workers 16 --only-source --destdir $target_dir --srcdict $target_dir/dict.phn.txt

lmplz -o 4 -S 10% < $target_dir/train.phn --discount_fallback >! $target_dir/lm.phones.filtered.04.arpa
build_binary $target_dir/lm.phones.filtered.04.arpa $target_dir/lm.phones.filtered.04.bin

lmplz -o 6 -S 10% < $target_dir/train.phn --discount_fallback >! $target_dir/lm.phones.filtered.06.arpa
build_binary $target_dir/lm.phones.filtered.06.arpa $target_dir/lm.phones.filtered.06.bin

lg=$lg python $FAIRSEQ_ROOT/examples/speech_recognition/kaldi/kaldi_initializer.py fst_dir=$target_dir/fst/phn_to_phn_sil lm_arpa=$target_dir/lm.phones.filtered.06.arpa data_dir=$target_dir "blank_symbol='sil'"
```

I had to add -S 10% because kenlm threw a malloc OOM error for some reason. I also could not get the line invoking kaldi_initializer.py to run, as it threw the following error:

```
Traceback (most recent call last):
  File "/nobackup/users/junruin2/fairseq/examples/speech_recognition/kaldi/kaldi_initializer.py", line 677, in cli_main
    initalize_kaldi(cfg)
  File "/nobackup/users/junruin2/fairseq/examples/speech_recognition/kaldi/kaldi_initializer.py", line 616, in initalize_kaldi
    cfg.out_labels = cfg.in_labels
omegaconf.errors.MissingMandatoryValue: Missing mandatory value: in_labels
    full_key: in_labels
    reference_type=Optional[Dict[Union[str, Enum], Any]]
    object_type=dict
```

As I understand it, the kaldi_initializer is not used for model training, so I moved on.

(Edit 06/03/2021: I have been able to get kaldi_initializer.py to run by passing the extra arguments kaldi_root=/path/to/kaldi and in_labels=phn.)

  3. I then put the code under data, models, and tasks into fairseq's corresponding directories and changed __init__.py under those directories where necessary. I modified a few things to launch GAN training on top of the preprocessed features:

First, some of the class members are not defined correctly in wav2vec_u.py, so I changed

```
self.discriminator = self.Discriminator(output_size, cfg)
self.generator = self.Generator(d, output_size, cfg, lambda x: self.normalize(x)[0])
```

to

```
self.discriminator = Discriminator(output_size, cfg)
self.generator = Generator(d, output_size, cfg)
```

In addition, the following class members are referenced but never defined, so I had to add them (not sure if I got the values right):

```
self.zero_pretrain_updates = 0
self.exponential_code_pen = False
self.dynamic_step_thresh = 0
```

I have also modified wav2vec_u.py and unpaired_audio_text.py so that all relevant hardcoded occurrences of '<SIL>' are changed to 'sil'. (I probably should have replaced the TIMIT sil symbol with <SIL> beforehand, but either way should work.)

(Edit 06/03/2021: I read the code in wav2vec_u.py and it seems that in the function valid_step, silences are removed with the line x = x[x != self.sil_id], but in prepare_text.sh, the phone lm is built with silences. What is the rationale behind it?)

I used the default hyperparameters provided in config/gan/w2vu.yaml for training the model, but the script only kept checkpoint_best.pt and checkpoint_last.pt (because no_epoch_checkpoints is set to true in the config file), with the best checkpoint selected by weighted_lm_ppl, which seems to be the "vocabulary-usage adjusted entropy" mentioned on page 14 of the paper, except for a vocab_usage_power=2 hardcoded in unpaired_audio_text.py. I only used checkpoint_best.pt for the later steps, and did not train/validate other model configurations.
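For intuition only, here is one way such a vocabulary-usage adjusted perplexity could be computed. This is my own reading of the idea, not the actual code in unpaired_audio_text.py; the argument names mirror the fields that show up in the validation log further below.

```python
def weighted_lm_ppl(lm_ppl: float, vocab_seen_pct: float, vocab_usage_power: float = 2.0) -> float:
    """Illustrative sketch only: inflate the phone-LM perplexity of the generated
    transcripts when they use only a small fraction of the phone vocabulary.
    When the full vocabulary is seen (vocab_seen_pct == 1.0) this reduces to the
    plain perplexity, consistent with the validation log below
    (valid_lm_ppl == valid_weighted_lm_ppl == 201.512 at valid_vocab_seen_pct == 1).
    """
    return lm_ppl / (vocab_seen_pct ** vocab_usage_power)
```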

  4. I then invoke w2vu_generate.py as follows:

```
python w2vu_generate.py --config-dir config/generate --config-name viterbi \
    fairseq.task.data=$FAIRSEQ_ROOT/examples/wav2vec/unsupervised/files/wav2vec_out/precompute_pca512_cls128_mean_pooled/ \
    fairseq.common_eval.path=$FAIRSEQ_ROOT/examples/wav2vec/unsupervised/multirun/2021-05-27/04-18-34/0/checkpoint_last.pt \
    lm_model=$FAIRSEQ_ROOT/examples/wav2vec/unsupervised/files/wav2vec_out/precompute_pca512_cls128_mean_pooled/lm.phones.filtered.04.bin \
    fairseq.dataset.gen_subset=valid \
    results_path=$FAIRSEQ_ROOT/examples/wav2vec/unsupervised/files/wav2vec_transcriptions/

python scripts/wer.py \
    -s $FAIRSEQ_ROOT/examples/wav2vec/unsupervised/files/wav2vec_transcriptions/valid.txt \
    -r $FAIRSEQ_ROOT/examples/wav2vec/unsupervised/files/wav2vec_transcriptions/valid_ref.txt
```

It seems that the W2lViterbiDecoder selected by the default config/generate/viterbi.yaml requires an additional criterion argument, so I hardcoded it to ctc:

```
criterion: Optional[str] = field(default="ctc", metadata={"help": "VITERBI criterion?"},)
```

However, the wer.py script then reports the aforementioned 71.66% UER.

Any idea what needs to be changed to get close to the PER reported?

(Edit 06/01/2021)

  5. I have also noticed something I don't understand in the logging: first, it says that

```
[2021-05-31 15:02:13,185][fairseq.data.extracted_features_dataset][INFO] - loaded 3696, skipped 0 samples
[2021-05-31 15:02:13,185][fairseq.tasks.unpaired_audio_text][INFO] - split train has unpaired text? True
[2021-05-31 15:02:13,228][fairseq.data.data_utils][INFO] - loaded 3,696 examples from: /nobackup/users/junruin2/fairseq/examples/wav2vec/unsupervised/files/wav2vec_out/precompute_pca512_cls128_mean_pooled/train
[2021-05-31 15:02:17,351][fairseq.trainer][INFO] - NOTE: your device may support faster training with --fp16
[2021-05-31 15:02:17,611][fairseq.trainer][INFO] - begin training epoch 1
[2021-05-31 15:02:17,612][fairseq_cli.train][INFO] - Start iterating over samples
[2021-05-31 15:02:41,611][root][INFO] - Reducer buckets have been rebuilt in this iteration.
[2021-05-31 15:02:43,533][fairseq_cli.train][INFO] - begin validation on "valid" subset
[2021-05-31 15:02:58,392][valid][INFO] - {"epoch": 1, "valid_loss": "0.927", "valid_ntokens": "15334", "valid_nsentences": "400", "valid_lm_score_sum": -31856.94988822937, "valid_num_pred_chars": 13425.0, "valid_vocab_seen_pct": "1", "valid_uer": 92.72205556280163, "valid_weighted_lm_ppl": "201.512", "valid_lm_ppl": "201.512", "valid_wps": "0", "valid_wpb": "15334", "valid_bsz": "400", "valid_num_updates": "6"}
[2021-05-31 15:02:58,396][fairseq.checkpoint_utils][INFO] - Preparing to save checkpoint for epoch 1 @ 6 updates
[2021-05-31 15:02:58,398][fairseq.trainer][INFO] - Saving checkpoint to ./checkpoint_best.pt
[2021-05-31 15:02:58,477][fairseq.trainer][INFO] - Finished saving checkpoint to ./checkpoint_best.pt
[2021-05-31 15:02:58,534][fairseq.checkpoint_utils][INFO] - Saved checkpoint ./checkpoint_best.pt (epoch 1 @ 6 updates, score 201.51165633927232) (writing took 0.1390704633668065 seconds)
[2021-05-31 15:02:58,534][fairseq_cli.train][INFO] - end of epoch 1 (average epoch stats below)
[2021-05-31 15:02:58,541][train][INFO] - {"epoch": 1, "train_loss": "41.484", "train_ntokens": "616", "train_nsentences": "616", "train_temp": "2", "train_code_ppl": "14.041", "train_loss_code_pen": "0.044", "train_loss_smoothness": "15.845", "train_loss_dense_g": "0.695", "train_lm_score_sum": 0.0, "train_num_pred_chars": 0.0, "train_loss_grad_pen": "63.134", "train_loss_dense_d": "0.691", "train_loss_token_d": "0.693", "train_wps": "180.5", "train_ups": "0.3", "train_wpb": "616", "train_bsz": "616", "train_num_updates": "6", "train_lr_discriminator": "0.0005", "train_lr_generator": "0.0004", "train_gnorm": "34.87", "train_clip": "83.3", "train_train_wall": "10", "train_gb_free": "31.1", "train_wall": "45"}
```

Why does the log report "train_ntokens": "616" and "train_nsentences": "616", and why does a single epoch finish in 6 updates, even though the TIMIT train set has 3696 examples and the batch size in the config file has been set to 160?

  6. Although the UER is not great, for the purpose of getting the code to run I have also tried running train.sh in the kaldi_self_train directory. For the w2v features, which set should the script use? Should I use the segment-level, mean-pooled features used by the GAN? If so, Kaldi throws the error that the utterance has too few frames to align. I could only start Kaldi training with the features prepared in precompute_pca512, rather than those in precompute_pca512_cls128_mean_pooled.

(Edit 06/03/2021: While I could get Kaldi started with the features in precompute_pca512, the script got stuck at `steps/decode.sh --nj 20 --cmd "$decode_cmd" $exp_root/mono/graph $data/$valid $exp_root/mono/decode_$valid &` within train_subset_lgbeam.sh.)

Thanks very much for the help!

What's your environment?

JINGZIjingzi commented 3 years ago

I have the same problems with TIMIT and the Librispeech-10h corpus. But even worse, I got nearly nothing for the decode results with w2vu_generate.py. There is only one phone per line in the resulting valid.txt for Librispeech-10h. Do you know how to fix this?

JeromeNi commented 3 years ago

I have the same problems with TIMIT and Librispeech-10h corpus. But even worse, I nearly got nothing for the decode results with w2vu_generate.py. There is only one phone in a line for the result valid.txt of Librispeech-10h. Do you know how to fix this ……

I have not experienced this issue. For me, the output all follows a certain pattern (probably of the LM) but seems to have completely ignored the acoustic features:

sil hh iy z sil b eh sil s iy ah l ah ng ah sil m oy sil t f ih n dh ah z ih z sil d aa sil m aa r sil sil hh iy t m sil p er sil k er ih w ih sil g eh v m ih dx ae n sil z ih z ah sil k ih sil ch eh v iy s ih m sil k y uw n hh eh r w eh sil s sil sil hh iy ih s sil d l uw y uw l sil ch ay n ih hh ae sil r ih k l sil sil hh iy m aa r ah sil d ih z t uw n ah sil k s sil eh dx ah l z w ah l iy sil t l ih sil k n l t ih l ah sil k er sil sil hh iy t aa r l ae n sil g r l ng sil ah sil p iy n sil g r iy ng sil z ih s sil b ih dh ah sil b uw s t aa l hh eh s sil sil hh iy ih n sil iy er t s ey sil k uw sil jh aa sil t ah sil k l ay er z aa er s sil k aa r s eh m iy s ng sil sil s iy t ow l sil t er s sil k ah m l ay l f iy uw iy r iy v w aa n dh ah t aa l er sil m ih dx ah n ih sil sil hh iy sil t r uw ng z iy aa l ah n d ih f er sil b l er sil d er l iy t ae n sil ah sil k jh aa s sil sil hh ih n g hh aw n sil g eh m er ih n sil t er m y uw ng sil k sil t ay ih sil k w eh ih n z ih sil k ah sil b l aa sil d eh dx er v er r ih dx ah sil k ih s sil k m sil p eh sil

shiva1393 commented 3 years ago

Hi @JeromeNi @JINGZIjingzi, I got as far as the data preparation step "zsh scripts/prepare_text.sh language /path/to/text/file /output/dir". I then put the code under data, models, tasks into fairseq's corresponding directories and changed __init__.py under those directories. But I'm not able to get past the error below. Please give any suggestions on what extra changes are needed to run this code.

File "/fairseq/fairseq/models/wav2vec_u.py", line 19, in from fairseq.models import BaseFairseqModel, register_model ImportError: cannot import name 'register_model' from partially initialized module 'fairseq.models' (most likely due to a circular import) (fairseq/fairseq/models/init.py)

JINGZIjingzi commented 3 years ago

I haven't encountered this before. Perhaps the file paths are not correct. I copied the files in {fairseq_root}/fairseq/examples/wav2vec/unsupervised/{models, tasks, data} to {fairseq_root}/fairseq/{models/wav2vec, tasks, data}.
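If it helps, here is a throwaway sketch of that copy step (FAIRSEQ_ROOT is a placeholder; adjust the paths to your own checkout):

```python
import shutil
from pathlib import Path

FAIRSEQ_ROOT = Path("/path/to/fairseq")  # placeholder
UNSUP = FAIRSEQ_ROOT / "examples/wav2vec/unsupervised"

# Copy the unsupervised model/task/data modules into fairseq's own packages,
# mirroring the layout described above.
copies = {
    UNSUP / "models/wav2vec_u.py": FAIRSEQ_ROOT / "fairseq/models/wav2vec/wav2vec_u.py",
    UNSUP / "tasks/unpaired_audio_text.py": FAIRSEQ_ROOT / "fairseq/tasks/unpaired_audio_text.py",
    UNSUP / "data/extracted_features_dataset.py": FAIRSEQ_ROOT / "fairseq/data/extracted_features_dataset.py",
    UNSUP / "data/random_input_dataset.py": FAIRSEQ_ROOT / "fairseq/data/random_input_dataset.py",
}
for src, dst in copies.items():
    shutil.copy(src, dst)
    print(f"copied {src} -> {dst}")
```

The new classes still need to be registered in the corresponding __init__.py files.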

shiva1393 commented 3 years ago

Thank you @JINGZIjingzi,

I tried running the code after copying the files from {fairseq_root}/fairseq/examples/wav2vec/unsupervised/{models, tasks, data} to {fairseq_root}/fairseq/{models/wav2vec, tasks, data}, but it still didn't work.

Let me explain exactly what I changed.

1. fairseq/models/wav2vec folder:
i) copied {fairseq_root}/fairseq/examples/wav2vec/unsupervised/models/wav2vec_u.py to {fairseq_root}/fairseq/models/wav2vec/wav2vec_u.py
ii) added one more line, from .wav2vec_u import Wav2vec_U, to {fairseq_root}/fairseq/models/wav2vec/__init__.py

2. fairseq/tasks folder:
i) copied {fairseq_root}/fairseq/examples/wav2vec/unsupervised/tasks/unpaired_audio_text.py to {fairseq_root}/fairseq/tasks/unpaired_audio_text.py

3. fairseq/data folder:
i) copied {fairseq_root}/fairseq/examples/wav2vec/unsupervised/data/extracted_features_dataset.py and random_input_dataset.py to {fairseq_root}/fairseq/data/
ii) added the following lines to {fairseq_root}/fairseq/data/__init__.py:

```
from .extracted_features_dataset import ExtractedFeaturesDataset
from .random_input_dataset import RandomInputDataset

__all__ = [
    # ... already existing names ...
    "ExtractedFeaturesDataset",
    "RandomInputDataset",
]
```

I have been trying the preprocessing and training for a week and am stuck at this point. Could you please give any suggestions? Thank you.

JINGZIjingzi commented 3 years ago

Are the log messages still the same? I cannot figure out what is wrong with your code; I made the same changes as you described above.

shiva1393 commented 3 years ago

Hi @JINGZIjingzi,

Before, when I kept the code under {fairseq_root}/fairseq/{models, tasks, data}, I got the previously posted error.

Following your suggestion, I changed it to {fairseq_root}/fairseq/{models/wav2vec, tasks, data}.

Now I'm getting the following error:

```
File "fairseq/fairseq_cli/hydra_train.py", line 45, in hydra_main
    distributed_utils.call_main(cfg, pre_main)
File "fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
File "fairseq/fairseq_cli/train.py", line 88, in main
    task = tasks.setup_task(cfg.task)
File "fairseq/fairseq/tasks/__init__.py", line 45, in setup_task
    assert task is not None, f"Could not infer task type from {cfg}. Available tasks: {TASK_REGISTRY.keys()}"
AssertionError: Could not infer task type from {'_name': 'unpaired_audio_text', 'data': 'fairseq/exp/feats/precompute_pca512_cls128_mean_pooled', 'text_data': 'fairseq/exp/lm/phones', 'labels': 'phn', 'sort_by_length': False, 'unfiltered': False, 'max_length': None, 'append_eos': False, 'kenlm_path': 'fairseq/exp/lm/phones/lm.phones.filtered.04.bin'}. Available tasks: dict_keys(['hubert_pretraining', 'speech_recognition', 'gan_audio_pretraining_feats', 'translation', 'multilingual_translation', 'semisupervised_translation', 'translation_multi_simple_epoch', 'translation_from_pretrained_bart', 'legacy_masked_lm', 'cross_lingual_lm', 'denoising', 'multilingual_denoising', 'language_modeling', 'audio_pretraining', 'multilingual_masked_lm', 'sentence_ranking', 'speech_to_text', 'simul_speech_to_text', 'simul_text_to_text', 'online_backtranslation', 'masked_lm', 'translation_lev', 'translation_from_pretrained_xlm', 'sentence_prediction', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt'])
```

JINGZIjingzi commented 3 years ago

I got it. You can change @register_task("gan_audio_pretraining_feats", dataclass=UnpairedAudioTextConfig) in tasks/unpaired_audio_text.py to @register_task("unpaired_audio_text", dataclass=UnpairedAudioTextConfig).
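For clarity, the change amounts to something like the sketch below. The real class and config live in unpaired_audio_text.py; the stub here only illustrates the registration pattern, with the class and config bodies elided.

```python
from dataclasses import dataclass
from fairseq.dataclass import FairseqDataclass
from fairseq.tasks import register_task
from fairseq.tasks.fairseq_task import FairseqTask

@dataclass
class UnpairedAudioTextConfig(FairseqDataclass):
    ...  # real fields (data, text_data, labels, kenlm_path, ...) are defined in the actual file

# The registered name must match the task '_name' hydra puts in the config
# ('unpaired_audio_text' in the assertion error above). With the original
# 'gan_audio_pretraining_feats' name, setup_task cannot resolve the task.
@register_task("unpaired_audio_text", dataclass=UnpairedAudioTextConfig)
class UnpairedAudioText(FairseqTask):
    ...
```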

shiva1393 commented 3 years ago

Thank you @JINGZIjingzi, now it works. If I get proper results I will report back.

JeromeNi commented 3 years ago

Update: the LibriSpeech-100h result now looks fine. The phone error rate from w2vu_generate.py is around 28% for one of the runs. I had simply messed up the order of the transcriptions in the *.phn files used for evaluation earlier.

JINGZIjingzi commented 3 years ago

That's great! I am still stuck on Librispeech-10h. My dict.phn.txt looks like this (the first number in each line is not part of the file; it is only shown to mark the line). But I don't know how to generate the corresponding *.phn files; there are only *.wrd and *.ltr files, so I did not use any labels when training the GAN model. The results from w2vu_generate.py are nearly empty. Could you give some advice? Thanks a lot!

 1 s 4399
 2 n 4396
 3 ɪ 4256
 4 t 3656
 5 d 3250
 6 ɹ 2975
 7 l 2951
 8 ə 2812
 9 k 2779
 10 ɛ 2047
 11 z 2012
 12 p 1993
 13 m 1986
 14 æ  1537
 15 ɚ 1464
 16 ᵻ 1294
 17 ŋ  1223
 18 b 1206
 19 eɪ 1206
 20 f 1180
 21 i 1175
 22 v 920
 23 ʌ 900
 24 ʃ 823
 25 iː  803
 26 ɑ ː  779
 27 oʊ 775
 28 aɪ 755
 29 ɡ  755
 30 ɾ 750
 31 uː  699
 32 əl 678
 33 w 632
 34 dʒ 529
 35 h 501
...
 57 aɪə 40
 58 n̩ 35
 59 ʔ 34
 60 <SIL> 0
JeromeNi commented 3 years ago

That's great! I am still stuck in the Librispeech-10h. My dict.phn.txt is like this. (The first number in each line is not included in the file. They are just showed to note the line.) But I don't know how to generate the corresponding .phn files. There are only .wrd and *.ltr files. So I did not use any labels when training a gan model. The results with w2vu_generate.py are nearly empty. Could u give some advice? Thanks a lot!

1 s 4399
2 n 4396
3 ɪ 4256
4 t 3656
5 d 3250
6 ɹ 2975
7 l 2951
8 ə 2812
9 k 2779
10 ɛ 2047
11 z 2012
12 p 1993
13 m 1986
14 æ  1537
15 ɚ 1464
16 ᵻ 1294
17 ŋ  1223
18 b 1206
19 eɪ 1206
20 f 1180
21 i 1175
22 v 920
23 ʌ 900
24 ʃ 823
25 iː  803
26 ɑ ː  779
27 oʊ 775
28 aɪ 755
29 ɡ  755
30 ɾ 750
31 uː  699
32 əl 678
33 w 632
34 dʒ 529
35 h 501
...
57 aɪə 40
58 n̩ 35
59 ʔ 34
60 <SIL> 0

I used the package referenced in a footnote of the paper (https://github.com/Kyubyong/g2p) to convert the English text to phoneme transcriptions. I removed all the numerical stress markers from the vowels, plus an additional (') symbol, to get the following 39-phoneme set as described in the original paper:

```
AH 336212 N 238016 S 209100 T 194878 L 188633 IH 182116 R 172703 K 154411 IY 138375 Z 128619 D 124602 M 113743 ER 101165 EH 100869 AA 98322 AE 84627 B 81689 P 80531 OW 69927 G 55230 F 53820 EY 47962 UW 43357 V 42622 AO 42569 W 38987 AY 35320 HH 34826 NG 32503 SH 30457 JH 25340 Y 21735 CH 20112 TH 17105 AW 13323 UH 8315 OY 5219 DH 2440 ZH 1404
<SIL> 0
```

You can use the following code to generate the *.phn files (I modified it from one of the provided scripts, just to remove the additional lexical stress markers): https://gist.github.com/JeromeNi/2d3118d9685a9ea4cdcc66d5bc8659c8
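The core of it looks something like the sketch below. This is not the exact script from the gist, just an outline assuming the g2p_en package from the repo linked above; the file names are placeholders, and the stress-stripping regex is the part that matters.

```python
import re
from g2p_en import G2p  # pip install g2p_en (the Kyubyong/g2p package linked above)

g2p = G2p()

def text_to_phones(line: str) -> str:
    """Convert a word transcript into a space-separated phone string,
    dropping lexical stress digits (e.g. AH0 -> AH) and the (') symbol."""
    phones = []
    for p in g2p(line.strip()):
        p = re.sub(r"\d", "", p)      # remove stress markers
        if p.strip() and p != "'":    # skip word-boundary spaces and apostrophes
            phones.append(p)
    return " ".join(phones)

# Write train.phn with one phone sequence per line, in the same order
# as the corresponding train.wrd / train.tsv.
with open("train.wrd") as fin, open("train.phn", "w") as fout:
    for line in fin:
        fout.write(text_to_phones(line) + "\n")
```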

To train the GAN, you may not need the parallel text from LibriSpeech, but you do need some sort of non-parallel text. I simply used the LibriSpeech LM corpus (https://www.openslr.org/11/; librispeech-lm-norm.txt.gz), which I believe was also used in the authors' LibriSpeech experiments. The README there says that "the texts were selected so that to avoid using texts that have even partial overlap with the books on which the LibriSpeech training and development sets are based". The corpus is fairly large and can cause some computational burden when generating the FSTs (though those are likely not required for GAN training), but I just used the whole corpus anyway.

I have yet to figure out why the model fails on TIMIT though.

JINGZIjingzi commented 3 years ago

Thank you so much! I will try again following your steps.

JINGZIjingzi commented 3 years ago

I have installed the wav2letter python bindings, which are a bit easier to install.

JINGZIjingzi commented 3 years ago

@JeromeNi I tried the new pipeline, and the results seem a bit better. The PER on my valid set is 64.4% with the Viterbi decoder, but for my test set (dev_clean in Librispeech) the PER is 89.3%. To train faster, I only used 1% of the librispeech-lm-norm.txt text data and the 10h subset of Librispeech for the audio data. I also changed max_update in GAN training from 150000 to 50000. I am not sure whether the amount of data is the main reason for the bad results. Have you tried a smaller data size, and how was the performance?

Besides, I ran into another problem when generating results with w2vu_generate.py. The outputs of the Generator in wav2vec_u.py include some nan values. The Viterbi decoder seems fine with the nans, while the kenlm decoder outputs nothing for emissions containing nan, so I cannot use w2vu_generate.py to obtain results with the kenlm decoder. Have you seen this before?

JINGZIjingzi commented 3 years ago

@JeromeNi I tried the new pipeline, and the results seem a bit better. The PER of my valid set is 64.4% with viterbi decoder, but for my test set (dev_clean in Librispeech) the PER is 89.3%. To train faster, I only use 1% of the librispeech-lm-norm.txt text data and 10h subset from Librispeech for audio data. I also change max_update in GAN training from 150000 to 50000. I am not sure whether the amount of data is the main reason for the bad results. Have you ever tried smaller datasize and how is the performance?

Besides, I meet another problem when generating the results with w2vu_generate.py. The outputs of Generator in wav2vec_u.py include some nan symbols. The VITERBI decoder seems fine with the nans while the KENLM decoder would output none for the emissions with nan. So I can not use w2vu_generate.py to obtain the results with kenlm decoder. Do you met this before?

The problems are fixed. I had not removed the silences from the test set before. The results look right after preparing the test data again: the PER on the test set (dev_clean in Librispeech) is 24%.

JeromeNi commented 3 years ago

@JeromeNi I tried the new pipeline, and the results seem a bit better. The PER of my valid set is 64.4% with viterbi decoder, but for my test set (dev_clean in Librispeech) the PER is 89.3%. To train faster, I only use 1% of the librispeech-lm-norm.txt text data and 10h subset from Librispeech for audio data. I also change max_update in GAN training from 150000 to 50000. I am not sure whether the amount of data is the main reason for the bad results. Have you ever tried smaller datasize and how is the performance? Besides, I meet another problem when generating the results with w2vu_generate.py. The outputs of Generator in wav2vec_u.py include some nan symbols. The VITERBI decoder seems fine with the nans while the KENLM decoder would output none for the emissions with nan. So I can not use w2vu_generate.py to obtain the results with kenlm decoder. Do you met this before?

The problems are fixed. I didn't remove the silence of test set before. The results seem right when I prepare the test data again. The PER of test (dev_clean in Librispeech) is 24%.

That's great! Is that the PER reported by w2vu_generate.py directly? This is also around the PER I get when using 100h of LibriSpeech clean. How many different hyperparameter settings have you tried so far?

JINGZIjingzi commented 3 years ago

@JeromeNi I tried the new pipeline, and the results seem a bit better. The PER of my valid set is 64.4% with viterbi decoder, but for my test set (dev_clean in Librispeech) the PER is 89.3%. To train faster, I only use 1% of the librispeech-lm-norm.txt text data and 10h subset from Librispeech for audio data. I also change max_update in GAN training from 150000 to 50000. I am not sure whether the amount of data is the main reason for the bad results. Have you ever tried smaller datasize and how is the performance? Besides, I meet another problem when generating the results with w2vu_generate.py. The outputs of Generator in wav2vec_u.py include some nan symbols. The VITERBI decoder seems fine with the nans while the KENLM decoder would output none for the emissions with nan. So I can not use w2vu_generate.py to obtain the results with kenlm decoder. Do you met this before?

The problems are fixed. I didn't remove the silence of test set before. The results seem right when I prepare the test data again. The PER of test (dev_clean in Librispeech) is 24%.

That's great! Is it the PER reported by wav2vec_generate directly? This is also around the PER I get when using 100h of LibriSpeech clean. How many different hyperparameter settings have you tried so far?

Yes, the 24% PER is reported by w2vu_generate.py directly. I use 10h of audio from Librispeech and 10% of the librispeech-lm-norm.txt text. The hyperparameter settings are the defaults, except for best_checkpoint_metric in the config: I changed it from weighted_lm_ppl to uer, so labels for the valid set are needed. At the same time, the experiment with weighted_lm_ppl shows poor performance; its test PER is 86%. Do you have any idea about this?

JeromeNi commented 3 years ago

I did not use UER because I wanted the experiment to be completely unsupervised. How many seeds have you tried? I remember that for one of the random seeds in the range 0-4, I got a much higher LM perplexity score with a much higher UER as well.

There was an update to wav2vec-U yesterday, so I am still updating my local repo before I can try again...

JINGZIjingzi commented 3 years ago

The seed is set to 1 for all the exps. I will try other seeds. Thanks!

shiva1393 commented 3 years ago

Hi @JINGZIjingzi @JeromeNi, after training on 100h of LibriSpeech audio and the librispeech-lm-norm.txt text data, I got the following results:

```
ref : I HAVE NOT THE PLEASURE OF UNDERSTANDING YOU SAID HE
ref : AY HH AE V N AA T DH AH P L EH ZH ER AH V AH N D ER S T AE N D IH NG Y UW S EH D HH IY
hyp : M AY D DH AE T AH N DH AH HH AA R TH NG IH NG K AH DH ER S AH N K AH N L IY K AO T M AE IH NG
```

```
ref: SAID SYLVIA SHIVERING ALL OVER WITH PASSION
ref: S EH D S IH L V IY AH SH IH V ER IH NG AO L OW V ER W IH TH P AE SH AH N
hyp: AO T ER T AH SH AH K IY M R OW P AH EH R AH N D HH EH N R OW IH NG
```

The results are wrong, but the number of phonemes in each hypothesis approximately matches the length of the original phoneme sequence. Where do we need to change the seed? I did not find that while training.

JeromeNi commented 3 years ago

Hi @JINGZIjingzi ,@JeromeNi, After training 100h librispeech speech data and librispeech-lm-norm.txt text data, i got following results

ref : I HAVE NOT THE PLEASURE OF UNDERSTANDING YOU SAID HE ref : AY HH AE V N AA T DH AH P L EH ZH ER AH V AH N D ER S T AE N D IH NG Y UW S EH D HH IY hyp : M AY D DH AE T AH N DH AH HH AA R TH NG IH NG K AH DH ER S AH N K AH N L IY K AO T M AE IH NG

ref: SAID SYLVIA SHIVERING ALL OVER WITH PASSION ref: S EH D S IH L V IY AH SH IH V ER IH NG AO L OW V ER W IH TH P AE SH AH N hyp: AO T ER T AH SH AH K IY M R OW P AH EH R AH N D HH EH N R OW IH NG

I'm getting wrong results but number of phonemes lenght in a sequence matching approximately original phoneme sequence. Where we need to change seed, while training i didn't get that.

How many iterations have you run? From the log, it seems that I did not get lower than 80% UER during training until past 30k updates, but from there it only took an additional 10k updates to achieve lower than 30% UER.

Or maybe check whether your reference transcripts in *.wrd/*.phn match the order of the entries in the *.tsv files? Running wav2vec_manifest.py on a new directory of silence-removed audio files may rearrange the utterance order in the new tsv files. A quick sanity check like the sketch below can help.
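This is a rough check, not part of the official pipeline; the file names and the frames-per-phone bounds are my own assumptions, and the tsv is the usual fairseq manifest (root directory on the first line, then path<TAB>num_frames per line).

```python
from pathlib import Path

def check_order(tsv_path: str, phn_path: str, frames_per_phone=(200, 5000)):
    """Spot-check that each .phn line plausibly belongs to the utterance on the
    corresponding .tsv line by comparing audio length to phone count."""
    tsv_lines = Path(tsv_path).read_text().splitlines()
    entries = tsv_lines[1:]  # first line is the audio root directory
    phn_lines = Path(phn_path).read_text().splitlines()
    assert len(entries) == len(phn_lines), (
        f"{len(entries)} audio entries vs {len(phn_lines)} transcripts")
    lo, hi = frames_per_phone
    for i, (entry, phones) in enumerate(zip(entries, phn_lines)):
        path, n_frames = entry.split("\t")
        ratio = int(n_frames) / max(len(phones.split()), 1)
        if not (lo <= ratio <= hi):
            print(f"line {i}: suspicious frames/phone ratio {ratio:.0f} for {path}")

check_order("train.tsv", "train.phn")
```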

shiva1393 commented 3 years ago

Hi, the log below is from the final epoch, 834 @ 150000 updates.

Train log:

```
{"epoch": 834, "train_loss": "3.973", "train_ntokens": "157.383", "train_nsentences": "157.383", "train_temp": "0.1", "train_code_ppl": "16.575", "train_loss_code_pen": "0.644", "train_loss_smoothness": "2.901", "train_loss_dense_g": "4.421", "train_lm_score_sum": 0.0, "train_num_pred_chars": 0.0, "train_loss_grad_pen": "0.06", "train_loss_dense_d": "0.026", "train_loss_token_d": "0.024", "train_wps": "672", "train_ups": "4.27", "train_wpb": "157.4", "train_bsz": "157.4", "train_num_updates": "150000", "train_lr_discriminator": "0.0005", "train_lr_generator": "0.0004", "train_gnorm": "34.501", "train_clip": "53.3", "train_train_wall": "11", "train_gb_free": "22.3", "train_wall": "29399"}
```

Valid: I'm not using a real validation set; just to get past an error, I kept a few utterances with random text. Because the training is unsupervised, it should not depend on the valid set, right? After training finished, I checked the generated phonemes against the original phoneme sequences. Did I make a mistake here?

I am using the wav2vec_vox_new.pt model to extract wav2vec 2.0 features from the 14th layer, with librispeech-lm-norm.txt as the text. My dict.phn.txt:

```
AH 336212 N 238016 S 209100 T 194878 L 188633 IH 182116 R 172703 K 154411 IY 138376 Z 128619 D 124602 M 113743 ER 101165 EH 100869 AA 98322 AE 84627 B 81689 P 80531 OW 69927 G 55230 F 53820 EY 47962 UW 43357 V 42622 AO 42569 W 38987 AY 35319 HH 34826 NG 32503 SH 30457 JH 25340 Y 21735 CH 20112 TH 17105 AW 13323 UH 8315 OY 5219 DH 2440 ' 2339 ZH 1404
<SIL> 0
```

From the discussion above, I will remove the (') symbol later. Finally, were you able to map phonemes to words and reach the WER given in the paper?

JINGZIjingzi commented 3 years ago

I'm not using any validation set just to overcome error i kept few utterances with random text. because it is unsupervised it won't depend on valid set right? after training completely i checked generated phonemes with original phoneme sequence. Here i made any mistake?

I think a validation set is needed even though the training is unsupervised.

Finally are you able to map phonemes to words with given WER in paper?

I have some problems generating the HLG.fst file, which is needed for the Kaldi decoder, so I can't get the WER so far.

shiva1393 commented 3 years ago

Thank you @JINGZIjingzi @JeromeNi. I prepared the features again (maybe I had missed something before), used seed 0 while training, ran about 1000 epochs on roughly 80 hours of LibriSpeech clean data, tested on the same dataset (the 80-hour LibriSpeech subset), and got 21% PER.

Ning107 commented 3 years ago

@JeromeNi @JINGZIjingzi @shiva1393 Can you please explain a bit why we should copy the code files under {fairseq_root}/fairseq/? I don't understand.

Thanks in advance for your help!