facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

wav2vec unsupervised GAN training poor results #3754

Closed marcinkusz closed 2 years ago

marcinkusz commented 3 years ago

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I am trying to use wav2vec unsupervised to train a Lithuanian speech recognition model. I have successfully converted the audio and text data (see below), but I am struggling to get an acceptable UER. Currently, after training for 150k steps, the UER has improved from 180% (measured after 30k steps) to ~120%, which is still ridiculously bad. I describe my process below; I would like to know what I am doing wrong.

Code

All code snippets are in the "What have you tried?" section.

What have you tried?

Audio prep:

I started by removing silences and preparing the transcription and split files for the conversion. The results were audio files without silences, stored in /home/usr/w2v/audio/clips_no_silence/clips, and transcription/split files {train, test, valid}.{tsv, phn, wrd, ltr} plus dict.{train, test, valid} files, stored in /home/usr/w2v/audio/clips_no_silence/manifests. I used the splits provided by Common Voice. Headers of the prepared files:

train.tsv
```
/home/usr/w2v/audio/clips_no_silence/clips
common_voice_lt_25141073.wav 36320
common_voice_lt_25170795.wav 50880
...
```

train.phn
```
d a ɭ i s r uː ɕʲ uː t a r p u s a vʲ iː j e k rʲ iː ʒ mʲ i n a ʂ i kʲ ɪ t aː dʲ ie n aː b rʲ ɪ t uː p a j ee ɡ oː s a tʲ ʂ i t r au kʲ ee j uː r a k oː p ɭ iː tɕʲ aː s u p a a n t k a pʲ e i s u ɡ e ɭ e ʒ i nʲ e i s i r a k mʲ e nʲ i nʲ e i s k rʲ iː ʒ e i s
...
```

train.wrd
```
dalis rūšių tarpusavyje kryžminasi kitą dieną britų pajėgos atsitraukė jūra koplyčią supa antkapiai su geležiniais ir akmeniniais kryžiais
...
```

train.ltr
```
d a l i s | r ū š i ų | t a r p u s a v y j e | k r y ž m i n a s i | k i t ą | d i e n ą | b r i t ų | p a j ė g o s | a t s i t r a u k ė | j ū r a | k o p l y č i ą | s u p a | a n t k a p i a i | s u | g e l e ž i n i a i s | i r | a k m e n i n i a i s | k r y ž i a i s |
...
```

dict.train
```
a a
abajus a b a j u s
abdullah a b d u l̩ l̩ a h
abejonių a bʲ e j oː nʲʲ uː
abejotinos a bʲ e j oː tʲ ɪ n oː s
...
```
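A quick check worth running at this point (my own helper, not part of the fairseq scripts) is that the manifests and label files are line-aligned, since an off-by-one here silently corrupts everything downstream:

```bash
# Every split's .tsv manifest and its .phn/.wrd/.ltr label files must have
# matching counts; the .tsv has one extra header line with the audio root dir.
cd /home/usr/w2v/audio/clips_no_silence/manifests
for split in train valid test; do
  audio=$(( $(wc -l < $split.tsv) - 1 ))
  for ext in phn wrd ltr; do
    echo "$split: $audio audio files vs $(wc -l < $split.$ext) lines in $split.$ext"
  done
done
```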

Next, I used the prepare_audio.sh script to prepare the audio for training:

```
zsh prepare_audio.sh /home/usr/w2v/audio/clips_no_silence/manifests /home/usr/w2v/preprocessed_audio /home/usr/w2v/models/xlsr_53_56k.pt
```

I realise that the results would be better if I used a wav2vec model pre-trained specifically for Lithuanian; I am currently working on making it work with this pretrained model. Even with this, though, the UER still seems far too high to me.

Audio conversion logs

**Note:** this was done on Google Colab, so the paths in the logs differ from the settings above. I adjusted the values in the tsv files before conversion.

```
using 512 dim for PCA
extracting from layer 14
processing splits: train valid test
tcmalloc: large alloc 1269563392 bytes == ...
100% 4743/4743 [09:36<00:00, 8.22it/s]
tcmalloc: large alloc 1269563392 bytes == ...
100% 33/33 [00:23<00:00, 1.38it/s]
tcmalloc: large alloc 1269563392 bytes == ...
100% 3398/3398 [38:33<00:00, 1.47it/s]
Faiss Specs: [faiss_spec(pca=0, norm=False, n_clus=128, sphere=False, spec_str='CLUS128')]
tcmalloc: large alloc 1269563392 bytes == ...
100% 4743/4743 [55:31<00:00, 1.42it/s]
tcmalloc: large alloc 3574292480 bytes == ...
(872630, 1024)
Processing spec faiss_spec(pca=0, norm=False, n_clus=128, sphere=False, spec_str='CLUS128')
Computing kmeans
Clustering 872630 points in 1024D to 128 clusters, redo 3 times, 50 iterations
Preprocessing in 0.94 s
Outer iteration 0 / 3
Iteration 49 (59.77 s, search 39.75 s): objective=7.65299e+09 imbalance=1.076 nsplit=0
Objective improved: keep new clusters
Outer iteration 1 / 3
Iteration 49 (119.56 s, search 79.52 s): objective=7.64923e+09 imbalance=1.079 nsplit=0
Objective improved: keep new clusters
Outer iteration 2 / 3
Iteration 49 (179.23 s, search 119.21 s): objective=7.68091e+09 imbalance=1.078 nsplit=0
Faiss Spec: faiss_spec(pca=0, norm=False, n_clus=128, sphere=False, spec_str='CLUS128')
Loaded centroids (128, 1024)
tcmalloc: large alloc 1269563392 bytes == ...
100% 4743/4743 [03:46<00:00, 20.92it/s]
Faiss Spec: faiss_spec(pca=0, norm=False, n_clus=128, sphere=False, spec_str='CLUS128')
Loaded centroids (128, 1024)
tcmalloc: large alloc 1269563392 bytes == ...
100% 33/33 [00:21<00:00, 1.52it/s]
Faiss Spec: faiss_spec(pca=0, norm=False, n_clus=128, sphere=False, spec_str='CLUS128')
Loaded centroids (128, 1024)
tcmalloc: large alloc 1269563392 bytes == ...
100% 3398/3398 [40:00<00:00, 1.42it/s]
Reading features
Computing PCA
data path: /content/gdrive/MyDrive/legego/preprocessed_audio_new/train
0% 0/1 [00:00
```
Audio conversion output

- The main directory contains all of the aforementioned transcription files plus `{test, valid, train}.{npy, lengths}` files.
- `preprocessed_audio/pca` contains `512_pca_A.npy` and `512_pca_b.npy`.
- `preprocessed_audio/CLUS128` contains `centroids.npy` and `{test, train, valid}.{phn, src, tsv}`.
- There are three `preprocessed_audio/precompute_pca512*` directories (the base one plus the cls128 mean and mean_pooled variants), all containing `{test, valid, train}.{lengths, npy, phn, tsv, wrd}`.
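A sketch for double-checking these outputs before training (my own helper; it relies only on the `.npy`/`.lengths`/`.phn` layout listed above):

```bash
# For each split: the number of utterances must match across .lengths and .phn,
# and the per-utterance lengths must sum to the number of feature rows.
DIR=/home/usr/w2v/preprocessed_audio/precompute_pca512_cls128_mean_pooled
for split in train valid test; do
  python -c "
import numpy as np
feats = np.load('$DIR/$split.npy')
lengths = [int(x) for x in open('$DIR/$split.lengths')]
phones = open('$DIR/$split.phn').read().splitlines()
print('$split', feats.shape, len(lengths), len(phones), sum(lengths) == len(feats))
"
done
```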
Text prep:

For the text data, I am using the Leipzig University corpora: Wikipedia 2020, Web 2016 and 2020, News 2020, and Newscrawl 2014-2016. I filter out all sentences containing non-Lithuanian letters and run the prepare_text.sh script. My output directory is /home/usr/w2v/preprocessed_text.
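Roughly, the filtering and the text preparation call look like the sketch below. The grep pattern is my own ad-hoc filter (the corpus is already lowercased), and the prepare_text.sh argument order (language code, raw text, output dir, vocabulary threshold, phonemizer, fastText LID model) follows the wav2vec-u README example, so double-check it against the README in your fairseq checkout:

```bash
# keep only sentences made of Lithuanian letters and spaces (ad-hoc filter)
grep -P "^[aąbcčdeęėfghiįyjklmnoprsštuųūvzž ]+$" sentences.raw.txt > sentences.lt.txt

# text preparation, as in the wav2vec-u README (arguments may differ between versions)
zsh prepare_text.sh lt sentences.lt.txt /home/usr/w2v/preprocessed_text 1000 espeak /path/to/lid.176.bin
```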

Text conversion output

The `preprocessed_text` directory contains:
- `dict.txt`
- `kenlm.wrd.o40003.{arpa, bin}`
- `lexicon.lst`
- `lexicon_filtered.lst`
- `lm.upper.lid.txt`
- `phones.txt`
- `preprocess.log`
- `words.txt`

`preprocessed_text/phones` contains:
- `dict.phn.txt`
- `dict.txt`
- `lm.phones.filtered.{04, 06}.{arpa, bin}`
- `lm.phones.filtered.txt`
- `preprocessed.log`
- `train.{bin, idx}`

`preprocessed_text/fst` contains three directories (`phn_to_phn_sil`, `phn_to_words`, `phn_to_words_sil`), each containing:
- `G_lm.phones.filtered.06.fst`
- `H.phn.fst`, `H.phn.fst.isym`
- `H.phn.fstisym_disambig.int`
- `HLG.phn.lm.phones.filtered.06.fst`
- `HLGa.phn.lm.phones.filtered.06.fst`
- `kaldi_dict.h_out.phn.txt`
- `kaldi_dict.lm.phones.filtered.06.txt`
- `kaldi_dict.lm.phones.filtered.06.txt_disamig`
- `kaldi_dict.phn.txt`
- `kaldi_dict.phn_disambig.txt`
- `kaldi_dict.phn_disambig.txt.int`
- `kaldi_dict.phn_disambig.txt_disamig`
- `kaldi_lexicon.phn.lm.phones.filtered.06.txt`
- `kaldi_lexicon.phn.lm.phones.filtered.06_disambig.txt`
- `L.phn.lm.phones.filtered.06.fst`
- `LG.phn.lm.phones.filtered.06.fst`
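One thing worth verifying at this point (my own check, not part of the pipeline): the phone inventory produced on the text side has to line up with the phones used in the audio-side .phn transcripts, since phones that appear in the audio-side references but not in the text-side dictionary can never be predicted:

```bash
# Compare the text-side phone dictionary (fairseq "symbol count" format) with
# the phones that actually occur in the audio-side transcripts; differences
# (other than the silence token) point to a phonemizer mismatch.
cut -d' ' -f1 /home/usr/w2v/preprocessed_text/phones/dict.phn.txt | sort > text_phones.txt
tr ' ' '\n' < /home/usr/w2v/audio/clips_no_silence/manifests/train.phn | sed '/^$/d' | sort -u > audio_phones.txt
comm -3 text_phones.txt audio_phones.txt
```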
GAN training:

For GAN training, I use the following script:

PREFIX="w2v_unsup_gan_xp"

# Path to audio/precompute_pca512_cls128_mean_pooled  
TASK_DATA="/home/usr/w2v/preprocessed_audio/precompute_pca512_cls128_mean_pooled"

# path to fairseq-preprocessed GAN data (phones dir)
TEXT_DATA="/home/usr/w2v/preprocessed_text/phones"

# KenLM 4-gram phoneme language model (LM data = GAN data here)
KENLM_PATH="/home/usr/w2v/preprocessed_text/phones/lm.phones.filtered.04.bin"

OUT_PATH=/home/usr/w2v/gan_out
mkdir -p $OUT_PATH
# Path to the config file of the unsupervised GAN model
CONFIG_PATH="${FAIRSEQ_ROOT}/examples/wav2vec/unsupervised/config/gan"

PYTHONPATH=$FAIRSEQ_ROOT PREFIX=$PREFIX CUDA_LAUNCH_BLOCKING=1 fairseq-hydra-train \
    -m --config-dir ${CONFIG_PATH} \
    --config-name w2vu \
    task.data=${TASK_DATA} \
    task.text_data=${TEXT_DATA} \
    task.kenlm_path=${KENLM_PATH} \
    checkpoint.no_epoch_checkpoints=true \
    checkpoint.keep_last_epochs=20 \
    checkpoint.save_dir=${OUT_PATH} \
    common.user_dir=${FAIRSEQ_ROOT}/examples/wav2vec/unsupervised \
    model.code_penalty=2,4 model.gradient_penalty=1.5,2.0 \
    model.smoothness_weight=0.5,0.75,1.0 'common.seed=range(0,5)'

I didn't edit the config files at all.

A snippet of the GAN training logs

```
[2021-07-31 16:17:00,565][fairseq.trainer][INFO] - begin training epoch 612
[2021-07-31 16:17:00,566][fairseq_cli.train][INFO] - Start iterating over samples
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
...
[2021-07-31 16:17:13,969][fairseq_cli.train][INFO] - end of epoch 612 (average epoch stats below)
[2021-07-31 16:17:13,972][train][INFO] - {"epoch": 612, "train_loss": "2.545", "train_ntokens": "153", "train_nsentences": "153", "train_temp": "0.775", "train_code_ppl": "10.728", "train_loss_code_pen": "0.312", "train_loss_smoothness": "1.724", "train_loss_dense_g": "2.912", "train_lm_score_sum": 0.0, "train_num_pred_chars": 0.0, "train_loss_grad_pen": "0.356", "train_loss_dense_d": "0.072", "train_loss_token_d": "0.067", "train_wps": "352.9", "train_ups": "2.31", "train_wpb": "153", "train_bsz": "153", "train_num_updates": "18972", "train_lr_discriminator": "0.0005", "train_lr_generator": "0.0004", "train_gnorm": "25.175", "train_clip": "90.3", "train_train_wall": "13", "train_gb_free": "4.8", "train_wall": "7566"}
[2021-07-31 16:17:14,009][fairseq.trainer][INFO] - begin training epoch 613
[2021-07-31 16:17:14,010][fairseq_cli.train][INFO] - Start iterating over samples
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
...
[2021-07-31 16:17:26,101][train_inner][INFO] - {"epoch": 613, "update": 612.903, "loss": "2.73", "ntokens": "153.49", "nsentences": "153.49", "temp": "0.775", "code_ppl": "10.712", "loss_code_pen": "0.306", "loss_smoothness": "1.707", "loss_dense_g": "2.98", "lm_score_sum": 0.0, "num_pred_chars": 0.0, "loss_grad_pen": "0.35", "loss_dense_d": "0.068", "loss_token_d": "0.075", "wps": "358", "ups": "2.33", "wpb": "153.5", "bsz": "153.5", "num_updates": "19000", "lr_discriminator": "0.0005", "lr_generator": "0.0004", "gnorm": "26.02", "clip": "93", "train_wall": "42", "gb_free": "4.8", "wall": "7578"}
[2021-07-31 16:17:26,102][fairseq_cli.train][INFO] - begin validation on "valid" subset
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
...
[2021-07-31 16:17:26,318][unsupervised.tasks.unpaired_audio_text][INFO] - REF: p a ɭ e i nʲ e rʲ iː b u v oː pʲ ɪ r k ɭʲ uː s a n dʲ ee ɭ e i
[2021-07-31 16:17:26,321][unsupervised.tasks.unpaired_audio_text][INFO] - HYP: dʲ e l̩ s d u a b oː t a p a dʲ ee j oː s tʲ i s a ʒ e rʲ ɪ ɡ a r d a
[2021-07-31 16:17:26,324][unsupervised.tasks.unpaired_audio_text][INFO] - LM [REF]: -44.36577224731445, 38.41420644057458
[2021-07-31 16:17:26,326][unsupervised.tasks.unpaired_audio_text][INFO] - LM [HYP]: -61.0944938659668, 81.13400851623487
[2021-07-31 16:17:26,331][valid][INFO] - {"epoch": 613, "valid_loss": "1.054", "valid_ntokens": "771", "valid_nsentences": "16.5", "valid_lm_score_sum": "-4181.57", "valid_num_pred_chars": "1885", "valid_vocab_seen_pct": "0.44", "valid_uer": "105.383", "valid_weighted_lm_ppl": "782.11", "valid_lm_ppl": "151.416", "valid_wps": "1431.4", "valid_wpb": "771", "valid_bsz": "16.5", "valid_num_updates": "19000", "valid_best_weighted_lm_ppl": "668.63"}
[2021-07-31 16:17:26,332][fairseq.checkpoint_utils][INFO] - Preparing to save checkpoint for epoch 613 @ 19000 updates
[2021-07-31 16:17:26,333][fairseq.trainer][INFO] - Saving checkpoint to /home/zygimantas/Downloads/gan_out_large_07_31/checkpoint_613_19000.pt
[2021-07-31 16:17:26,362][fairseq.trainer][INFO] - Finished saving checkpoint to /home/zygimantas/Downloads/gan_out_large_07_31/checkpoint_613_19000.pt
[2021-07-31 16:17:26,478][fairseq.checkpoint_utils][INFO] - Saved checkpoint /home/zygimantas/Downloads/gan_out_large_07_31/checkpoint_613_19000.pt (epoch 613 @ 19000 updates, score 782.1095486248666) (writing took 0.1455865339958109 seconds)
[2021-07-31 16:17:27,639][fairseq_cli.train][INFO] - end of epoch 613 (average epoch stats below)
[2021-07-31 16:17:27,642][train][INFO] - {"epoch": 613, "train_loss": "2.863", "train_ntokens": "153", "train_nsentences": "153", "train_temp": "0.774", "train_code_ppl": "10.737", "train_loss_code_pen": "0.3", "train_loss_smoothness": "1.679", "train_loss_dense_g": "2.936", "train_lm_score_sum": 0.0, "train_num_pred_chars": 0.0, "train_loss_grad_pen": "0.314", "train_loss_dense_d": "0.068", "train_loss_token_d": "0.074", "train_wps": "347", "train_ups": "2.27", "train_wpb": "153", "train_bsz": "153", "train_num_updates": "19003", "train_lr_discriminator": "0.0005", "train_lr_generator": "0.0004", "train_gnorm": "26.898", "train_clip": "96.8", "train_train_wall": "13", "train_gb_free": "5.2", "train_wall": "7580"}
[2021-07-31 16:17:27,673][fairseq.trainer][INFO] - begin training epoch 614
[2021-07-31 16:17:27,674][fairseq_cli.train][INFO] - Start iterating over samples
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
...
[2021-07-31 16:17:40,624][fairseq_cli.train][INFO] - end of epoch 614 (average epoch stats below)
[2021-07-31 16:17:40,626][train][INFO] - {"epoch": 614, "train_loss": "2.676", "train_ntokens": "153", "train_nsentences": "153", "train_temp": "0.773", "train_code_ppl": "10.856", "train_loss_code_pen": "0.315", "train_loss_smoothness": "1.764", "train_loss_dense_g": "2.892", "train_lm_score_sum": 0.0, "train_num_pred_chars": 0.0, "train_loss_grad_pen": "0.473", "train_loss_dense_d": "0.071", "train_loss_token_d": "0.066", "train_wps": "365.4", "train_ups": "2.39", "train_wpb": "153", "train_bsz": "153", "train_num_updates": "19034", "train_lr_discriminator": "0.0005", "train_lr_generator": "0.0004", "train_gnorm": "27.159", "train_clip": "93.5", "train_train_wall": "13", "train_gb_free": "4.8", "train_wall": "7593"}
[2021-07-31 16:17:40,662][fairseq.trainer][INFO] - begin training epoch 615
[2021-07-31 16:17:40,663][fairseq_cli.train][INFO] - Start iterating over samples
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
...
[2021-07-31 16:17:53,370][fairseq_cli.train][INFO] - end of epoch 615 (average epoch stats below)
[2021-07-31 16:17:53,372][train][INFO] - {"epoch": 615, "train_loss": "2.854", "train_ntokens": "153", "train_nsentences": "153", "train_temp": "0.772", "train_code_ppl": "10.729", "train_loss_code_pen": "0.298", "train_loss_smoothness": "1.698", "train_loss_dense_g": "2.966", "train_lm_score_sum": 0.0, "train_num_pred_chars": 0.0, "train_loss_grad_pen": "0.242", "train_loss_dense_d": "0.067", "train_loss_token_d": "0.072", "train_wps": "372.2", "train_ups": "2.43", "train_wpb": "153", "train_bsz": "153", "train_num_updates": "19065", "train_lr_discriminator": "0.0005", "train_lr_generator": "0.0004", "train_gnorm": "24.491", "train_clip": "80.6", "train_train_wall": "12", "train_gb_free": "5.3", "train_wall": "7606"}
```
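One way to keep an eye on the runs is to grep the validation metrics straight out of the training log; this assumes stdout is redirected to a file, which is my own setup rather than something fairseq writes by default:

```bash
# Track the unsupervised model-selection metric and the UER over time.
# The JSON keys are the same ones visible in the log snippet above.
grep -o '"valid_weighted_lm_ppl": "[^"]*"' train.log | tail -n 20
grep -o '"valid_uer": "[^"]*"' train.log | tail -n 20
```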

After training, I use w2vu_generate.py with the 4-gram LM and evaluate the generated transcriptions with wer.py. I am currently stuck figuring out why the training results are so poor. Please help!
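For reference, the generation step follows the pattern from the wav2vec-u README; the sketch below uses the viterbi config and placeholder paths (for 4-gram LM decoding the config name changes to one of the kaldi decoding configs), so double-check it against the README in your checkout:

```bash
# run from examples/wav2vec/unsupervised; paths are placeholders
python w2vu_generate.py --config-dir config/generate --config-name viterbi \
    fairseq.common.user_dir=${FAIRSEQ_ROOT}/examples/wav2vec/unsupervised \
    fairseq.task.data=/home/usr/w2v/preprocessed_audio/precompute_pca512_cls128_mean_pooled \
    fairseq.common_eval.path=/home/usr/w2v/gan_out/checkpoint_best.pt \
    fairseq.dataset.gen_subset=valid \
    results_path=/home/usr/w2v/transcriptions
```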

What's your environment?

  • fairseq Version: master
  • PyTorch Version: 1.9
  • OS: Ubuntu 20.04.2 LTS
  • How you installed fairseq: source
  • Build command you used (if compiling from source): pip install --editable ./
  • Python version: 3.7.10
  • CUDA/cuDNN version: 11.2
  • GPU models and configuration: RTX 2060 Mobile, 16 GB RAM, Intel i7-9750H
  • Any other relevant information: PyKaldi installed using conda; Kaldi and KenLM built from source
lsrami commented 3 years ago

Hello, have you managed to get the wav2vec-u pipeline working? I have completed the entire process in this Docker environment: wav2vec-u-exp, but my WER is currently very high, and I find it difficult to get training to converge in this environment. Do you have any good suggestions?

devanshbatra04 commented 3 years ago

I am having the same issue; please update if you find the solution.

lsrami commented 2 years ago

Hello, have you solved this problem yet? I have the same problem. I think the hyperparameter settings during GAN training have a large impact.

lsrami commented 2 years ago

> I am having the same issue; please update if you find the solution.

Hello, have you solved this problem yet? I have the same problem. I think the hyperparameter settings during GAN training have a large impact.

marcinkusz commented 2 years ago

I have tried verifying the results of the paper by applying my process to the Tatar language but failed to replicate the results, which must mean that there's something wrong with my process. After failing to progress I left the issue alone and haven't touched it for a while now.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!

xiabingquan commented 1 year ago

> I am having the same issue; please update if you find the solution.

> Hello, have you solved this problem yet? I have the same problem. I think the hyperparameter settings during GAN training have a large impact.

True. The hyper-param setting seems to be very tricky 😒