facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

wav2vec and vq-wav2vec parameter #1885

Closed zelabean closed 2 years ago

zelabean commented 4 years ago

I have been training wav2vec and vq-wav2vec on wav files in another language for about a month, and I have tried many different hyperparameter settings.

However, performance is still bad, even in the LibriSpeech 960h case.

Has anyone ever gotten good results by training on their own dataset?

If the problem is with my hyperparameters, please share the correct parameters.

huihuifan commented 4 years ago

@alexeib

alexeib commented 4 years ago

what does "bad performance" mean? what does the loss look like during pretraining? what is your dataset? what format are the audio files? what parameters did you use for training, incl. the number of gpus?

zelabean commented 4 years ago

Dear alexeib, many thanks for responding to my questions.

1. What does "bad performance" mean? I am evaluating the performance of wav2vec features on a Korean ASR task (wav2letter); see the feature-extraction sketch at the end of this comment. When I use the LibriSpeech-trained wav2vec large model for ASR, performance improves, but when I use the wav2vec model trained on my Korean dataset, performance gets worse.

2. What does the loss look like during pretraining? During pre-training the loss looks good: it goes down to 0.14x or 0.15x with fairseq 0.9.0. However, when I install from source, the loss is around 2.xx, and the minimum vq-wav2vec loss is 4.xx.

3. What is your dataset? My pretraining dataset is 'aihub', an official Korean dataset (http://www.aihub.or.kr/aidata/105): 1000 h of data, utterances 02~35 seconds long, 16 kHz sampling rate, on the topic of everyday conversation. My ASR dataset is 'zeroth', an open Korean dataset (http://www.openslr.org/40/): 51.6 h, on the topic of news scripts.

4. What format are the audio files? The original aihub format is headerless (little-endian) linear PCM; to use it, I convert the files to wav with sox. The original zeroth format is wav.
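For reference, here is a minimal sketch of the same raw-PCM-to-WAV conversion using only the Python standard library (the file names are placeholders, and it assumes mono 16-bit little-endian samples at 16 kHz, as described above):

```python
import wave

SRC = "sample.pcm"  # placeholder: headerless little-endian 16-bit PCM, 16 kHz, mono
DST = "sample.wav"  # placeholder: output WAV path

# Read the raw sample bytes.
with open(SRC, "rb") as f:
    pcm_bytes = f.read()

# Wrap them in a standard WAV container without modifying the samples.
with wave.open(DST, "wb") as w:
    w.setnchannels(1)       # mono
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)   # 16 kHz sampling rate
    w.writeframes(pcm_bytes)
```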

5. What parameters did you use for training, including the number of GPUs? I am using 8 x V100 GPUs and have tried various parameter settings, but the results are similar. On fairseq 0.9.0:

--save-dir './pretrained' --num-workers 4 --fp16 \
  --max-update 400000 --save-interval 1 --no-epoch-checkpoints \
  --arch wav2vec --task audio_pretraining \
  --lr 1e-06 --min-lr 1e-09 --optimizer adam --max-lr 0.005 --lr-scheduler cosine \
  --conv-feature-layers '[(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1)]' \
  --conv-aggregator-layers '[(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)]' \
  --skip-connections-agg --residual-scale 0.5 --log-compression \
  --warmup-updates 500 --warmup-init-lr 1e-07 --criterion binary_cross_entropy \
  --num-negatives 10 --max-sample-size 150000 --max-tokens 1500000

and

--save-dir './pretrained' --num-workers 4 --fp16 \
  --max-update 400000 --save-interval 1 --no-epoch-checkpoints \
  --arch wav2vec --task audio_pretraining \
  --lr 0.5e-06 --min-lr 0.5e-09 --optimizer adam --max-lr 0.0025 --lr-scheduler cosine \
  --conv-feature-layers '[(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1)]' \
  --conv-aggregator-layers '[(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)]' \
  --skip-connections-agg --residual-scale 0.5 --log-compression \
  --warmup-updates 500 --warmup-init-lr 0.5e-07 --criterion binary_cross_entropy \
  --num-negatives 10 --max-sample-size 150000 --max-tokens 1500000

When fairseq is installed from source, I cannot train with --max-tokens 1500000: the initial losses look odd (around 1.35e-13), so I use --max-tokens 600000 instead.

result:
| epoch 076 | loss 0.156 | ppl 1.11 | wps 1.70274e+07 | ups 2 | wpb 7421468.302 | bsz 7421468.302 | num_updates 395348 | lr 2.67232e-06 | gnorm 0.019 | clip 0.000 | oom 0.000 | loss_scale 0.031 | wall 173449 | train_wall 170969
| epoch 076 | valid on 'valid' subset | loss 0.158 | ppl 1.12 | num_updates 395348 | best_loss 0.157551
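For context, a rough back-of-envelope check of what these --max-tokens values mean in audio time, assuming --max-tokens counts raw 16 kHz waveform samples per GPU per update in the audio_pretraining task:

```python
SAMPLE_RATE = 16_000  # Hz, matching the pretraining data

# Assumption: one "token" is one raw waveform sample for audio pretraining.
for max_tokens in (1_500_000, 600_000):
    seconds = max_tokens / SAMPLE_RATE
    print(f"--max-tokens {max_tokens}: about {seconds:.1f} s of audio per GPU per update")

# With 8 GPUs and --update-freq 1, each update sees roughly 8x that amount of audio.
```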

vq-wav2vec also cannot train with --max-tokens 1500000:

--num-workers 6 --max-update 400000 --fp16 --save-interval 1 --no-epoch-checkpoints \
  --arch wav2vec --task audio_pretraining \
  --lr 1e-9 --min-lr 1e-20 --optimizer adam --max-lr 1e-7 --lr-scheduler cosine \
  --conv-feature-layers '[(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1), (512, 1, 1)]' \
  --conv-aggregator-layers '[(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)]' \
  --activation gelu --offset auto --skip-connections-agg --residual-scale 0.25 \
  --log-keys '["prob_perplexity","code_perplexity","temp"]' \
  --vq-type kmeans --loss-weights '[1]' --vq-groups 2 --vq-depth 1 --combine-groups --vq-vars 320 \
  --prediction-steps 12 --warmup-updates 500 --warmup-init-lr 1e-10 --criterion binary_cross_entropy \
  --num-negatives 10 --max-sample-size 150000 --max-tokens 600000 --cross-sample-negatives 0 --update-freq 1 --seed 36

Two results from these experiments. With --lr 1e-8 --min-lr 1e-20 --optimizer adam --max-lr 1e-6:

2020-04-14 10:40:16 | INFO | train | epoch 034 | loss 6.33063 | code_perplexity 194.246 | loss_0 0.548508 | loss_1 0.261516 | wps 931646 | ups 3.01 | wpb 309074 | bsz 3.39963e+06 | num_updates 382241 | lr 1.48192e-08 | gnorm 2.296 | clip 0 | oom 0 | loss_scale 0 | train_wall 3699 | wall 127813
2020-04-14 10:40:36 | INFO | valid | epoch 034 | valid on 'valid' subset | loss 6.33526 | code_perplexity 194.309 | loss_0 0.562742 | loss_1 0.26766 | wps 2.77798e+06 | wpb 306856 | bsz 3.37523e+06 | num_updates 382241 | best_loss 4.71785

With --lr 1e-9 --min-lr 1e-20 --optimizer adam --max-lr 1e-7:

2020-04-14 10:40:43 | INFO | train | epoch 034 | loss 4.72082 | code_perplexity 141.009 | loss_0 0.409028 | loss_1 0.082654 | wps 931955 | ups 3.02 | wpb 309064 | bsz 3.39951e+06 | num_updates 382240 | lr 1.48197e-09 | gnorm 0.601 | clip 0 | oom 0 | loss_scale 0 | train_wall 3698 | wall 127846
2020-04-14 10:41:03 | INFO | valid | epoch 034 | valid on 'valid' subset | loss 4.72126 | code_perplexity 141.279 | loss_0 0.419461 | loss_1 0.0843348 | wps 2.77846e+06 | wpb 306856 | bsz 3.37523e+06 | num_updates 382240 | best_loss 4.70437
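As mentioned in point 1, I feed wav2vec features to the wav2letter acoustic model. Here is a minimal sketch of how the pretrained checkpoint is loaded and features are extracted, roughly following the usage shown in the fairseq wav2vec example README (the checkpoint path and the random input waveform are placeholders):

```python
import torch
from fairseq.models.wav2vec import Wav2VecModel

# Load the pretrained wav2vec checkpoint (path is a placeholder).
cp = torch.load("checkpoint_best.pt", map_location="cpu")
model = Wav2VecModel.build_model(cp["args"], task=None)
model.load_state_dict(cp["model"])
model.eval()

# Dummy 16 kHz waveform with shape (batch, samples).
wav_input_16khz = torch.randn(1, 10000)
with torch.no_grad():
    z = model.feature_extractor(wav_input_16khz)  # latent representations
    c = model.feature_aggregator(z)               # context representations fed to the ASR model
```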

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!