Closed zelabean closed 2 years ago
@alexeib
What does "bad performance" mean? What does the loss look like during pretraining? What is your dataset? What format are the audio files? What parameters did you use for training, including the number of GPUs?
Dear alexeib, many thanks for responding to my question.
1. What does "bad performance" mean? I'm evaluating wav2vec on a Korean ASR task (wav2letter). Using the Librispeech-large wav2vec model for ASR, performance improved. But using a model pretrained on the Korean dataset, performance got worse.
2. What does the loss look like during pretraining? During pretraining the loss looks fine: it goes down to 0.14x or 0.15x on fairseq 0.9.0, but when installed from source the loss stays around 2.xx. The vq-wav2vec loss minimum is 4.xx.
3. What is your dataset? My pretraining dataset is 'aihub', an official Korean dataset (http://www.aihub.or.kr/aidata/105): 1000 h of data, clips 2~35 sec long, 16 kHz sample rate, topic 'everyday conversation'. My ASR dataset is 'zeroth', a Korean open dataset (http://www.openslr.org/40/): 51.6 h, topic 'news scripts'.
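For context, fairseq's audio_pretraining task reads a tsv manifest: the root directory on the first line, then one relative-path / frame-count pair per line (this is the format produced by examples/wav2vec/wav2vec_manifest.py in the fairseq repo). A minimal stdlib sketch of building that format from a directory of wav files (the helper name is my own):

```python
import os
import wave

def build_manifest(wav_root, out_path):
    """Write a fairseq-style tsv manifest: first line is the root dir,
    then one `relative_path<TAB>num_frames` line per wav file found."""
    lines = [wav_root]
    for dirpath, _, filenames in os.walk(wav_root):
        for name in sorted(filenames):
            if not name.endswith(".wav"):
                continue
            path = os.path.join(dirpath, name)
            with wave.open(path, "rb") as w:
                frames = w.getnframes()  # length in samples, matching the tsv format
            lines.append(f"{os.path.relpath(path, wav_root)}\t{frames}")
    with open(out_path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

In practice the bundled wav2vec_manifest.py script also handles a train/valid split, so this is only a sketch of the file format itself.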
4. What format are the audio files? The original aihub format is headerless (little-endian) linear PCM; to use it, I convert the files to wav with sox. The zeroth files are already wav.
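As an alternative to sox, headerless PCM can be wrapped in a WAV container with Python's stdlib wave module. A minimal sketch, assuming 16-bit mono 16 kHz little-endian data (the function name and defaults are my own assumptions; adjust them to match the raw format):

```python
import wave

def raw_pcm_to_wav(raw_path, wav_path, sample_rate=16000, channels=1, sample_width=2):
    """Wrap headerless little-endian linear PCM in a WAV container.
    Assumes 16-bit mono 16 kHz data by default; WAV itself stores
    little-endian PCM, so the bytes are copied through unchanged."""
    with open(raw_path, "rb") as f:
        pcm = f.read()
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)   # bytes per sample (2 = 16-bit)
        w.setframerate(sample_rate)
        w.writeframes(pcm)
```

The equivalent sox invocation should be along the lines of `sox -t raw -r 16000 -e signed -b 16 -c 1 -L input.raw output.wav` (flags assumed for 16-bit mono little-endian input).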
5. What parameters did you use for training, including the number of GPUs? I'm using 8 x V100 GPUs and have tried various parameter sets, but the results are similar. On fairseq 0.9.0:
--save-dir './pretrained' --num-workers 4 --fp16 --max-update 400000 --save-interval 1 --no-epoch-checkpoints --arch wav2vec --task audio_pretraining --lr 1e-06 --min-lr 1e-09 --optimizer adam --max-lr 0.005 --lr-scheduler cosine --conv-feature-layers '[(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1)]' --conv-aggregator-layers '[(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)]' --skip-connections-agg --residual-scale 0.5 --log-compression --warmup-updates 500 --warmup-init-lr 1e-07 --criterion binary_cross_entropy --num-negatives 10 --max-sample-size 150000 --max-tokens 1500000
and
--save-dir './pretrained' --num-workers 4 --fp16 --max-update 400000 --save-interval 1 --no-epoch-checkpoints --arch wav2vec --task audio_pretraining --lr 0.5e-06 --min-lr 0.5e-09 --optimizer adam --max-lr 0.0025 --lr-scheduler cosine --conv-feature-layers '[(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1)]' --conv-aggregator-layers '[(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)]' --skip-connections-agg --residual-scale 0.5 --log-compression --warmup-updates 500 --warmup-init-lr 0.5e-07 --criterion binary_cross_entropy --num-negatives 10 --max-sample-size 150000 --max-tokens 1500000
When installed from source, training fails with --max-tokens 1500000: the initial losses are odd (around 1.35e-13), so I use --max-tokens 600000 instead.
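For reference, --max-tokens in the audio pretraining task is counted in raw waveform samples (my understanding), so at 16 kHz the two settings correspond to these per-batch audio budgets:

```python
# --max-tokens counts waveform samples for audio pretraining (assumed),
# so the per-batch audio budget at 16 kHz is tokens / sample_rate seconds.
SAMPLE_RATE = 16000

def batch_seconds(max_tokens, sample_rate=SAMPLE_RATE):
    return max_tokens / sample_rate

for max_tokens in (1_500_000, 600_000):
    print(f"--max-tokens {max_tokens} ~= {batch_seconds(max_tokens):.2f} s of audio per batch")
```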
result:
| epoch 076 | loss 0.156 | ppl 1.11 | wps 1.70274e+07 | ups 2 | wpb 7421468.302 | bsz 7421468.302 | num_updates 395348 | lr 2.67232e-06 | gnorm 0.019 | clip 0.000 | oom 0.000 | loss_scale 0.031 | wall 173449 | train_wall 170969
| epoch 076 | valid on 'valid' subset | loss 0.158 | ppl 1.12 | num_updates 395348 | best_loss 0.157551
The same applies to vq-wav2vec: it also can't train with --max-tokens 1500000.
--num-workers 6 --max-update 400000 --fp16 --save-interval 1 --no-epoch-checkpoints --arch wav2vec --task audio_pretraining --lr 1e-9 --min-lr 1e-20 --optimizer adam --max-lr 1e-7 --lr-scheduler cosine --conv-feature-layers '[(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1), (512, 1, 1)]' --conv-aggregator-layers '[(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)]' --activation gelu --offset auto --skip-connections-agg --residual-scale 0.25 --log-keys '["prob_perplexity","code_perplexity","temp"]' --vq-type kmeans --loss-weights '[1]' --vq-groups 2 --vq-depth 1 --combine-groups --vq-vars 320 --prediction-steps 12 --warmup-updates 500 --warmup-init-lr 1e-10 --criterion binary_cross_entropy --num-negatives 10 --max-sample-size 150000 --max-tokens 600000 --cross-sample-negatives 0 --update-freq 1 --seed 36
Two results from these experiments. With --lr 1e-8 --min-lr 1e-20 --optimizer adam --max-lr 1e-6:
2020-04-14 10:40:16 | INFO | train | epoch 034 | loss 6.33063 | code_perplexity 194.246 | loss_0 0.548508 | loss_1 0.261516 | wps 931646 | ups 3.01 | wpb 309074 | bsz 3.39963e+06 | num_updates 382241 | lr 1.48192e-08 | gnorm 2.296 | clip 0 | oom 0 | loss_scale 0 | train_wall 3699 | wall 127813
2020-04-14 10:40:36 | INFO | valid | epoch 034 | valid on 'valid' subset | loss 6.33526 | code_perplexity 194.309 | loss_0 0.562742 | loss_1 0.26766 | wps 2.77798e+06 | wpb 306856 | bsz 3.37523e+06 | num_updates 382241 | best_loss 4.71785
With --lr 1e-9 --min-lr 1e-20 --optimizer adam --max-lr 1e-7:
2020-04-14 10:40:43 | INFO | train | epoch 034 | loss 4.72082 | code_perplexity 141.009 | loss_0 0.409028 | loss_1 0.082654 | wps 931955 | ups 3.02 | wpb 309064 | bsz 3.39951e+06 | num_updates 382240 | lr 1.48197e-09 | gnorm 0.601 | clip 0 | oom 0 | loss_scale 0 | train_wall 3698 | wall 127846
2020-04-14 10:41:03 | INFO | valid | epoch 034 | valid on 'valid' subset | loss 4.72126 | code_perplexity 141.279 | loss_0 0.419461 | loss_1 0.0843348 | wps 2.77846e+06 | wpb 306856 | bsz 3.37523e+06 | num_updates 382240 | best_loss 4.70437
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!
I have trained wav2vec and vq-wav2vec on another language's wav files for about a month and tried very many hyperparameter combinations,
but performance is still bad, even in the case of LibriSpeech 960h.
Has anyone gotten a good result by training on their own dataset?
If this is a hyperparameter problem, please share the correct parameters.