flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Fine tune model using fork command #797

Closed akhaled89 closed 3 years ago

akhaled89 commented 4 years ago

Question

I'm trying to fine-tune a seq2seq model using the fork command, and I get this error message: "target contains elements out of valid range [0, num_categories) in categorical cross entropy".

Additional Context

Here is the recipe file:

--datadir=/home/akhaled/w2l
--runname=seq2seq_tds
--rundir=/home/akhaled/w2l
--archdir=/root/model
--tokensdir=/home/akhaled/w2l/lm
--arch=am.arch
--train=lists/train-clean-100.lst
--valid=lists/dev-clean.lst
--lexicon=/home/akhaled/w2l/lm/librispeech-train+dev-unigram-10000-nbest10.lexicon
--tokens=librispeech-train-all-unigram-10000.tokens
--criterion=ctc
--lr=0.05
--lrcrit=0.05
--momentum=0.0
--stepsize=40
--gamma=0.5
--maxgradnorm=15
--mfsc=true
--dataorder=outputspiral
--inputbinsize=25
--filterbanks=80
--attention=keyvalue
--encoderdim=512
--attnWindow=softPretrain
--softwstd=4
--trainWithWindow=true
--pretrainWindow=3
--maxdecoderoutputlen=120
--usewordpiece=true
--wordseparator=
--sampletarget=0.01
--target=ltr
--batchsize=16
--labelsmooth=0.05
--nthread=6
--memstepsize=4194304
--eostoken=true
--pcttraineval=1
--pctteacherforcing=99
--iter=5000

and here are the logs (the flag dump at the start is truncated):

0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=4; --sqnorm=false; --stepsize=40; --surround=; --tag=; --target=ltr; --test=; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/home/akhaled/w2l/lm; --train=lists/train-clean-100.lst; --trainWithWindow=true; --transdiag=0; --unkscore=-inf; --usememcache=false; --uselexicon=true; --usewordpiece=true; --valid=lists/dev-clean.lst; --warmup=1; --weightdecay=0; --wordscore=1; --wordseparator=; --world_rank=0; --world_size=1; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=5; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0826 08:01:04.997812 383 Train.cpp:152] Experiment path: /home/akhaled/w2l/seq2seq_tds
I0826 08:01:04.997829 383 Train.cpp:153] Experiment runidx: 1
I0826 08:01:05.003846 383 Train.cpp:199] Number of classes (network): 9999
I0826 08:01:05.905391 383 Train.cpp:206] Number of words: 89612
I0826 08:01:06.391042 383 Train.cpp:252] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> output]
(0): View (-1 80 1 0)
(1): Conv2D (1->10, 5x21, 2,1, SAME,SAME, 1, 1) (with bias)
(2): ReLU
(3): Dropout (0.200000)
(4): LayerNorm ( axis : { 0 1 2 } , size : -1)
(5): Time-Depth Separable Block (21, 80, 10) [800 -> 800 -> 800]
(6): Time-Depth Separable Block (21, 80, 10) [800 -> 800 -> 800]
(7): Conv2D (10->14, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(8): ReLU
(9): Dropout (0.200000)
(10): LayerNorm ( axis : { 0 1 2 } , size : -1)
(11): Time-Depth Separable Block (21, 80, 14) [1120 -> 1120 -> 1120]
(12): Time-Depth Separable Block (21, 80, 14) [1120 -> 1120 -> 1120]
(13): Time-Depth Separable Block (21, 80, 14) [1120 -> 1120 -> 1120]
(14): Conv2D (14->18, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(15): ReLU
(16): Dropout (0.200000)
(17): LayerNorm ( axis : { 0 1 2 } , size : -1)
(18): Time-Depth Separable Block (21, 80, 18) [1440 -> 1440 -> 1440]
(19): Time-Depth Separable Block (21, 80, 18) [1440 -> 1440 -> 1440]
(20): Time-Depth Separable Block (21, 80, 18) [1440 -> 1440 -> 1440]
(21): Time-Depth Separable Block (21, 80, 18) [1440 -> 1440 -> 1440]
(22): Time-Depth Separable Block (21, 80, 18) [1440 -> 1440 -> 1440]
(23): Time-Depth Separable Block (21, 80, 18) [1440 -> 1440 -> 1440]
(24): View (0 1440 1 0)
(25): Reorder (1,0,3,2)
(26): Linear (1440->1024) (with bias)
I0826 08:01:06.391160 383 Train.cpp:253] [Network Params: 36539300]
I0826 08:01:06.391193 383 Train.cpp:254] [Criterion] Seq2SeqCriterion
I0826 08:01:06.391221 383 Train.cpp:262] [Network Optimizer] SGD
I0826 08:01:06.391245 383 Train.cpp:263] [Criterion Optimizer] SGD
I0826 08:01:07.078770 383 W2lListFilesDataset.cpp:141] 28539 files found.
I0826 08:01:07.079272 383 Utils.cpp:102] Filtered 0/28539 samples
I0826 08:01:07.081393 383 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 1784
I0826 08:01:07.357285 383 W2lListFilesDataset.cpp:141] 2703 files found.
I0826 08:01:07.357358 383 Utils.cpp:102] Filtered 0/2703 samples
I0826 08:01:07.357625 383 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 169
I0826 08:01:07.358604 383 Train.cpp:564] Shuffling trainset
I0826 08:01:07.359069 383 Train.cpp:571] Epoch 1 started!
terminate called after throwing an instance of 'std::invalid_argument'
  what():  target contains elements out of valid range [0, num_categories) in categorical cross entropy
Aborted at 1598428873 (unix time) try "date -d @1598428873" if you are using GNU date
PC: @ 0x7f224dca4e97 gsignal
SIGABRT (@0x17f) received by PID 383 (TID 0x7f2293574380) from PID 383; stack trace:
    @ 0x7f228b88a890 (unknown)
    @ 0x7f224dca4e97 gsignal
    @ 0x7f224dca6801 abort
    @ 0x7f224e699957 (unknown)
    @ 0x7f224e69fab6 (unknown)
    @ 0x7f224e69faf1 std::terminate()
    @ 0x7f224e69fd24 __cxa_throw
    @ 0x5575941e115e fl::categoricalCrossEntropy()
    @ 0x5575940c671e w2l::Seq2SeqCriterion::forward()
    @ 0x557593f4a8c0 _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddblE3_clES2_S5_S7_S9_S9_ddbl
    @ 0x557593ee1ac0 main
    @ 0x7f224dc87b97 __libc_start_main
    @ 0x557593f449ca _start

Could you please advise?

tlikhomanenko commented 4 years ago

In the config you are setting --criterion=ctc; please remove it (some internal functions reuse this flag, and with the wrong setting the eostoken will not be added to the token dictionary).
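For reference, a minimal sketch of what the forked fine-tuning run could look like, assuming the `Train fork` usage described in the wiki; the paths are placeholders, and the hypothetical finetune.cfg is simply the recipe above with the --criterion=ctc line removed:

```sh
# Hypothetical paths: fork a new run from an existing pretrained seq2seq model.
# finetune.cfg is the flags file above without the "--criterion=ctc" line, so
# the criterion/eostoken handling matches the pretrained seq2seq setup.
/path/to/wav2letter/build/Train fork /path/to/pretrained_seq2seq_model.bin \
  --flagsfile=/home/akhaled/w2l/finetune.cfg
```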

akhaled89 commented 4 years ago

I'm wondering if I should split the audio into small pieces of about 15 s, or whether I could fine-tune without this step, as my videos are 3-10 minutes long. Kindly advise.

tlikhomanenko commented 4 years ago

I would split; here you need a VAD model to split the data properly. This only depends on what GPU memory you have =) Also, for transformers we don't restrict attention for now, so the model can attend to any input position, and it could be harder to train properly on very long inputs. As far as I know, people train on small chunks (I never saw training on 10-minute audio).
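As a rough illustration only (this is not the wav2letter VAD tool mentioned below, and the filenames are placeholders), a naive fixed-length split into 15-second chunks can be sketched with ffmpeg; a VAD/alignment step is still needed to produce per-chunk transcripts:

```sh
# Hypothetical sketch: blindly cut lecture.wav into 15 s chunks with ffmpeg.
# This ignores voice activity, so chunks may cut words in half; per-chunk
# transcripts still have to come from a VAD/alignment step.
mkdir -p chunks
ffmpeg -i lecture.wav -f segment -segment_time 15 -c copy chunks/lecture_%04d.wav
```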

akhaled89 commented 4 years ago

Regarding the dataset to fine-tune the model: what is the recommended size? I mean, will 2 or 10 hours be enough to fine-tune the model, or do I need to build a dataset with, say, 100 hours or more? I also have another question about dividing the videos into small chunks of 10-15 seconds: do you recommend a tool or solution to do that for both the video and the transcript, given that the videos are 10-20 minutes long?

tlikhomanenko commented 4 years ago

Regarding training hours - this depends on a lot of factors and is still open research (what model, what language, what lexicon intersection, what tokens). I would start with 2 hours, then 10, checking what is enough for your model and the results you need to get.

About video dividing: we have a tool, and a released English model that can be used with it; details here: https://github.com/facebookresearch/wav2letter/tree/v0.2/tools#voice-activity-detection-with-ctc--an-n-gram-language-model. You can also have a look at the LibriLight dataset preparation, where we did 36 s audio splits: https://github.com/facebookresearch/libri-light.
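Once the audio is split, each chunk and its transcript become one line of the list files passed via --train/--valid. A hedged sketch, assuming the documented wav2letter list format (sample id, audio path, duration, transcription) with made-up ids, paths, and durations:

```sh
# finetune-train.lst (hypothetical): one chunk per line in the form
# <sample_id> <audio_path> <duration (assumed here to be in milliseconds)> <lowercase transcription>
chunk_0000 /home/akhaled/w2l/data/chunks/lecture_0000.wav 15000.0 welcome everyone to the first lecture
chunk_0001 /home/akhaled/w2l/data/chunks/lecture_0001.wav 15000.0 today we will talk about speech recognition
```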