flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

error in training CPU & CUDA #313

Closed: DemonPrince closed this issue 5 years ago.

DemonPrince commented 5 years ago

Trying to run training on the CPU, but I get this error:

Aborted at 1559152562 (unix time) try "date -d @1559152562" if you are using GNU date
PC: @ 0x7f451213a6f9 mkldnn::impl::get_msec()
SIGILL (@0x7f451213a6f9) received by PID 3764 (TID 0x7f451572abc0) from PID 303277817; stack trace:
    @ 0x7f450b901390 (unknown)
    @ 0x7f451213a6f9 mkldnn::impl::get_msec()
    @ 0x7f45121b894f mkldnn::impl::cpu::gemm_convolution_fwd_t::pd_t::create_primitive()
    @ 0x672b3d fl::conv2d()
    @ 0x652a66 fl::Conv2D::forward()
    @ 0x65e9df fl::UnaryModule::forward()
    @ 0x651a32 fl::Sequential::forward()
    @ 0x4725ae _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddbiE3_clES2_S5_S7_S9_S9_ddbi.constprop.10729
    @ 0x419bb0 main
    @ 0x7f450a7fb830 __libc_start_main
    @ 0x46dd19 _start
    @ 0x0 (unknown)
Illegal instruction (core dumped)
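A SIGILL inside mkldnn::impl::get_msec() usually means the MKL/MKL-DNN build in use was compiled for an instruction-set extension (e.g. AVX2 or AVX-512) that the host CPU, or the CPU model exposed to the container, does not support. A quick hedged sanity check, assuming a Linux host with /proc mounted:

```shell
# Print which SIMD extensions the kernel reports for this CPU.
# If the crashing binary's MKL build targets an extension absent
# from this list, an illegal-instruction crash is expected.
grep -o 'sse4_2\|avx512[a-z]*\|avx2\|avx' /proc/cpuinfo | sort -u
```

Comparing this list against the MKL variant you installed is often enough to explain the crash.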

And on the GPU:

terminate called after throwing an instance of 'af::exception'
  what(): ArrayFire Exception (Device out of memory:101):
In function void* cuda::MemoryManager::nativeAlloc(size_t)
In file src/backend/cuda/memory.cpp:191
CUDA Error (2): out of memory

In function af::array af::moddims(const af::array&, unsigned int, const dim_t*)
In file src/api/cpp/data.cpp:198
Aborted at 1559157329 (unix time) try "date -d @1559157329" if you are using GNU date
PC: @ 0x7f634ac62428 gsignal
SIGABRT (@0x1817) received by PID 6167 (TID 0x7f63aa902780) from PID 6167; stack trace:
    @ 0x7f638ea00390 (unknown)
    @ 0x7f634ac62428 gsignal
    @ 0x7f634ac6402a abort
    @ 0x7f634b5a584d __gnu_cxx::__verbose_terminate_handler()
    @ 0x7f634b5a36b6 (unknown)
    @ 0x7f634b5a3701 std::terminate()
    @ 0x7f634b5a3919 __cxa_throw
    @ 0x7f63528747c3 af::moddims()
    @ 0x7f6352874869 af::moddims()
    @ 0x608b82 fl::linear()
    @ 0x62967c fl::Linear::forward()
    @ 0x62c8af fl::UnaryModule::forward()
    @ 0x61eb82 fl::Sequential::forward()
    @ 0x46973b _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddbiE3_clES2_S5_S7_S9_S9_ddbi.constprop.9786
    @ 0x419a23 main
    @ 0x7f634ac4d830 __libc_start_main
    @ 0x465619 _start
    @ 0x0 (unknown)
Aborted (core dumped)
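The GPU failure is a plain device out-of-memory during the forward pass (ArrayFire error 101 / CUDA error 2). The usual first lever is to lower the batch size; the flag dump later in this thread shows this run used --batchsize=4. A minimal sketch of the change, assuming your train.cfg follows the tutorial layout (the value 2 is an arbitrary smaller example, not a recommendation):

```
# train.cfg fragment — smaller batches lower peak GPU memory usage.
--batchsize=2
```

If even small batches fail, check with nvidia-smi whether another process is already holding GPU memory.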

jacobkahn commented 5 years ago

@DemonPrince:

tbfly commented 5 years ago

Try:

https://github.com/facebookresearch/wav2letter/issues/215

I solved it by running 'apt-get install intel-mkl-64bit-2018.4-057' instead of installing the newest MKL version from the Intel website.
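If you are unsure which MKL ended up installed after mixing the Intel website installer with apt packages, a quick hedged check on Debian/Ubuntu systems (prints a notice if dpkg knows of none):

```shell
# List the Intel MKL packages dpkg knows about, to confirm the pinned
# 2018.4 build, rather than a newer one, is what is actually installed.
dpkg -l 2>/dev/null | grep -i 'intel-mkl' || echo "no intel-mkl package found via dpkg"
```

Note that an MKL installed from the Intel website's tarball installer will not appear here; only apt-installed packages do.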

DemonPrince commented 5 years ago

Hi, I'm using Ubuntu 16.04 in Docker. I ran two tests; building ArrayFire gives:

C compiler identification is GNU 5.4.0
Configuring incomplete, errors occurred.

I see the problem is CBLAS, although it is already installed. I'll keep trying. Thanks.

jacobkahn commented 5 years ago

@DemonPrince — you can also try using our Docker images (docs here) rather than using an Ubuntu image in Docker. That'll let you sidestep building all of the dependencies.

DemonPrince commented 5 years ago

Hi, thanks to @tbfly and @jacobkahn. I tried the CPU option again and got the same result; the Docker container itself runs fine. Details of the run:

root@52c9a6946121:~/wav2letter/build# ./Train train --flagsfile ../tutorials/1-librispeech_clean/train.cfg
I0619 17:40:44.467330 3032 Train.cpp:136] Gflags after parsing --flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=; --arch=network.arch; --archdir=/root/wav2letter/tutorials/1-librispeech_clean/; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=4; --beamsize=2500; --beamthreshold=25; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=/root/; --dataorder=input; --decodertype=wrd; --devwin=0; --emission_dir=; --enable_distributed=false; --encoderdim=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=40; --flagsfile=../tutorials/1-librispeech_clean/train.cfg; --gamma=1; --garbage=false; --gumbeltemperature=1; --hardselection=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --iter=100; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=; --linlr=-1; --linlrcrit=-1; --linseg=0; --listdata=false; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=1; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lr=0.10000000000000001; --lrcrit=0; --maxdecoderoutputlen=200; --maxgradnorm=1; --maxisz=9223372036854775807; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0; --netoptim=sgd; --noresample=false; --nthread=4; --nthread_decoder=1; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=100; --pow=false; --pretrainWindow=0; --replabel=2; --reportiters=0; --rightWindowSize=50; --rndv_filepath=; --rundir=/root/; --runname=librispeech_clean_trainlogs; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --sclite=; --seed=0; --show=false; --showletters=false; --silweight=0; --smearing=none; --smoothingtemperature=1; --softselection=inf; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=|; --tag=; --target=tkn; --test=; --tokens=data/tokens.txt; --tokensdir=/root/; --train=data/train-clean-100; --trainWithWindow=false; --transdiag=0; --unkweight=-inf; --usewordpiece=false; --valid=data/dev-clean; --weightdecay=0; --wordscore=1; --wordseparator=|; --world_rank=0; --world_size=1; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0619 17:40:44.467406 3032 Train.cpp:137] Experiment path: /root/librispeech_clean_trainlogs
I0619 17:40:44.467416 3032 Train.cpp:138] Experiment runidx: 1
I0619 17:40:44.469054 3032 Train.cpp:166] Number of classes (network): 31
I0619 17:40:44.469152 3032 Train.cpp:187] Loading architecture file from /root/wav2letter/tutorials/1-librispeech_clean/network.arch
I0619 17:40:44.630853 3032 Train.cpp:208] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> output]
    (0): View (-1 1 40 0)
    (1): Conv2D (40->256, 8x1, 2,1, SAME,SAME, 1, 1) (with bias)
    (2): ReLU
    (3): Conv2D (256->256, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
    (4): ReLU
    (5): Conv2D (256->256, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
    (6): ReLU
    (7): Conv2D (256->256, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
    (8): ReLU
    (9): Conv2D (256->256, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
    (10): ReLU
    (11): Conv2D (256->256, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
    (12): ReLU
    (13): Conv2D (256->256, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
    (14): ReLU
    (15): Conv2D (256->256, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
    (16): ReLU
    (17): Reorder (2,0,3,1)
    (18): Linear (256->512) (with bias)
    (19): ReLU
    (20): Linear (512->31) (with bias)
I0619 17:40:44.630913 3032 Train.cpp:209] [Network Params: 3901471]
I0619 17:40:44.630920 3032 Train.cpp:210] [Criterion] ConnectionistTemporalClassificationCriterion
I0619 17:40:44.630931 3032 Train.cpp:218] [Network Optimizer] SGD
I0619 17:40:44.630936 3032 Train.cpp:219] [Criterion Optimizer] SGD
I0619 17:40:44.631451 3032 NumberedFilesLoader.cpp:29] Adding dataset /root/data/train-clean-100 ...
I0619 17:40:44.674011 3032 NumberedFilesLoader.cpp:68] 28539 files found.
I0619 17:43:53.451750 3032 Utils.cpp:102] Filtered 0/28539 samples
I0619 17:43:53.456358 3032 W2lNumberedFilesDataset.cpp:57] Total batches (i.e. iters): 7135
I0619 17:43:53.456569 3032 NumberedFilesLoader.cpp:29] Adding dataset /root/data/dev-clean ...
I0619 17:43:53.467399 3032 NumberedFilesLoader.cpp:68] 2703 files found.
I0619 17:44:01.875672 3032 Utils.cpp:102] Filtered 0/2703 samples
I0619 17:44:01.875967 3032 W2lNumberedFilesDataset.cpp:57] Total batches (i.e. iters): 676
I0619 17:44:01.921031 3032 Train.cpp:493] Shuffling trainset
I0619 17:44:01.922049 3032 Train.cpp:500] Epoch 1 started!
*** Aborted at 1560966242 (unix time) try "date -d @1560966242" if you are using GNU date ***
PC: @ 0x7f5b6fa116f9 mkldnn::impl::get_msec()
*** SIGILL (@0x7f5b6fa116f9) received by PID 3032 (TID 0x7f5b732dfbc0) from PID 1872828153; stack trace: ***
    @ 0x7f5b691d8390 (unknown)
    @ 0x7f5b6fa116f9 mkldnn::impl::get_msec()
    @ 0x7f5b6fa8f94f mkldnn::impl::cpu::gemm_convolution_fwd_t::pd_t::create_primitive()
    @ 0x6a08dd fl::conv2d()
    @ 0x67d8f6 fl::Conv2D::forward()
    @ 0x68b45f fl::UnaryModule::forward()
    @ 0x67c8c2 fl::Sequential::forward()
    @ 0x47c25f _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddbiE3_clES2_S5_S7_S9_S9_ddbi.constprop.11263
    @ 0x419df3 main
    @ 0x7f5b68301830 __libc_start_main
    @ 0x477779 _start
    @ 0x0 (unknown)
Illegal instruction (core dumped)

I forgot to mention: I also tried resuming with the continue command:

root@52c9a6946121:~/wav2letter/build# ./Train continue --flagsfile ../tutorials/1-librispeech_clean/train.cfg
E0619 18:14:03.629981 4177 Serial.h:74] Error while loading: failed to open file for reading
E0619 18:14:04.630724 4177 Serial.h:74] Error while loading: failed to open file for reading
E0619 18:14:06.631014 4177 Serial.h:74] Error while loading: failed to open file for reading
E0619 18:14:10.631307 4177 Serial.h:74] Error while loading: failed to open file for reading
E0619 18:14:18.631597 4177 Serial.h:74] Error while loading: failed to open file for reading
E0619 18:14:34.631917 4177 Serial.h:74] Error while loading: failed to open file for reading
terminate called after throwing an instance of 'std::runtime_error'
  what(): failed to open file for reading
*** Aborted at 1560968074 (unix time) try "date -d @1560968074" if you are using GNU date ***
PC: @ 0x7fe685847428 gsignal
*** SIGABRT (@0x1051) received by PID 4177 (TID 0x7fe690810bc0) from PID 4177; stack trace: ***
    @ 0x7fe686709390 (unknown)
    @ 0x7fe685847428 gsignal
    @ 0x7fe68584902a abort
    @ 0x7fe68618a84d __gnu_cxx::__verbose_terminate_handler()
    @ 0x7fe6861886b6 (unknown)
    @ 0x7fe686188701 std::terminate()
    @ 0x7fe686188969 __cxa_rethrow
    @ 0x49255f w2l::retryWithBackoff<>()
    @ 0x41a661 main
    @ 0x7fe685832830 __libc_start_main
    @ 0x477779 _start
    @ 0x0 (unknown)
Aborted (core dumped)
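On the continue failure: the repeated "Error while loading: failed to open file for reading" retries from w2l::retryWithBackoff mean Train could not open a saved model to resume from, which is expected here, since the first run crashed during epoch 1 before any checkpoint was written. A hedged check, with the run-directory path assembled from the --rundir=/root/ and --runname=librispeech_clean_trainlogs flags in the log (the exact checkpoint filenames inside it are an assumption):

```shell
# If no serialized model exists in the run directory, `continue` has
# nothing to resume from and will abort; use `./Train train` instead.
RUNDIR=/root/librispeech_clean_trainlogs   # from --rundir + --runname
ls -l "$RUNDIR" 2>/dev/null || echo "no run directory at $RUNDIR: nothing to continue from"
```

So the continue crash is a symptom of the earlier SIGILL, not a separate bug.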