flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Help: OOM error (ArrayFire exception) during training a tds-s2s model #766

Open dh7ahn opened 4 years ago

dh7ahn commented 4 years ago

Hello,

While training a TDS seq-to-seq model, CUDA OOM errors (ArrayFire exceptions) are constantly raised. The training configuration is:


I0726 06:10:38.216768 14711 Train.cpp:151] Gflags after parsing --flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=am_tds_s2s.arch; --archdir=am_arch; --attention=keyvalue; --attentionthreshold=2147483647; --attnWindow=softPretrain; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=16; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --blobdata=false; --channels=1; --criterion=seq2seq; --critoptim=sgd; --datadir=; --dataorder=output_spiral; --decoderattnround=2; --decoderdropout=0.10000000000000001; --decoderrnnlayer=3; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=512; --eosscore=0; --eostoken=true; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=train_am_tds_s2s-large.cfg; --framesizems=30; --framestridems=10; --gamma=0.75; --gumbeltemperature=1; --input=flac; --inputbinsize=25; --inputfeeding=false; --isbeamdump=false; --iter=10000000000; --itersave=false; --labelsmooth=0.050000000000000003; --leftWindowSize=50; --lexicon=model-large/am/model-large-train+dev-unigram-20000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lr=0.059999999999999998; --lr_decay=9223372036854775807; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0.059999999999999998; --max_devices_per_node=8; --maxdecoderoutputlen=120; --maxgradnorm=15; --maxisz=20000; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10240; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0.5; --netoptim=sgd; --noresample=false; --nthread=8; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=none; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=99; --pcttraineval=1; --pow=false; --pretrainWindow=3; --replabel=0; --reportiters=5000; --rightWindowSize=50; --rndv_filepath=; --rundir=model-large; --runname=am_tds_s2s-large; --samplerate=16000; --sampletarget=0.01; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=; --seed=2; --show=false; --showletters=false; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=4; --sqnorm=false; --stepsize=2000000; --surround=; --tag=; --target=ltr; --test=; --tokens=model-large-unigram-20000.tokens; --tokensdir=model-large/am; --train=/root/exp/16khz/lists/model-large-tra.lst; --trainWithWindow=true; --transdiag=0; --unkscore=-inf; --usememcache=false; --uselexicon=true; --usewordpiece=true; --valid=dev:/root/exp/16khz/lists/model-large-dev.lst; --warmup=1; --weightdecay=0; --wordscore=0; --wordseparator=; --world_rank=0; --world_size=1; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; 
--logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=; I0726 06:10:38.217275 14711 Train.cpp:152] Experiment path: model-large/am_tds_s2s-large I0726 06:10:38.217279 14711 Train.cpp:153] Experiment runidx: 1 I0726 06:10:38.223557 14714 Train.cpp:199] Number of classes (network): 20000 I0726 06:10:38.224416 14713 Train.cpp:199] Number of classes (network): 20000 I0726 06:10:38.224618 14716 Train.cpp:199] Number of classes (network): 20000 I0726 06:10:38.224938 14717 Train.cpp:199] Number of classes (network): 20000 I0726 06:10:38.225425 14712 Train.cpp:199] Number of classes (network): 20000 I0726 06:10:38.225448 14715 Train.cpp:199] Number of classes (network): 20000 I0726 06:10:38.225740 14718 Train.cpp:199] Number of classes (network): 20000 I0726 06:10:38.226792 14711 Train.cpp:199] Number of classes (network): 20000 I0726 06:10:43.372723 14714 Train.cpp:206] Number of words: 897324 I0726 06:10:43.379463 14713 Train.cpp:206] Number of words: 897324 I0726 06:10:43.410107 14716 Train.cpp:206] Number of words: 897324 I0726 06:10:43.424656 14717 Train.cpp:206] Number of words: 897324 I0726 06:10:43.427860 14711 Train.cpp:206] Number of words: 897324 I0726 06:10:43.464855 14712 Train.cpp:206] Number of words: 897324 I0726 06:10:43.613556 14718 Train.cpp:206] Number of words: 897324 I0726 06:10:43.617025 14715 Train.cpp:206] Number of words: 897324 I0726 06:10:44.035785 14711 Train.cpp:220] Loading architecture file from am_arch/am_tds_s2s.arch I0726 06:10:46.006341 14711 Train.cpp:252] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> output] (0): SpecAugment ( W: 80, F: 27, mF: 2, T: 100, p: 1, mT: 2 ) (1): View (-1 80 1 0) (2): Conv2D (1->10, 21x1, 2,1, SAME,SAME, 1, 1) (with bias) (3): ReLU (4): Dropout (0.000000) (5): LayerNorm ( axis : { 0 1 2 } , size : -1) (6): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800] (7): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800] (8): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800] (9): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800] (10): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800] (11): Conv2D (10->14, 21x1, 2,1, SAME,SAME, 1, 1) (with bias) (12): ReLU (13): Dropout (0.000000) (14): LayerNorm ( axis : { 0 1 2 } , size : -1) (15): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120] (16): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120] (17): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120] (18): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120] (19): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120] (20): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120] (21): Conv2D (14->18, 21x1, 2,1, SAME,SAME, 1, 1) (with bias) (22): ReLU (23): Dropout (0.000000) (24): LayerNorm ( axis : { 0 1 2 } , size : -1) (25): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440] (26): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440] (27): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440] (28): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440] (29): Time-Depth Separable 
Block (21, 240, 18) [1440 -> 4320 -> 1440] (30): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440] (31): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440] (32): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440] (33): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440] (34): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440] (35): View (0 1440 1 0) (36): Reorder (1,0,3,2) (37): Linear (1440->1024) (with bias) I0726 06:10:46.006460 14711 Train.cpp:253] [Network Params: 190462588] I0726 06:10:46.006520 14711 Train.cpp:254] [Criterion] Seq2SeqCriterion I0726 06:10:46.022763 14711 Train.cpp:262] [Network Optimizer] SGD (momentum=0.5) I0726 06:10:46.022789 14711 Train.cpp:263] [Criterion Optimizer] SGD I0726 06:11:07.768970 14712 W2lListFilesDataset.cpp:141] 6003510 files found. I0726 06:11:07.865386 14712 Utils.cpp:102] Filtered 821/6003510 samples I0726 06:11:07.870270 14717 W2lListFilesDataset.cpp:141] 6003510 files found. I0726 06:11:07.965108 14717 Utils.cpp:102] Filtered 821/6003510 samples I0726 06:11:08.137738 14714 W2lListFilesDataset.cpp:141] 6003510 files found. I0726 06:11:08.233012 14714 Utils.cpp:102] Filtered 821/6003510 samples I0726 06:11:08.592602 14711 W2lListFilesDataset.cpp:141] 6003510 files found. I0726 06:11:08.592602 14713 W2lListFilesDataset.cpp:141] 6003510 files found. I0726 06:11:08.684429 14716 W2lListFilesDataset.cpp:141] 6003510 files found. I0726 06:11:08.692782 14711 Utils.cpp:102] Filtered 821/6003510 samples I0726 06:11:08.693179 14713 Utils.cpp:102] Filtered 821/6003510 samples I0726 06:11:08.780925 14716 Utils.cpp:102] Filtered 821/6003510 samples I0726 06:11:08.963624 14715 W2lListFilesDataset.cpp:141] 6003510 files found. I0726 06:11:09.069039 14715 Utils.cpp:102] Filtered 821/6003510 samples I0726 06:11:09.132992 14718 W2lListFilesDataset.cpp:141] 6003510 files found. I0726 06:11:09.231608 14718 Utils.cpp:102] Filtered 821/6003510 samples I0726 06:11:09.528175 14712 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 46896 I0726 06:11:09.623977 14717 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 46896 I0726 06:11:09.918501 14714 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 46896 I0726 06:11:10.444139 14711 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 46896 I0726 06:11:10.445180 14713 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 46896 I0726 06:11:10.519066 14716 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 46896 I0726 06:11:10.834561 14715 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 46896 I0726 06:11:10.959957 14718 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 46896 I0726 06:11:11.573549 14712 W2lListFilesDataset.cpp:141] 20078 files found. I0726 06:11:11.573864 14712 Utils.cpp:102] Filtered 2/20078 samples I0726 06:11:11.576429 14712 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 157 I0726 06:11:11.645463 14717 W2lListFilesDataset.cpp:141] 20078 files found. I0726 06:11:11.645779 14717 Utils.cpp:102] Filtered 2/20078 samples I0726 06:11:11.648335 14717 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 157 I0726 06:11:11.946784 14714 W2lListFilesDataset.cpp:141] 20078 files found. I0726 06:11:11.947094 14714 Utils.cpp:102] Filtered 2/20078 samples I0726 06:11:11.949707 14714 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 157 I0726 06:11:12.487951 14711 W2lListFilesDataset.cpp:141] 20078 files found. 
I0726 06:11:12.488260 14711 Utils.cpp:102] Filtered 2/20078 samples I0726 06:11:12.490029 14713 W2lListFilesDataset.cpp:141] 20078 files found. I0726 06:11:12.490317 14713 Utils.cpp:102] Filtered 2/20078 samples I0726 06:11:12.490900 14711 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 157 I0726 06:11:12.492977 14713 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 157 I0726 06:11:12.498656 14711 Train.cpp:564] Shuffling trainset I0726 06:11:12.532735 14716 W2lListFilesDataset.cpp:141] 20078 files found. I0726 06:11:12.533036 14716 Utils.cpp:102] Filtered 2/20078 samples I0726 06:11:12.535691 14716 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 157 I0726 06:11:12.848047 14715 W2lListFilesDataset.cpp:141] 20078 files found. I0726 06:11:12.848336 14715 Utils.cpp:102] Filtered 2/20078 samples I0726 06:11:12.850936 14715 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 157 I0726 06:11:12.961169 14718 W2lListFilesDataset.cpp:141] 20078 files found. I0726 06:11:12.961506 14718 Utils.cpp:102] Filtered 2/20078 samples I0726 06:11:12.964121 14718 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 157 I0726 06:11:12.988827 14711 Train.cpp:571] Epoch 1 started! I0726 06:11:18.390285 14711 Train.cpp:728] Finished window pretraining. I0726 06:11:18.392300 14711 Train.cpp:564] Shuffling trainset I0726 06:11:18.412165 14711 Train.cpp:571] Epoch 2 started! I0726 06:35:05.265098 14711 Train.cpp:345] epoch: 2 | nupdates: 5000 | lr: 0.059957 | lrcriterion: 0.059957 | runtime: 00:23:21 | bch(ms): 280.24 | smp(ms): 1.31 | fwd(ms): 60.28 | crit-fwd(ms): 3.09 | bwd(ms): 187.91 | optim(ms): 25.52 | loss: 65.22172 | train-LER: 705.23 | train-WER: 741.86 | dev-loss: 53.41069 | dev-LER: 99.99 | dev-WER: 99.97 | avg-isz: 410 | avg-tsz: 007 | max-tsz: 031 | hrs: 730.53 | thrpt(sec/sec): 1876.51 Memory Manager Stats MemoryManager type: CachingMemoryManager Number of allocated blocks:462 Size of free block pool (small):98 Size of free block pool (large):334 Total native mallocs:440 Total native frees:0 I0726 06:59:08.439503 14711 Train.cpp:345] epoch: 2 | nupdates: 10000 | lr: 0.059914 | lrcriterion: 0.059914 | runtime: 00:23:28 | bch(ms): 281.79 | smp(ms): 0.82 | fwd(ms): 59.66 | crit-fwd(ms): 3.08 | bwd(ms): 192.88 | optim(ms): 25.44 | loss: 59.95948 | train-LER: 349.05 | train-WER: 409.66 | dev-loss: 55.12523 | dev-LER: 99.96 | dev-WER: 99.93 | avg-isz: 412 | avg-tsz: 007 | max-tsz: 032 | hrs: 732.88 | thrpt(sec/sec): 1872.60 Memory Manager Stats MemoryManager type: CachingMemoryManager Number of allocated blocks:462 Size of free block pool (small):105 Size of free block pool (large):356 Total native mallocs:456 Total native frees:0 I0726 07:22:58.247351 14711 Train.cpp:345] epoch: 2 | nupdates: 15000 | lr: 0.059871 | lrcriterion: 0.059871 | runtime: 00:23:13 | bch(ms): 278.64 | smp(ms): 0.87 | fwd(ms): 58.78 | crit-fwd(ms): 3.05 | bwd(ms): 190.48 | optim(ms): 25.46 | loss: 57.76739 | train-LER: 267.25 | train-WER: 340.69 | dev-loss: 51.20142 | dev-LER: 92.74 | dev-WER: 99.34 | avg-isz: 405 | avg-tsz: 007 | max-tsz: 033 | hrs: 721.39 | thrpt(sec/sec): 1864.09 Memory Manager Stats MemoryManager type: CachingMemoryManager Number of allocated blocks:462 Size of free block pool (small):94 Size of free block pool (large):355 Total native mallocs:456 Total native frees:0 I0726 07:46:44.225791 14711 Train.cpp:345] epoch: 2 | nupdates: 20000 | lr: 0.059828 | lrcriterion: 0.059828 | runtime: 00:23:06 | bch(ms): 277.27 | smp(ms): 0.88 | fwd(ms): 58.74 | crit-fwd(ms): 3.07 | bwd(ms): 190.44 
| optim(ms): 25.46 | loss: 57.77191 | train-LER: 100.90 | train-WER: 109.33 | dev-loss: 51.00814 | dev-LER: 90.82 | dev-WER: 99.61 | avg-isz: 405 | avg-tsz: 007 | max-tsz: 040 | hrs: 721.07 | thrpt(sec/sec): 1872.45 Memory Manager Stats MemoryManager type: CachingMemoryManager Number of allocated blocks:462 Size of free block pool (small):107 Size of free block pool (large):352 Total native mallocs:456 Total native frees:0 Unable to allocate memory with native alloc for size 207618048 bytes with error 'ArrayFire Exception (Device out of memory:101): ArrayFire error: In function fl::MemoryManagerInstaller::MemoryManagerInstaller(std::shared_ptr)::<lambda(size_t)> In file /root/tools/flashlight-20200722/flashlight/memory/MemoryManagerInstaller.cpp:191'terminate called after throwing an instance of 'af::exception' what(): ArrayFire Exception (Device out of memory:101): In function af_err af_matmul(void**, af_array, af_array, af_mat_prop, af_mat_prop) In file src/api/c/blas.cpp:238

0# af_matmul in /opt/arrayfire/lib/libafcuda.so.3 1# af::matmul(af::array const&, af::array const&, af_mat_prop, af_mat_prop) in /opt/arrayfire/lib/libafcuda.so.3 2# 0x000055A490BA49D2 in Train 3# 0x000055A490BE50CC in Train 4# 0x000055A490BEB5BE in Train 5# 0x000055A490BD56CA in Train 6# 0x000055A490B0040B in Train 7# 0x000055A490BD56CA in Train 8# 0x000055A490909D23 in Train 9# 0x000055A49089F6B8 in Train 10# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6 11# 0x000055A490903FFA in Train

In function af::array af::matmul(const af::array&, const af::array&, af::matProp, af::matProp) In file src/api/cpp/blas.cpp:18 Aborted at 1595750180 (unix time) try "date -d @1595750180" if you are using GNU date PC: @ 0x7fb8d3994f47 gsignal SIGABRT (@0x397b) received by PID 14715 (TID 0x7fb919329980) from PID 14715; stack trace: @ 0x7fb8d4f048a0 (unknown) @ 0x7fb8d3994f47 gsignal @ 0x7fb8d39968b1 abort @ 0x7fb8d45b8957 (unknown) @ 0x7fb8d45beab6 (unknown) @ 0x7fb8d45beaf1 std::terminate() @ 0x7fb8d45bed24 cxa_throw @ 0x7fb8f6f4b694 af::matmul() @ 0x55a490ba49d2 fl::linear() @ 0x55a490be50cc fl::Linear::forward() @ 0x55a490beb5be fl::UnaryModule::forward() @ 0x55a490bd56ca fl::Sequential::forward() @ 0x55a490b0040b w2l::TDSBlock::forward() @ 0x55a490bd56ca fl::Sequential::forward() @ 0x55a490909d23 _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddblE3_clES2_S5_S7_S9_S9_ddbl @ 0x55a49089f6b8 main @ 0x7fb8d3977b97 __libc_start_main @ 0x55a490903ffa _start Aborted at 1595750278 (unix time) try "date -d @1595750278" if you are using GNU date PC: @ 0x7fff0f095b62 ([vdso]+0xb61) SIGTERM (@0x396e) received by PID 14711 (TID 0x7f8282be1980) from PID 14702; stack trace: @ 0x7f823e7bc8a0 (unknown) @ 0x7fff0f095b62 ([vdso]+0xb61) @ 0x7f823d33f056 clock_gettime @ 0x7f823a1ecd0e (unknown) @ 0x7f823a2a7c77 (unknown) @ 0x7f823a1d53cc (unknown) @ 0x7f823a1d5529 (unknown) @ 0x7f823a0e0651 (unknown) @ 0x7f823a256370 cuStreamSynchronize @ 0x7f82625a8250 cudart::cudaApiStreamSynchronize() @ 0x7f82625e331d cudaStreamSynchronize @ 0x7f825fe5f849 cuda::sync() @ 0x7f82605061df af_sync @ 0x7f8260816247 af::sync() @ 0x564401d1414f _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddblE3_clES2_S5_S7_S9_S9_ddbl @ 0x564401ca96b8 main @ 0x7f823d22fb97 libc_start_main @ 0x564401d0dffa _start Aborted at 1595750278 (unix time) try "date -d @1595750278" if you are using GNU date PC: @ 0x7fffc519fb62 ([vdso]+0xb61) SIGTERM (@0x396e) received by PID 14717 (TID 0x7f5c0cc12980) from PID 14702; stack trace: @ 0x7f5bc87ed8a0 (unknown) @ 0x7fffc519fb62 ([vdso]+0xb61) @ 0x7f5bc7370056 clock_gettime @ 0x7f5bc421dd0e (unknown) @ 0x7f5bc42d8c77 (unknown) @ 0x7f5bc42063cc (unknown) @ 0x7f5bc4206529 (unknown) @ 0x7f5bc4111651 (unknown) @ 0x7f5bc4287370 cuStreamSynchronize @ 0x7f5bec5d9250 cudart::cudaApiStreamSynchronize() @ 0x7f5bec61431d cudaStreamSynchronize @ 0x7f5bec62b915 cuda::gemm<>() @ 0x7f5bea4c9702 af_gemm @ 0x7f5bea4caa2b af_matmul @ 0x7f5bea834544 af::matmul() @ 0x5605ddd422f6 fl::matmul() @ 0x5605ddd42da7 _ZNSt17_Function_handlerIFvRSt6vectorIN2fl8VariableESaIS2_EERKS2_EZNS1_8matmulNTES7_S7_EUlS5_S7_E_E9_M_invokeERKSt9_Any_dataS5S7 @ 0x5605ddd32f25 fl::Variable::backward() @ 0x5605ddd334ae fl::Variable::backward() @ 0x5605ddaaa12f _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddblE3_clES2_S5_S7_S9_S9_ddbl @ 0x5605dda3f6b8 main @ 0x7f5bc7260b97 libc_start_main @ 0x5605ddaa3ffa _start Aborted at 1595750278 (unix time) try "date -d @1595750278" if you are using GNU date PC: @ 0x7ffc32fe1b62 ([vdso]+0xb61) SIGTERM (@0x396e) received by PID 14716 (TID 0x7f2894d86980) from PID 14702; stack trace: @ 0x7f28509618a0 (unknown) @ 0x7ffc32fe1b62 ([vdso]+0xb61) @ 0x7f284f4e4056 clock_gettime @ 0x7f284c391d0e (unknown) @ 0x7f284c44cc77 (unknown) @ 0x7f284c37a3cc (unknown) @ 0x7f284c37a529 (unknown) @ 
0x7f284c285651 (unknown) @ 0x7f284c3fb370 cuStreamSynchronize @ 0x7f287474d250 cudart::cudaApiStreamSynchronize() @ 0x7f287478831d cudaStreamSynchronize @ 0x7f2872004849 cuda::sync() @ 0x7f28726ab1df af_sync @ 0x7f28729bb247 af::sync() @ 0x563c4956d14f _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddblE3_clES2_S5_S7_S9_S9_ddbl @ 0x563c495026b8 main @ 0x7f284f3d4b97 libc_start_main @ 0x563c49566ffa _start Aborted at 1595750278 (unix time) try "date -d @1595750278" if you are using GNU date PC: @ 0x7ffedfee2b62 ([vdso]+0xb61) SIGTERM (@0x396e) received by PID 14712 (TID 0x7fbe69671980) from PID 14702; stack trace: @ 0x7fbe2524c8a0 (unknown) @ 0x7ffedfee2b62 ([vdso]+0xb61) @ 0x7fbe23dcf056 clock_gettime @ 0x7fbe20c7cd0e (unknown) @ 0x7fbe20d37c77 (unknown) @ 0x7fbe20c653cc (unknown) @ 0x7fbe20c65529 (unknown) @ 0x7fbe20b70651 (unknown) @ 0x7fbe20ce6370 cuStreamSynchronize @ 0x7fbe49038250 cudart::cudaApiStreamSynchronize() @ 0x7fbe4907331d cudaStreamSynchronize @ 0x7fbe468ef849 cuda::sync() @ 0x7fbe46f961df af_sync @ 0x7fbe472a6247 af::sync() @ 0x55e3882cc14f _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddblE3_clES2_S5_S7_S9_S9_ddbl @ 0x55e3882616b8 main @ 0x7fbe23cbfb97 libc_start_main @ 0x55e3882c5ffa _start Aborted at 1595750278 (unix time) try "date -d @1595750278" if you are using GNU date PC: @ 0x7fff0f9f0b62 ([vdso]+0xb61) SIGTERM (@0x396e) received by PID 14713 (TID 0x7fd064ad9980) from PID 14702; stack trace: @ 0x7fd0206b48a0 (unknown) @ 0x7fff0f9f0b62 ([vdso]+0xb61) @ 0x7fd01f237056 clock_gettime @ 0x7fd01c0e4d0e (unknown) @ 0x7fd01c19fc77 (unknown) @ 0x7fd01c0cd3cc (unknown) @ 0x7fd01c0cd529 (unknown) @ 0x7fd01bfd8651 (unknown) @ 0x7fd01c14e370 cuStreamSynchronize @ 0x7fd0444a0250 cudart::cudaApiStreamSynchronize() @ 0x7fd0444db31d cudaStreamSynchronize @ 0x7fd041d57849 cuda::sync() @ 0x7fd0423fe1df af_sync @ 0x7fd04270e247 af::sync() @ 0x564bc691d14f _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddblE3_clES2_S5_S7_S9_S9_ddbl @ 0x564bc68b26b8 main @ 0x7fd01f127b97 libc_start_main @ 0x564bc6916ffa _start Aborted at 1595750278 (unix time) try "date -d @1595750278" if you are using GNU date PC: @ 0x7ffcafbb2b62 ([vdso]+0xb61) SIGTERM (@0x396e) received by PID 14714 (TID 0x7f373505f980) from PID 14702; stack trace: @ 0x7f36f0c3a8a0 (unknown) @ 0x7ffcafbb2b62 ([vdso]+0xb61) @ 0x7f36ef7bd056 clock_gettime @ 0x7f36ec66ad0e (unknown) @ 0x7f36ec725c77 (unknown) @ 0x7f36ec6533cc (unknown) @ 0x7f36ec653529 (unknown) @ 0x7f36ec55e651 (unknown) @ 0x7f36ec6d4370 cuStreamSynchronize @ 0x7f3714a26250 cudart::cudaApiStreamSynchronize() @ 0x7f3714a6131d cudaStreamSynchronize @ 0x7f37122dd849 cuda::sync() @ 0x7f37129841df af_sync @ 0x7f3712c94247 af::sync() @ 0x5562e579f14f _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddblE3_clES2_S5_S7_S9_S9_ddbl @ 0x5562e57346b8 main @ 0x7f36ef6adb97 libc_start_main @ 0x5562e5798ffa _start Aborted at 1595750278 (unix time) try "date -d @1595750278" if you are using GNU date PC: @ 0x7ffcec1bfb62 ([vdso]+0xb61) SIGTERM (@0x396e) received by PID 14718 (TID 0x7f30b0eeb980) from PID 14702; stack trace: @ 0x7f306cac68a0 (unknown) @ 0x7ffcec1bfb62 ([vdso]+0xb61) @ 0x7f306b649056 clock_gettime @ 0x7f30684f6d0e (unknown) @ 0x7f30685b1c77 (unknown) @ 0x7f30684df3cc 
(unknown) @ 0x7f30684df529 (unknown) @ 0x7f30683ea651 (unknown) @ 0x7f3068560370 cuStreamSynchronize @ 0x7f30908b2250 cudart::cudaApiStreamSynchronize() @ 0x7f30908ed31d cudaStreamSynchronize @ 0x7f3090904915 cuda::gemm<>() @ 0x7f308e7a2702 af_gemm @ 0x7f308e7a3a2b af_matmul @ 0x7f308eb0d544 af::matmul() @ 0x55a708a4c2f6 fl::matmul() @ 0x55a708a4cda7 _ZNSt17_Function_handlerIFvRSt6vectorIN2fl8VariableESaIS2_EERKS2_EZNS1_8matmulNTES7_S7_EUlS5_S7_E_E9_M_invokeERKSt9_Any_dataS5S7 @ 0x55a708a3cf25 fl::Variable::backward() @ 0x55a708a3d4ae fl::Variable::backward() @ 0x55a7087b412f _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddblE3_clES2_S5_S7_S9_S9_ddbl @ 0x55a7087496b8 main @ 0x7f306b539b97 __libc_start_main @ 0x55a7087adffa _start

mpirun noticed that process rank 4 with PID 0 on node 1a0e97d4a4ab exited on signal 6 (Aborted).


Things I noticed (per-GPU memory usage in MB, sampled every few seconds):

06:06:23 - 22324 20304 23388 20476 20738 22358 21026 20178 06:06:27 - 22324 20304 23388 20476 20738 22358 21026 20178 06:06:31 - 22324 20304 23388 20476 20738 22358 21026 20178 06:06:35 - 22324 20304 23388 20476 20738 22358 21026 20178 06:06:38 - 22324 20304 23388 20476 20738 22358 21026 20178 06:06:42 - 22324 20304 23388 20476 20738 22358 21026 20178 06:06:46 - 22324 20304 23388 20476 20738 22358 21026 20178 06:06:50 - 22324 20304 23388 20476 20738 22358 21026 20178 06:06:54 - 22324 20304 23388 20476 20738 22358 21026 20178 06:06:58 - 22324 20304 23388 20476 20738 22358 21026 20178 06:07:02 - 22324 20304 23388 32404 20738 22358 21026 20178 06:07:05 - 22324 20304 23388 32454 20738 22358 21026 20178 06:07:09 - 22324 20304 23388 32454 20738 22358 21026 20178 06:07:12 - 22324 20304 23388 32454 20738 22358 21026 20178 06:07:16 - 22324 20304 23388 32454 20738 22358 21026 20178 06:07:19 - 22324 20304 23388 32454 20738 22358 21026 20178 06:07:23 - 22324 20304 23388 32454 20738 22358 21026 20178 06:07:26 - 22324 20304 23388 32454 20738 22358 21026 20178 06:07:30 - 22324 20304 23388 32454 20738 22358 21026 20178 06:07:33 - 22324 20304 23388 32454 20738 22358 21026 20178 06:07:37 - 22324 20304 23388 32454 20738 22358 21026 20178 06:07:40 - 22324 20304 23388 32454 20738 22358 21026 20178 06:07:44 - 22324 20304 23388 32454 20738 22358 21026 20178 06:07:47 - 22324 20304 23388 32454 20738 22358 21026 20178 06:07:50 - 22324 20304 23388 32454 20738 22358 21026 20178 06:07:54 - 22324 20304 23388 32454 20738 22358 21026 20178 06:07:57 - 22324 20304 23388 32454 20738 22358 21026 20178 06:08:01 - 22324 20304 23388 32454 20738 22358 21026 20178 06:08:04 - 22324 20304 23388 32454 20738 22358 21026 20178 06:08:08 - 22324 20304 23388 32454 20738 22358 21026 20178 06:08:11 - 22324 20304 23388 32454 20738 22358 21026 20178 06:08:15 - 22324 20304 23388 32454 20738 22358 21026 20178 06:08:18 - 22324 20304 23388 32454 20738 22358 21026 20178 06:08:22 - 22324 20304 23388 32454 20738 22358 21026 20178 06:08:25 - 22324 20304 23388 32454 20738 22358 21026 20178 06:08:29 - 22324 20304 23388 32454 20738 22358 21026 20178 06:08:32 - 22324 20304 23388 32454 20738 22358 21026 20178 06:08:36 - 22324 20304 23388 32454 20738 22358 21026 20178 06:08:39 - 22324 20304 23388 32454 20738 22358 21026 20178 06:08:43 - 25 26 23377 0 0 0 0 0

tlikhomanenko commented 4 years ago

What are the maximum target and input lengths in your dataset? Is it crashing on the same GPU every time, or can it happen on different ones?

cc @avidov @jacobkahn

dh7ahn commented 4 years ago
  1. Maximum target length: 75 tokens.
  2. Input length is limited to 20 seconds (--maxisz=20000); the maximum input length in the data is 19.99 s.
  3. The maximum memory usage (over 30 GB) always appears on the same GPU (the 4th one), and the crashes seem to occur on that same GPU. By the way, I wonder why the memory usage never decreases.
tlikhomanenko commented 4 years ago

Could you try running training on 7 GPUs, excluding the 4th one, just to make sure the problem is with that GPU?

avidov commented 4 years ago

> By the way, I wonder why the memory usage never decreases.

The memory manager only frees memory when an allocation fails: CachingMemoryManager.cpp#L249

This is similar to garbage collection: all cached memory is freed at once when an allocation fails, and then the allocation is retried.

```cpp
void CachingMemoryManager::mallocWithRetry(size_t size, void** ptr) {
  // Try nativeMalloc. If nativeMalloc fails, frees all non-split cached blocks
  // and retries.
  auto& memoryInfo = getDeviceMemoryInfo();
  try {
    ++memoryInfo.stats.totalNativeMallocs;
    *ptr = this->deviceInterface->nativeAlloc(size);
  } catch (std::exception& exUnused) {
    try {
      signalMemoryCleanup();
      ++memoryInfo.stats.totalNativeMallocs;
      *ptr = this->deviceInterface->nativeAlloc(size);
    } catch (std::exception& ex) {
      // ...
```

It is important that the JIT executes at the right time: after JIT execution there should be some memory available to free. However, JIT execution itself can be memory hungry.

You can play with the JIT execution timing by returning true here, CachingMemoryManager.cpp#L306, based on the value returned by getDeviceMemoryInfo().
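For illustration, here is a minimal sketch of that idea, assuming the hook at that line is the jitTreeExceedsMemoryPressure override that ArrayFire's memory-manager API consults; the stats field names and the fixed threshold below are assumptions, not the actual flashlight API:

```cpp
// Hypothetical sketch only: report memory pressure once the device is nearly
// full, so ArrayFire evaluates the JIT tree (and frees intermediates) earlier.
// The fields on memoryInfo.stats are assumed names -- check the real struct.
bool CachingMemoryManager::jitTreeExceedsMemoryPressure(size_t bytes) {
  auto& memoryInfo = getDeviceMemoryInfo(); // per-device info, as in mallocWithRetry()
  const size_t inUse = memoryInfo.stats.allocatedBytes;   // assumed field name
  const size_t budget = static_cast<size_t>(0.9 * 32e9);  // ~90% of a 32 GB card, tune as needed
  return inUse + bytes > budget;
}
```

The tradeoff mentioned above still applies: returning true more eagerly frees intermediates sooner, but the JIT evaluation it triggers needs memory of its own.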

dh7ahn commented 4 years ago

I see that the malloc retry is part of the caching logic. Thank you for the kind explanation.

> Could you try running training on 7 GPUs, excluding the 4th one, just to make sure the problem is with that GPU?

Regarding the GPU check: I tried it, but got the same result.

... 56:24:12 - 15638 18302 16510 0 15962 17578 17670 15608 56:24:15 - 15638 18302 16510 0 15962 17578 17670 15608 56:24:19 - 15638 18302 16510 0 15962 17578 17670 15608 56:24:23 - 15638 18302 16510 0 15962 17578 17670 15608 56:24:27 - 15638 18302 16510 0 15962 17578 17670 15608 56:24:30 - 15638 18302 16510 0 15962 32424 17670 15608 56:24:34 - 15638 18302 16510 0 15962 32424 17670 15608 56:24:37 - 15638 18302 16510 0 15962 32424 17670 15608 56:24:41 - 15638 18302 16510 0 15962 32424 17670 15608 56:24:44 - 15638 18302 16510 0 15962 32424 17670 15608 56:24:48 - 15638 18302 16510 0 15962 32424 17670 15608 56:24:51 - 15638 18302 16510 0 15962 32424 17670 15608 56:24:55 - 15638 18302 16510 0 15962 32424 17670 15608 56:24:58 - 15638 18302 16510 0 15962 32424 17670 15608 56:25:02 - 15638 18302 16510 0 15962 32424 17670 15608 56:25:05 - 15638 18302 16510 0 15962 32424 17670 15608 56:25:09 - 15638 18302 16510 0 15962 32424 17670 15608 56:25:12 - 15638 18302 16510 0 15962 32424 17670 15608 56:25:16 - 15638 18302 16510 0 15962 32424 17670 15608 56:25:19 - 15638 18302 16510 0 15962 32424 17670 15608 56:25:23 - 15638 18302 16510 0 15962 32424 17670 15608 56:25:26 - 15638 18302 16510 0 15962 32424 17670 15608 56:25:30 - 15638 18302 16510 0 15962 32424 17670 15608 56:25:33 - 15638 18302 16510 0 15962 32424 17670 15608 56:25:37 - 15638 18302 16510 0 15962 32424 17670 15608 56:25:40 - 15638 18302 16510 0 15962 32424 17670 15608 56:25:44 - 15638 18302 16510 0 15962 32424 17670 15608 56:25:47 - 15638 18302 16510 0 15962 32424 17670 15608 56:25:51 - 0 0 0 0 0 0 0 0 56:25:55 - 0 0 0 0 0 0 0 0

Even with the 4th GPU excluded, memory usage on another GPU suddenly jumped from 17578 MB to 32424 MB, and then an OOM was raised very early in training. So I don't think the problem is tied to a specific GPU. What causes this sudden increase in memory usage (up to the memory limit) on only one of several GPUs? Is it some memory-management policy or feature? Could it be related to the cause of the OOM exception?

One thing I found: after trying several settings, a batch size of 8 works well, with no OOM for several epochs, though the memory usage still does not look efficient.

03:30:25 - 12780 11940 14338 15344 13454 15504 12918 12740 03:30:28 - 12780 11940 14338 15344 13454 15504 12918 12740 03:30:32 - 12780 11940 14338 15344 13454 15504 12918 12740 03:30:35 - 12780 11940 14338 15344 13454 15504 12918 12740 03:30:39 - 12780 11940 14338 15344 13454 15504 12918 12740 03:30:43 - 12780 11940 14338 15344 13454 15504 12918 12740 03:30:46 - 12780 11940 14338 15344 13454 15504 12918 12740 03:30:50 - 12780 11940 14338 15344 13454 15504 12918 12740 03:30:54 - 12780 11940 14338 15344 13454 15504 12918 12740 03:30:57 - 12780 11940 14338 15344 13454 15504 25424 12740 03:31:01 - 12780 11940 14338 15344 13454 15504 25424 12740 03:31:05 - 12780 11940 14338 15344 13454 15504 25424 12740 03:31:08 - 12780 11940 14338 15344 13454 15504 25424 12740 03:31:12 - 12780 11940 14338 15344 13454 15504 25424 12740 03:31:15 - 12780 11940 14338 15344 13454 15504 25424 12740 03:31:19 - 12780 11940 14338 15344 13454 15504 25424 12740 03:31:23 - 12780 11940 14338 15344 13454 15504 25424 12740 03:31:26 - 12780 11940 14338 15344 13454 15504 25424 12740 03:31:30 - 12780 11940 14338 15344 13454 15504 25424 12740 03:31:33 - 12780 11940 14338 15344 13454 15504 25424 12740 03:31:37 - 13964 13124 14338 15344 13454 15504 25424 12740 03:31:41 - 13964 13124 14338 15344 13454 15504 25424 12740 03:31:44 - 13964 13124 14338 15344 13454 15504 25424 12740 03:31:48 - 13964 13124 14338 15344 13454 15504 25424 12740 03:31:51 - 13964 13124 14338 15344 13454 15504 25424 12740 03:31:55 - 13964 13124 14338 15344 13454 15504 25424 12740 03:31:59 - 13964 13124 14338 15344 13454 15504 25424 12740 03:32:02 - 13964 13124 14338 15344 13454 15504 25424 12740 03:32:06 - 13964 13124 14338 15344 13454 15504 25424 12740 03:32:10 - 13964 13124 14338 15344 13454 15504 25424 12740 03:32:13 - 13964 13124 14338 15344 13454 15504 25424 12740 03:32:17 - 13964 13124 14338 15344 13454 15504 25424 12740 03:32:21 - 13964 13124 14338 15344 13454 15504 25424 12740 03:32:24 - 13964 13124 14338 15344 13454 15504 25424 12740 03:32:28 - 13964 13124 14338 15344 13454 15504 25424 12740 03:32:32 - 13964 13124 14338 15344 13454 15504 25424 12740 (same logs follow)

Some observations are:

  1. Even in this case, as in the problematic runs above, one GPU's memory usage suddenly increased (from 12918 MB to 25424 MB).
  2. The memory usage has stayed at the values in the last line for several epochs (no OOM so far).
  3. The memory usage is not efficient: except for the one GPU at 25424 MB, all GPUs use less than half of the 32 GB available.

Any help or comment? :(

tlikhomanenko commented 4 years ago

Input data varies in size; batches with large input/target sizes could be causing this. Can you plot statistics on the input and target sizes in your data? You can try to avoid the problem by filtering out samples with large input/target sizes using --maxisz and --maxtsz. Could you check this?
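If it helps, here is a small standalone sketch for computing those stats. It assumes the usual wav2letter list format, one sample per line: id, audio path, duration in milliseconds, transcription; the transcription word count is only a rough proxy for the tokenized target length that --maxtsz actually filters on.

```cpp
// Rough input/target size stats over a wav2letter-style list file.
// Assumed line format: <id> <audio path> <duration in ms> <transcription ...>
#include <algorithm>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main(int argc, char** argv) {
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " <list file>\n";
    return 1;
  }
  std::ifstream in(argv[1]);
  std::string line;
  double maxDurMs = 0, sumDurMs = 0;
  std::size_t maxTgt = 0, sumTgt = 0, n = 0;
  while (std::getline(in, line)) {
    std::istringstream ss(line);
    std::string id, path, word;
    double durMs = 0;
    if (!(ss >> id >> path >> durMs)) {
      continue; // skip malformed lines
    }
    std::size_t tgt = 0;
    while (ss >> word) {
      ++tgt; // count whitespace-separated words as a proxy for target size
    }
    maxDurMs = std::max(maxDurMs, durMs);
    maxTgt = std::max(maxTgt, tgt);
    sumDurMs += durMs;
    sumTgt += tgt;
    ++n;
  }
  if (n == 0) {
    return 1;
  }
  std::cout << "samples: " << n << "\n"
            << "duration ms   max: " << maxDurMs << "  mean: " << sumDurMs / n << "\n"
            << "target words  max: " << maxTgt << "  mean: "
            << static_cast<double>(sumTgt) / n << "\n";
  return 0;
}
```

Once the outliers are visible, tightening --maxisz / --maxtsz in the flagsfile drops the offending samples before batching.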