facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Wav2vec2 error after validation when training: terminate called after throwing an instance of 'c10::Error' #3674

Open Jxu-Thu opened 3 years ago

Jxu-Thu commented 3 years ago

🐛 Bug

Wav2vec2 error after validation when training

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

TASK=audio_pretraining
CRITERION=ctc

MAX_TOKENS=960000
# about 16.6 sentences

# Optimization
TOTAL_UPDATES=80000
UPDATE_FREQ=4 #for 2 ka?
LR=0.00003

# LR Scheduler
WARMUP_UPDATES=8000
HOLD_STEPS=32000
DECAY_STEPS=40000
FINAL_LR_SCALE=0.05

# Model
ARCH=wav2vec_ctc
MASK_PROB=0.65
MASK_CHANNLE_PROB=0.5
MASK_CHANNLE_LEN=64
MASK_LEN=10
FREEZE_FINETUNE_UPDATES=0

SEED=2337

DATA_DIR=/path/checkpoint_wav2vec/data/clean_100h
SAVE_DIR=/path/checkpoint_wav2vec/finetune_libri100h

RESUME_PATH=/path/checkpoint_wav2vec/wav2vec/checkpoint_last.pt

mkdir -p $SAVE_DIR

python $dist_config train.py $DATA_DIR --save-dir $SAVE_DIR --fp16 \
  --post-process letter --valid-subset valid --no-epoch-checkpoints \
  --best-checkpoint-metric wer --num-workers 4 \
  --max-update ${TOTAL_UPDATES} --sentence-avg \
  --task ${TASK} --arch ${ARCH} \
  --w2v-path ${RESUME_PATH} \
  --labels ltr \
  --apply-mask --mask-selection static --mask-other 0 --mask-length $MASK_LEN --mask-prob $MASK_PROB \
  --layerdrop 0.1 --mask-channel-selection static --mask-channel-other 0 \
  --mask-channel-length $MASK_CHANNLE_LEN --mask-channel-prob $MASK_CHANNLE_PROB \
  --zero-infinity --feature-grad-mult 0.0 --freeze-finetune-updates ${FREEZE_FINETUNE_UPDATES} \
  --validate-after-updates 10000 --optimizer adam \
  --adam-betas '(0.9, 0.98)' --adam-eps 1e-08 --lr $LR \
  --lr-scheduler tri_stage --warmup-steps ${WARMUP_UPDATES} --hold-steps ${HOLD_STEPS} \
  --skip-invalid-size-inputs-valid-test \
  --update-freq $UPDATE_FREQ \
  --decay-steps ${DECAY_STEPS} --final-lr-scale $FINAL_LR_SCALE --final-dropout 0.0 \
  --dropout 0.0 --activation-dropout 0.1 \
  --criterion ${CRITERION} \
  --distributed-no-spawn \
  --attention-dropout 0.0 --max-tokens ${MAX_TOKENS} \
  --seed ${SEED} --log-format json --log-interval 50 --ddp-backend no_c10d | tee -a $SAVE_DIR/log.txt
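For readers checking the learning-rate settings, here is a minimal Python sketch of the schedule that the `tri_stage` values above appear to describe. The shape (linear warm-up, constant hold, exponential decay) and the initial scale of 0.01 are assumptions about the scheduler, not something stated in this issue. The function name `tri_stage_lr` is hypothetical. Under these assumptions the warm-up values match the `lr` fields in the training log in this report.

```python
import math


def tri_stage_lr(step, peak_lr=3e-5, init_lr_scale=0.01, final_lr_scale=0.05,
                 warmup_steps=8000, hold_steps=32000, decay_steps=40000):
    """Hypothetical re-implementation of a three-stage LR schedule.

    Defaults mirror LR, WARMUP_UPDATES, HOLD_STEPS, DECAY_STEPS and
    FINAL_LR_SCALE from the script; init_lr_scale=0.01 is an assumption.
    """
    init_lr = init_lr_scale * peak_lr
    final_lr = final_lr_scale * peak_lr

    # Stage 1: linear warm-up from init_lr to peak_lr.
    if step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    step -= warmup_steps

    # Stage 2: hold at the peak learning rate.
    if step < hold_steps:
        return peak_lr
    step -= hold_steps

    # Stage 3: exponential decay from peak_lr down to final_lr.
    if step <= decay_steps:
        return peak_lr * math.exp(math.log(final_lr_scale) * step / decay_steps)

    # After the schedule ends, stay at final_lr.
    return final_lr


if __name__ == "__main__":
    # Compare against the "lr" values logged at these update counts.
    for n in (50, 100, 150, 200):
        print(n, tri_stage_lr(n))
```

With these settings the crash at update 205 happens very early in warm-up, when the learning rate is still close to its initial value.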

Environment

Additional context

2021-07-01 12:23:37 | INFO | fairseq.utils | CUDA enviroments for all 8 workers
2021-07-01 12:23:37 | INFO | fairseq.utils | rank 0: capabilities = 7.0 ; total memory = 15.782 GB ; name = Tesla V100-SXM2-16GB
2021-07-01 12:23:37 | INFO | fairseq.utils | rank 1: capabilities = 7.0 ; total memory = 15.782 GB ; name = Tesla V100-SXM2-16GB
2021-07-01 12:23:37 | INFO | fairseq.utils | rank 2: capabilities = 7.0 ; total memory = 15.782 GB ; name = Tesla V100-SXM2-16GB
2021-07-01 12:23:37 | INFO | fairseq.utils | rank 3: capabilities = 7.0 ; total memory = 15.782 GB ; name = Tesla V100-SXM2-16GB
2021-07-01 12:23:37 | INFO | fairseq.utils | rank 4: capabilities = 7.0 ; total memory = 15.782 GB ; name = Tesla V100-SXM2-16GB
2021-07-01 12:23:37 | INFO | fairseq.utils | rank 5: capabilities = 7.0 ; total memory = 15.782 GB ; name = Tesla V100-SXM2-16GB
2021-07-01 12:23:37 | INFO | fairseq.utils | rank 6: capabilities = 7.0 ; total memory = 15.782 GB ; name = Tesla V100-SXM2-16GB
2021-07-01 12:23:37 | INFO | fairseq.utils | rank 7: capabilities = 7.0 ; total memory = 15.782 GB ; name = Tesla V100-SXM2-16GB
2021-07-01 12:23:37 | INFO | fairseq.utils | CUDA enviroments for all 8 workers
2021-07-01 12:23:37 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)
2021-07-01 12:23:37 | INFO | fairseq_cli.train | max tokens per device = 1280000 and max sentences per device = None
2021-07-01 12:23:37 | INFO | fairseq.trainer | Preparing to load checkpoint /path/checkpoint_wav2vec/finetune_libri100h/checkpoint_last.pt
2021-07-01 12:23:37 | INFO | fairseq.trainer | No existing checkpoint found /path/checkpoint_wav2vec/finetune_libri100h/checkpoint_last.pt
2021-07-01 12:23:37 | INFO | fairseq.trainer | loading train data for epoch 1
2021-07-01 12:23:37 | INFO | fairseq.data.audio.raw_audio_dataset | loaded 28539, skipped 0 samples
2021-07-01 12:23:38 | INFO | fairseq.optim.adam | using FusedAdam
2021-07-01 12:23:38 | INFO | fairseq.trainer | begin training epoch 1
2021-07-01 12:23:38 | INFO | fairseq_cli.train | Start iterating over samples
2021-07-01 12:23:41 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 64.0
2021-07-01 12:23:42 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 32.0
2021-07-01 12:23:42 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 16.0
2021-07-01 12:24:00 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 8.0
2021-07-01 12:24:12 | INFO | train_inner | {"epoch": 1, "update": 0.257, "loss": "2340.18", "ntokens": "25355.1", "nsentences": "134.2", "nll_loss": "12.386", "wps": "42514.1", "ups": "1.68", "wpb": "25355.1", "bsz": "134.2", "num_updates": "50", "lr": "4.85625e-07", "gnorm": "1645.81", "loss_scale": "8", "train_wall": "33", "gb_free": "11.5", "wall": "35"}
2021-07-01 12:24:15 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 4.0
2021-07-01 12:24:42 | INFO | train_inner | {"epoch": 1, "update": 0.5, "loss": "2146.24", "ntokens": "25324.3", "nsentences": "136.92", "nll_loss": "11.604", "wps": "42276.3", "ups": "1.67", "wpb": "25324.3", "bsz": "136.9", "num_updates": "100", "lr": "6.7125e-07", "gnorm": "2119.39", "loss_scale": "4", "train_wall": "30", "gb_free": "11", "wall": "65"}
2021-07-01 12:25:11 | INFO | train_inner | {"epoch": 1, "update": 0.738, "loss": "1917.67", "ntokens": "25309", "nsentences": "135.44", "nll_loss": "10.262", "wps": "43470.7", "ups": "1.72", "wpb": "25309", "bsz": "135.4", "num_updates": "150", "lr": "8.56875e-07", "gnorm": "2846.55", "loss_scale": "4", "train_wall": "29", "gb_free": "10.7", "wall": "94"}
2021-07-01 12:25:40 | INFO | train_inner | {"epoch": 1, "update": 0.976, "loss": "1447.75", "ntokens": "25281.2", "nsentences": "139.42", "nll_loss": "7.984", "wps": "43491.5", "ups": "1.72", "wpb": "25281.2", "bsz": "139.4", "num_updates": "200", "lr": "1.0425e-06", "gnorm": "2930.65", "loss_scale": "4", "train_wall": "29", "gb_free": "10.9", "wall": "123"}
2021-07-01 12:25:43 | INFO | fairseq.checkpoint_utils | Preparing to save checkpoint for epoch 1 @ 205 updates
2021-07-01 12:25:43 | INFO | fairseq.trainer | Saving checkpoint to /path/checkpoint_wav2vec/finetune_libri100h/checkpoint_last.pt

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f1dbeb788b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f1dbedcaef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f1dbeb63b7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7f1e0d55b902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: /usr/bin/python() [0x5b43fa] frame #5: /usr/bin/python() [0x4d7cc6] frame #6: _PyObject_GC_New + 0x419 (0x552d89 in /usr/bin/python) frame #7: PyTraceBack_Here + 0x1d1 (0x5566b1 in /usr/bin/python) frame #8: _PyEval_EvalFrameDefault + 0x3de8 (0x57c5a8 in /usr/bin/python) frame #9: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #10: _PyEval_EvalFrameDefault + 0x619 (0x578dd9 in /usr/bin/python) frame #11: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #12: _PyFunction_Vectorcall + 0x247 (0x602bd7 in /usr/bin/python) frame #13: _PyEval_EvalFrameDefault + 0x619 (0x578dd9 in /usr/bin/python) frame #14: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #15: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #16: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #17: /usr/bin/python() [0x4ffa96] frame #18: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #19: _PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #20: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #21: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #22: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #23: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #24: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #25: /usr/bin/python() [0x4ffa96] frame #26: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #27: /usr/bin/python() [0x645e55] frame #28: /usr/bin/python() [0x65f7f4] frame #29: + 0x76db (0x7f1e118b56db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #30: clone + 0x3f (0x7f1e11bee88f in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f51310c08b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f5131312ef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f51310abb7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7f517faa3902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: /usr/bin/python() [0x5b43fa] frame #5: /usr/bin/python() [0x4d7cc6] frame #6: _PyObject_GC_New + 0x419 (0x552d89 in /usr/bin/python) frame #7: /usr/bin/python() [0x5b54bf] frame #8: PyObject_GetIter + 0x13 (0x507183 in /usr/bin/python) frame #9: _PyEval_EvalFrameDefault + 0x14fe (0x579cbe in /usr/bin/python) frame #10: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #11: _PyFunction_Vectorcall + 0x247 (0x602bd7 in /usr/bin/python) frame #12: /usr/bin/python() [0x600500] frame #13: _PyObject_CallMethodIdObjArgs + 0xee (0x600dae in /usr/bin/python) frame #14: PyImport_ImportModuleLevelObject + 0x382 (0x565002 in /usr/bin/python) frame #15: _PyEval_EvalFrameDefault + 0x2afd (0x57b2bd in /usr/bin/python) frame #16: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #17: /usr/bin/python() [0x600500] frame #18: PyObject_CallFunctionObjArgs + 0x8e (0x6007ee in /usr/bin/python) frame #19: /usr/bin/python() [0x53cb41] frame #20: /usr/bin/python() [0x5431bc] frame #21: /usr/bin/python() [0x541d1c] frame #22: /usr/bin/python() [0x540828] frame #23: /usr/bin/python() [0x542579] frame #24: /usr/bin/python() [0x542f79] frame #25: /usr/bin/python() [0x542fd1] frame #26: 
/usr/bin/python() [0x541d1c] frame #27: /usr/bin/python() [0x543926] frame #28: /usr/bin/python() [0x64f68b] frame #29: /usr/bin/python() [0x4fb1ff] frame #30: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #31: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #32: _PyFunction_Vectorcall + 0x442 (0x602dd2 in /usr/bin/python) frame #33: /usr/bin/python() [0x4ff9e6] frame #34: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #35: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #36: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #37: _PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #38: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #39: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #40: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #41: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #42: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #43: /usr/bin/python() [0x4ffa96] frame #44: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #45: /usr/bin/python() [0x645e55] frame #46: /usr/bin/python() [0x65f7f4] frame #47: + 0x76db (0x7f5183dfd6db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #48: clone + 0x3f (0x7f518413688f in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f1dbeb788b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f1dbedcaef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f1dbeb63b7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7f1e0d55b902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: /usr/bin/python() [0x5b43fa] frame #5: /usr/bin/python() [0x4d7cc6] frame #6: _PyObject_GC_New + 0x419 (0x552d89 in /usr/bin/python) frame #7: /usr/bin/python() [0x5af6df] frame #8: /usr/bin/python() [0x5b1172] frame #9: _PyEval_EvalFrameDefault + 0x480 (0x578c40 in /usr/bin/python) frame #10: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #11: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #12: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #13: _PyFunction_Vectorcall + 0x442 (0x602dd2 in /usr/bin/python) frame #14: _PyObject_FastCallDict + 0x4a (0x60261a in /usr/bin/python) frame #15: /usr/bin/python() [0x5b034b] frame #16: _PyObject_MakeTpCall + 0x28f (0x5fff6f in /usr/bin/python) frame #17: _PyEval_EvalFrameDefault + 0x5553 (0x57dd13 in /usr/bin/python) frame #18: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #19: _PyEval_EvalFrameDefault + 0x619 (0x578dd9 in /usr/bin/python) frame #20: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #21: _PyFunction_Vectorcall + 0x247 (0x602bd7 in /usr/bin/python) frame #22: _PyEval_EvalFrameDefault + 0x619 (0x578dd9 in /usr/bin/python) frame #23: 
_PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #24: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #25: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #26: /usr/bin/python() [0x4ffa96] frame #27: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #28: _PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #29: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #30: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #31: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #32: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #33: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #34: /usr/bin/python() [0x4ffa96] frame #35: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #36: /usr/bin/python() [0x645e55] frame #37: /usr/bin/python() [0x65f7f4] frame #38: + 0x76db (0x7f1e118b56db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #39: clone + 0x3f (0x7f1e11bee88f in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f51310c08b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f5131312ef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f51310abb7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7f517faa3902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: /usr/bin/python() [0x5b43fa] frame #5: /usr/bin/python() [0x4d7cc6] frame #6: /usr/bin/python() [0x55331c] frame #7: PyTuple_New + 0xe1 (0x5b44f1 in /usr/bin/python) frame #8: _PyEval_EvalFrameDefault + 0xfd1 (0x579791 in /usr/bin/python) frame #9: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #10: _PyFunction_Vectorcall + 0x247 (0x602bd7 in /usr/bin/python) frame #11: /usr/bin/python() [0x5b0529] frame #12: _PyObject_MakeTpCall + 0x1ed (0x5ffecd in /usr/bin/python) frame #13: _PyEval_EvalFrameDefault + 0x5b9e (0x57e35e in /usr/bin/python) frame #14: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #15: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #16: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #17: /usr/bin/python() [0x600500] frame #18: PyObject_CallFunctionObjArgs + 0x8e (0x6007ee in /usr/bin/python) frame #19: /usr/bin/python() [0x53cb41] frame #20: /usr/bin/python() [0x5431bc] frame #21: /usr/bin/python() [0x541d1c] frame #22: /usr/bin/python() [0x540828] frame #23: /usr/bin/python() [0x542579] frame #24: /usr/bin/python() [0x542f79] frame #25: /usr/bin/python() [0x541d1c] frame #26: /usr/bin/python() [0x543926] 
frame #27: /usr/bin/python() [0x64f68b] frame #28: /usr/bin/python() [0x4fb1ff] frame #29: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #30: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #31: _PyFunction_Vectorcall + 0x442 (0x602dd2 in /usr/bin/python) frame #32: /usr/bin/python() [0x4ff9e6] frame #33: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #34: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #35: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #36: _PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #37: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #38: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #39: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #40: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #41: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #42: /usr/bin/python() [0x4ffa96] frame #43: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #44: /usr/bin/python() [0x645e55] frame #45: /usr/bin/python() [0x65f7f4] frame #46: + 0x76db (0x7f5183dfd6db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #47: clone + 0x3f (0x7f518413688f in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f14d243c8b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f14d268eef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f14d2427b7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7f1520e1f902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: /usr/bin/python() [0x5b43fa] frame #5: /usr/bin/python() [0x4d7cc6] frame #6: /usr/bin/python() [0x55331c] frame #7: _PyObject_MakeTpCall + 0x411 (0x6000f1 in /usr/bin/python) frame #8: _PyEval_EvalFrameDefault + 0x5553 (0x57dd13 in /usr/bin/python) frame #9: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #10: /usr/bin/python() [0x600500] frame #11: PyObject_CallFunctionObjArgs + 0x8e (0x6007ee in /usr/bin/python) frame #12: /usr/bin/python() [0x53cb41] frame #13: /usr/bin/python() [0x5431bc] frame #14: /usr/bin/python() [0x542f79] frame #15: /usr/bin/python() [0x542fd1] frame #16: /usr/bin/python() [0x541d1c] frame #17: /usr/bin/python() [0x543926] frame #18: /usr/bin/python() [0x64f68b] frame #19: /usr/bin/python() [0x4fb1ff] frame #20: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #21: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #22: _PyFunction_Vectorcall + 0x442 (0x602dd2 in /usr/bin/python) frame #23: /usr/bin/python() [0x4ff9e6] frame #24: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #25: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #26: PyVectorcall_Call + 0x51 (0x5ff3b1 in 
/usr/bin/python) frame #27: _PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #28: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #29: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #30: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #31: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #32: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #33: /usr/bin/python() [0x4ffa96] frame #34: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #35: /usr/bin/python() [0x645e55] frame #36: /usr/bin/python() [0x65f7f4] frame #37: + 0x76db (0x7f15251796db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #38: clone + 0x3f (0x7f15254b288f in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f14d243c8b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f14d268eef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f14d2427b7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7f1520e1f902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: /usr/bin/python() [0x5b43fa] frame #5: /usr/bin/python() [0x4d7cc6] frame #6: /usr/bin/python() [0x55331c] frame #7: PyTuple_New + 0xe1 (0x5b44f1 in /usr/bin/python) frame #8: _PyEval_EvalFrameDefault + 0xfd1 (0x579791 in /usr/bin/python) frame #9: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #10: _PyFunction_Vectorcall + 0x247 (0x602bd7 in /usr/bin/python) frame #11: /usr/bin/python() [0x5b0529] frame #12: _PyObject_MakeTpCall + 0x1ed (0x5ffecd in /usr/bin/python) frame #13: _PyEval_EvalFrameDefault + 0x5b9e (0x57e35e in /usr/bin/python) frame #14: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #15: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #16: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #17: /usr/bin/python() [0x600500] frame #18: PyObject_CallFunctionObjArgs + 0x8e (0x6007ee in /usr/bin/python) frame #19: /usr/bin/python() [0x53cb41] frame #20: /usr/bin/python() [0x5431bc] frame #21: /usr/bin/python() [0x541d1c] frame #22: /usr/bin/python() [0x540828] frame #23: /usr/bin/python() [0x542579] frame #24: /usr/bin/python() [0x542f79] frame #25: /usr/bin/python() [0x541d1c] frame #26: /usr/bin/python() [0x543926] 
frame #27: /usr/bin/python() [0x64f68b] frame #28: /usr/bin/python() [0x4fb1ff] frame #29: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #30: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #31: _PyFunction_Vectorcall + 0x442 (0x602dd2 in /usr/bin/python) frame #32: /usr/bin/python() [0x4ff9e6] frame #33: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #34: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #35: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #36: _PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #37: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #38: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #39: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #40: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #41: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #42: /usr/bin/python() [0x4ffa96] frame #43: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #44: /usr/bin/python() [0x645e55] frame #45: /usr/bin/python() [0x65f7f4] frame #46: + 0x76db (0x7f15251796db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #47: clone + 0x3f (0x7f15254b288f in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f1dbeb788b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f1dbedcaef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f1dbeb63b7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7f1e0d55b902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: /usr/bin/python() [0x5b43fa] frame #5: /usr/bin/python() [0x4d7cc6] frame #6: /usr/bin/python() [0x55331c] frame #7: _PyObject_MakeTpCall + 0x411 (0x6000f1 in /usr/bin/python) frame #8: _PyEval_EvalFrameDefault + 0x5553 (0x57dd13 in /usr/bin/python) frame #9: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #10: /usr/bin/python() [0x600500] frame #11: PyObject_CallFunctionObjArgs + 0x8e (0x6007ee in /usr/bin/python) frame #12: /usr/bin/python() [0x53cb41] frame #13: /usr/bin/python() [0x5431bc] frame #14: /usr/bin/python() [0x543025] frame #15: /usr/bin/python() [0x541d1c] frame #16: /usr/bin/python() [0x543926] frame #17: /usr/bin/python() [0x64f68b] frame #18: /usr/bin/python() [0x4fb1ff] frame #19: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #20: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #21: _PyFunction_Vectorcall + 0x442 (0x602dd2 in /usr/bin/python) frame #22: /usr/bin/python() [0x4ff9e6] frame #23: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #24: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #25: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #26: 
_PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #27: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #28: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #29: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #30: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #31: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #32: /usr/bin/python() [0x4ffa96] frame #33: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #34: /usr/bin/python() [0x645e55] frame #35: /usr/bin/python() [0x65f7f4] frame #36: + 0x76db (0x7f1e118b56db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #37: clone + 0x3f (0x7f1e11bee88f in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f615ebe08b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f615ee32ef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f615ebcbb7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7f61ad5c3902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: /usr/bin/python() [0x5b43fa] frame #5: /usr/bin/python() [0x4d7cc6] frame #6: /usr/bin/python() [0x55331c] frame #7: _PyEval_EvalCodeWithName + 0x115e (0x5774ee in /usr/bin/python) frame #8: _PyFunction_Vectorcall + 0x247 (0x602bd7 in /usr/bin/python) frame #9: _PyEval_EvalFrameDefault + 0x619 (0x578dd9 in /usr/bin/python) frame #10: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #11: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #12: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #13: /usr/bin/python() [0x4ffa96] frame #14: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #15: _PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #16: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #17: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #18: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #19: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #20: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #21: /usr/bin/python() [0x4ffa96] frame #22: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #23: /usr/bin/python() 
[0x645e55] frame #24: /usr/bin/python() [0x65f7f4] frame #25: + 0x76db (0x7f61b191d6db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #26: clone + 0x3f (0x7f61b1c5688f in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fdf073378b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7fdf07589ef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fdf07322b7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7fdf55d1a902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: /usr/bin/python() [0x5b43fa] frame #5: /usr/bin/python() [0x4d7cc6] frame #6: _PyObject_GC_New + 0x419 (0x552d89 in /usr/bin/python) frame #7: PyTraceBack_Here + 0x1d1 (0x5566b1 in /usr/bin/python) frame #8: _PyEval_EvalFrameDefault + 0x3de8 (0x57c5a8 in /usr/bin/python) frame #9: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #10: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #11: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #12: _PyFunction_Vectorcall + 0x442 (0x602dd2 in /usr/bin/python) frame #13: _PyObject_FastCallDict + 0x4a (0x60261a in /usr/bin/python) frame #14: /usr/bin/python() [0x5b034b] frame #15: _PyObject_MakeTpCall + 0x28f (0x5fff6f in /usr/bin/python) frame #16: _PyEval_EvalFrameDefault + 0x5553 (0x57dd13 in /usr/bin/python) frame #17: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #18: _PyEval_EvalFrameDefault + 0x619 (0x578dd9 in /usr/bin/python) frame #19: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #20: _PyFunction_Vectorcall + 0x247 (0x602bd7 in /usr/bin/python) frame #21: _PyEval_EvalFrameDefault + 0x619 (0x578dd9 in /usr/bin/python) frame #22: _PyFunction_Vectorcall + 
0x19c (0x602b2c in /usr/bin/python) frame #23: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #24: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #25: /usr/bin/python() [0x4ffa96] frame #26: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #27: _PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #28: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #29: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #30: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #31: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #32: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #33: /usr/bin/python() [0x4ffa96] frame #34: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #35: /usr/bin/python() [0x645e55] frame #36: /usr/bin/python() [0x65f7f4] frame #37: + 0x76db (0x7fdf5a0746db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #38: clone + 0x3f (0x7fdf5a3ad88f in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f615ebe08b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f615ee32ef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f615ebcbb7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7f61ad5c3902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: /usr/bin/python() [0x5b43fa] frame #5: /usr/bin/python() [0x4d7cc6] frame #6: /usr/bin/python() [0x55331c] frame #7: PyTuple_New + 0xe1 (0x5b44f1 in /usr/bin/python) frame #8: _PyEval_EvalFrameDefault + 0xfd1 (0x579791 in /usr/bin/python) frame #9: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #10: _PyFunction_Vectorcall + 0x247 (0x602bd7 in /usr/bin/python) frame #11: /usr/bin/python() [0x5b0529] frame #12: _PyObject_MakeTpCall + 0x1ed (0x5ffecd in /usr/bin/python) frame #13: _PyEval_EvalFrameDefault + 0x5b9e (0x57e35e in /usr/bin/python) frame #14: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #15: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #16: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #17: /usr/bin/python() [0x600500] frame #18: PyObject_CallFunctionObjArgs + 0x8e (0x6007ee in /usr/bin/python) frame #19: /usr/bin/python() [0x53cb41] frame #20: /usr/bin/python() [0x5431bc] frame #21: /usr/bin/python() [0x541d1c] frame #22: /usr/bin/python() [0x540828] frame #23: /usr/bin/python() [0x542579] frame #24: /usr/bin/python() [0x542f79] frame #25: /usr/bin/python() [0x541d1c] frame #26: /usr/bin/python() [0x543926] 
frame #27: /usr/bin/python() [0x64f68b] frame #28: /usr/bin/python() [0x4fb1ff] frame #29: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #30: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #31: _PyFunction_Vectorcall + 0x442 (0x602dd2 in /usr/bin/python) frame #32: /usr/bin/python() [0x4ff9e6] frame #33: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #34: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #35: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #36: _PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #37: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #38: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #39: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #40: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #41: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #42: /usr/bin/python() [0x4ffa96] frame #43: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #44: /usr/bin/python() [0x645e55] frame #45: /usr/bin/python() [0x65f7f4] frame #46: + 0x76db (0x7f61b191d6db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #47: clone + 0x3f (0x7f61b1c5688f in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fdf073378b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7fdf07589ef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fdf07322b7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7fdf55d1a902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: /usr/bin/python() [0x5b43fa] frame #5: /usr/bin/python() [0x4d7cc6] frame #6: PyType_GenericAlloc + 0x4f5 (0x5b64d5 in /usr/bin/python) frame #7: THPSize_NewFromSizes(int, long const*) + 0x23 (0x7fdf55c1c773 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #8: THPSize_New(at::Tensor const&) + 0x161 (0x7fdf55c1caa1 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #9: + 0x29ae98 (0x7fdf559b7e98 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #10: /usr/bin/python() [0x4fcdc2] frame #11: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #12: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #13: /usr/bin/python() [0x600500] frame #14: PyObject_CallFunctionObjArgs + 0x8e (0x6007ee in /usr/bin/python) frame #15: /usr/bin/python() [0x53cb41] frame #16: /usr/bin/python() [0x5431bc] frame #17: /usr/bin/python() [0x543025] frame #18: /usr/bin/python() [0x541d1c] frame #19: /usr/bin/python() [0x543926] frame #20: /usr/bin/python() [0x64f68b] frame #21: /usr/bin/python() [0x4fb1ff] frame #22: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #23: _PyEval_EvalCodeWithName + 
0x25c (0x5765ec in /usr/bin/python) frame #24: _PyFunction_Vectorcall + 0x442 (0x602dd2 in /usr/bin/python) frame #25: /usr/bin/python() [0x4ff9e6] frame #26: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #27: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #28: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #29: _PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #30: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #31: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #32: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #33: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #34: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #35: /usr/bin/python() [0x4ffa96] frame #36: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #37: /usr/bin/python() [0x645e55] frame #38: /usr/bin/python() [0x65f7f4] frame #39: + 0x76db (0x7fdf5a0746db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #40: clone + 0x3f (0x7fdf5a3ad88f in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f51310c08b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f5131312ef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f51310abb7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7f517faa3902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: /usr/bin/python() [0x5b43fa] frame #5: /usr/bin/python() [0x4d7cc6] frame #6: PyType_GenericAlloc + 0x4f5 (0x5b64d5 in /usr/bin/python) frame #7: /usr/bin/python() [0x5fad71] frame #8: /usr/bin/python() [0x5b2df5] frame #9: PyObject_Call + 0x5d (0x5ffafd in /usr/bin/python) frame #10: _PyErr_NormalizeException + 0xc5 (0x56a125 in /usr/bin/python) frame #11: _PyEval_EvalFrameDefault + 0x5f52 (0x57e712 in /usr/bin/python) frame #12: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #13: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #14: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #15: _PyFunction_Vectorcall + 0x442 (0x602dd2 in /usr/bin/python) frame #16: _PyObject_FastCallDict + 0x4a (0x60261a in /usr/bin/python) frame #17: /usr/bin/python() [0x5b034b] frame #18: _PyObject_MakeTpCall + 0x28f (0x5fff6f in /usr/bin/python) frame #19: _PyEval_EvalFrameDefault + 0x5553 (0x57dd13 in /usr/bin/python) frame #20: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #21: _PyEval_EvalFrameDefault + 0x619 (0x578dd9 in /usr/bin/python) frame #22: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #23: 
_PyFunction_Vectorcall + 0x247 (0x602bd7 in /usr/bin/python) frame #24: _PyEval_EvalFrameDefault + 0x619 (0x578dd9 in /usr/bin/python) frame #25: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #26: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #27: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #28: /usr/bin/python() [0x4ffa96] frame #29: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #30: _PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #31: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #32: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #33: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #34: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #35: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #36: /usr/bin/python() [0x4ffa96] frame #37: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #38: /usr/bin/python() [0x645e55] frame #39: /usr/bin/python() [0x65f7f4] frame #40: + 0x76db (0x7f5183dfd6db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #41: clone + 0x3f (0x7f518413688f in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f14d243c8b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f14d268eef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f14d2427b7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7f1520e1f902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: /usr/bin/python() [0x5b43fa] frame #5: /usr/bin/python() [0x4d7cc6] frame #6: /usr/bin/python() [0x55331c] frame #7: _PyEval_EvalCodeWithName + 0x115e (0x5774ee in /usr/bin/python) frame #8: _PyFunction_Vectorcall + 0x442 (0x602dd2 in /usr/bin/python) frame #9: _PyObject_FastCallDict + 0x4a (0x60261a in /usr/bin/python) frame #10: /usr/bin/python() [0x5b034b] frame #11: _PyObject_MakeTpCall + 0x28f (0x5fff6f in /usr/bin/python) frame #12: _PyEval_EvalFrameDefault + 0x5553 (0x57dd13 in /usr/bin/python) frame #13: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #14: _PyEval_EvalFrameDefault + 0x619 (0x578dd9 in /usr/bin/python) frame #15: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #16: _PyFunction_Vectorcall + 0x247 (0x602bd7 in /usr/bin/python) frame #17: _PyEval_EvalFrameDefault + 0x619 (0x578dd9 in /usr/bin/python) frame #18: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #19: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #20: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #21: /usr/bin/python() [0x4ffa96] frame #22: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #23: 
_PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #24: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #25: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #26: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #27: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #28: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #29: /usr/bin/python() [0x4ffa96] frame #30: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #31: /usr/bin/python() [0x645e55] frame #32: /usr/bin/python() [0x65f7f4] frame #33: + 0x76db (0x7f15251796db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #34: clone + 0x3f (0x7f15254b288f in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fdf073378b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7fdf07589ef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fdf07322b7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7fdf55d1a902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: /usr/bin/python() [0x5b43fa] frame #5: /usr/bin/python() [0x4d7cc6] frame #6: PyType_GenericAlloc + 0x4f5 (0x5b64d5 in /usr/bin/python) frame #7: THPSize_NewFromSizes(int, long const*) + 0x23 (0x7fdf55c1c773 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #8: THPSize_New(at::Tensor const&) + 0x161 (0x7fdf55c1caa1 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #9: + 0x29ae98 (0x7fdf559b7e98 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #10: /usr/bin/python() [0x4fcdc2] frame #11: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #12: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #13: /usr/bin/python() [0x600500] frame #14: PyObject_CallFunctionObjArgs + 0x8e (0x6007ee in /usr/bin/python) frame #15: /usr/bin/python() [0x53cb41] frame #16: /usr/bin/python() [0x5431bc] frame #17: /usr/bin/python() [0x543025] frame #18: /usr/bin/python() [0x541d1c] frame #19: /usr/bin/python() [0x543926] frame #20: /usr/bin/python() [0x64f68b] frame #21: /usr/bin/python() [0x4fb1ff] frame #22: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #23: _PyEval_EvalCodeWithName + 
0x25c (0x5765ec in /usr/bin/python) frame #24: _PyFunction_Vectorcall + 0x442 (0x602dd2 in /usr/bin/python) frame #25: /usr/bin/python() [0x4ff9e6] frame #26: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #27: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #28: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #29: _PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #30: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #31: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #32: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #33: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #34: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #35: /usr/bin/python() [0x4ffa96] frame #36: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #37: /usr/bin/python() [0x645e55] frame #38: /usr/bin/python() [0x65f7f4] frame #39: + 0x76db (0x7fdf5a0746db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #40: clone + 0x3f (0x7fdf5a3ad88f in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f51c72b28b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f51c7504ef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f51c729db7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7f5215c95902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: + 0x5fd9b6 (0x7f5215c959b6 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #5: /usr/bin/python() [0x5b3a21] frame #6: PyDict_Clear + 0xef (0x5cfa9f in /usr/bin/python) frame #7: /usr/bin/python() [0x43566c] frame #8: /usr/bin/python() [0x4d7cc6] frame #9: /usr/bin/python() [0x55331c] frame #10: PyTuple_New + 0xe1 (0x5b44f1 in /usr/bin/python) frame #11: + 0x299239 (0x7f5215931239 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #12: /usr/bin/python() [0x4fcdc2] frame #13: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #14: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #15: /usr/bin/python() [0x600500] frame #16: PyObject_CallFunctionObjArgs + 0x8e (0x6007ee in /usr/bin/python) frame #17: /usr/bin/python() [0x53cb41] frame #18: /usr/bin/python() [0x5431bc] frame #19: /usr/bin/python() [0x543025] frame #20: /usr/bin/python() [0x541d1c] frame #21: /usr/bin/python() [0x543926] frame #22: /usr/bin/python() [0x64f68b] frame #23: /usr/bin/python() [0x4fb1ff] frame #24: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #25: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) 
frame #26: _PyFunction_Vectorcall + 0x442 (0x602dd2 in /usr/bin/python) frame #27: /usr/bin/python() [0x4ff9e6] frame #28: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #29: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #30: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #31: _PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #32: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #33: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #34: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #35: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #36: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #37: /usr/bin/python() [0x4ffa96] frame #38: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #39: /usr/bin/python() [0x645e55] frame #40: /usr/bin/python() [0x65f7f4] frame #41: + 0x76db (0x7f5219fef6db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #42: clone + 0x3f (0x7f521a32888f in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f1dbeb788b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f1dbedcaef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f1dbeb63b7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7f1e0d55b902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: /usr/bin/python() [0x5b43fa] frame #5: /usr/bin/python() [0x4d7cc6] frame #6: /usr/bin/python() [0x55331c] frame #7: PyStructSequence_New + 0x5a (0x5c415a in /usr/bin/python) frame #8: /usr/bin/python() [0x51d33d] frame #9: /usr/bin/python() [0x632ef0] frame #10: /usr/bin/python() [0x5d1dc3] frame #11: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #12: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #13: _PyEval_EvalFrameDefault + 0x619 (0x578dd9 in /usr/bin/python) frame #14: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #15: /usr/bin/python() [0x600500] frame #16: PyObject_CallFunctionObjArgs + 0x8e (0x6007ee in /usr/bin/python) frame #17: /usr/bin/python() [0x53cb41] frame #18: /usr/bin/python() [0x5431bc] frame #19: /usr/bin/python() [0x541d1c] frame #20: /usr/bin/python() [0x540828] frame #21: /usr/bin/python() [0x542579] frame #22: /usr/bin/python() [0x542fd1] frame #23: /usr/bin/python() [0x542fd1] frame #24: /usr/bin/python() [0x541d1c] frame #25: /usr/bin/python() [0x543926] frame #26: /usr/bin/python() [0x64f68b] frame #27: /usr/bin/python() [0x4fb1ff] frame #28: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame 
#29: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #30: _PyFunction_Vectorcall + 0x442 (0x602dd2 in /usr/bin/python) frame #31: /usr/bin/python() [0x4ff9e6] frame #32: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #33: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #34: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #35: _PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #36: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #37: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #38: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #39: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #40: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #41: /usr/bin/python() [0x4ffa96] frame #42: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #43: /usr/bin/python() [0x645e55] frame #44: /usr/bin/python() [0x65f7f4] frame #45: + 0x76db (0x7f1e118b56db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #46: clone + 0x3f (0x7f1e11bee88f in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f51c72b28b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f51c7504ef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f51c729db7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7f5215c95902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: + 0x5fd9b6 (0x7f5215c959b6 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #5: /usr/bin/python() [0x5b3a21] frame #6: PyDict_Clear + 0xef (0x5cfa9f in /usr/bin/python) frame #7: /usr/bin/python() [0x43566c] frame #8: /usr/bin/python() [0x4d7cc6] frame #9: /usr/bin/python() [0x55331c] frame #10: PyTuple_New + 0xe1 (0x5b44f1 in /usr/bin/python) frame #11: _PyEval_EvalFrameDefault + 0xfd1 (0x579791 in /usr/bin/python) frame #12: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #13: _PyFunction_Vectorcall + 0x247 (0x602bd7 in /usr/bin/python) frame #14: /usr/bin/python() [0x5b0529] frame #15: _PyObject_MakeTpCall + 0x1ed (0x5ffecd in /usr/bin/python) frame #16: _PyEval_EvalFrameDefault + 0x5b9e (0x57e35e in /usr/bin/python) frame #17: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #18: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #19: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #20: /usr/bin/python() [0x600500] frame #21: PyObject_CallFunctionObjArgs + 0x8e (0x6007ee in /usr/bin/python) frame #22: /usr/bin/python() [0x53cb41] frame #23: /usr/bin/python() [0x5431bc] frame #24: 
/usr/bin/python() [0x541d1c] frame #25: /usr/bin/python() [0x540828] frame #26: /usr/bin/python() [0x542579] frame #27: /usr/bin/python() [0x542f79] frame #28: /usr/bin/python() [0x541d1c] frame #29: /usr/bin/python() [0x543926] frame #30: /usr/bin/python() [0x64f68b] frame #31: /usr/bin/python() [0x4fb1ff] frame #32: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #33: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #34: _PyFunction_Vectorcall + 0x442 (0x602dd2 in /usr/bin/python) frame #35: /usr/bin/python() [0x4ff9e6] frame #36: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #37: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #38: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #39: _PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #40: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #41: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #42: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #43: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #44: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #45: /usr/bin/python() [0x4ffa96] frame #46: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #47: /usr/bin/python() [0x645e55] frame #48: /usr/bin/python() [0x65f7f4] frame #49: + 0x76db (0x7f5219fef6db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #50: clone + 0x3f (0x7f521a32888f in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f615ebe08b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f615ee32ef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f615ebcbb7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7f61ad5c3902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: /usr/bin/python() [0x5b43fa] frame #5: /usr/bin/python() [0x4d7cc6] frame #6: PyType_GenericAlloc + 0x4f5 (0x5b64d5 in /usr/bin/python) frame #7: _PyObject_MakeTpCall + 0x170 (0x5ffe50 in /usr/bin/python) frame #8: _PyEval_EvalFrameDefault + 0x5553 (0x57dd13 in /usr/bin/python) frame #9: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #10: _PyFunction_Vectorcall + 0x247 (0x602bd7 in /usr/bin/python) frame #11: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #12: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #13: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #14: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #15: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #16: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #17: /usr/bin/python() [0x4ffa96] frame #18: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #19: _PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #20: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #21: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #22: _PyFunction_Vectorcall + 
0x19c (0x602b2c in /usr/bin/python) frame #23: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #24: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #25: /usr/bin/python() [0x4ffa96] frame #26: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #27: /usr/bin/python() [0x645e55] frame #28: /usr/bin/python() [0x65f7f4] frame #29: + 0x76db (0x7f61b191d6db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #30: clone + 0x3f (0x7f61b1c5688f in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: initialization error Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f615ebe08b2 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f615ee32ef0 in /home/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f615ebcbb7d in /home/.local/lib/python3.8/site-packages/torch/lib/libc10.so) frame #3: + 0x5fd902 (0x7f61ad5c3902 in /home/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so) frame #4: /usr/bin/python() [0x5b43fa] frame #5: /usr/bin/python() [0x4d7cc6] frame #6: _PyObject_GC_New + 0x419 (0x552d89 in /usr/bin/python) frame #7: /usr/bin/python() [0x5da528] frame #8: /usr/bin/python() [0x4fb52d] frame #9: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #10: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #11: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #12: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #13: /usr/bin/python() [0x5b9fcd] frame #14: _PyEval_EvalFrameDefault + 0x146b (0x579c2b in /usr/bin/python) frame #15: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #16: /usr/bin/python() [0x600500] frame #17: PyObject_CallFunctionObjArgs + 0x8e (0x6007ee in /usr/bin/python) frame #18: /usr/bin/python() [0x53cb41] frame #19: /usr/bin/python() [0x5431bc] frame #20: /usr/bin/python() [0x541d1c] frame #21: /usr/bin/python() [0x540828] frame #22: /usr/bin/python() [0x542579] frame #23: /usr/bin/python() [0x542f79] frame #24: /usr/bin/python() [0x541d1c] frame #25: /usr/bin/python() [0x543926] frame #26: /usr/bin/python() [0x64f68b] frame #27: /usr/bin/python() [0x4fb1ff] frame #28: 
_PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #29: _PyEval_EvalCodeWithName + 0x25c (0x5765ec in /usr/bin/python) frame #30: _PyFunction_Vectorcall + 0x442 (0x602dd2 in /usr/bin/python) frame #31: /usr/bin/python() [0x4ff9e6] frame #32: _PyEval_EvalFrameDefault + 0x53f0 (0x57dbb0 in /usr/bin/python) frame #33: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #34: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #35: _PyEval_EvalFrameDefault + 0x1c4a (0x57a40a in /usr/bin/python) frame #36: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #37: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #38: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #39: _PyEval_EvalFrameDefault + 0x88d (0x57904d in /usr/bin/python) frame #40: _PyFunction_Vectorcall + 0x19c (0x602b2c in /usr/bin/python) frame #41: /usr/bin/python() [0x4ffa96] frame #42: PyVectorcall_Call + 0x51 (0x5ff3b1 in /usr/bin/python) frame #43: /usr/bin/python() [0x645e55] frame #44: /usr/bin/python() [0x65f7f4] frame #45: + 0x76db (0x7f61b191d6db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #46: clone + 0x3f (0x7f61b1c5688f in /lib/x86_64-linux-gnu/libc.so.6)

Exception in thread Thread-4: Traceback (most recent call last): File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner self.run() File "/usr/lib/python3.8/threading.py", line 870, in run self._target(*self._args, self._kwargs) File "/home/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 25, in _pin_memory_loop r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL) File "/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get return _ForkingPickler.loads(res) File "/home/.local/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd fd = df.detach() File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 57, in detach with _resource_sharer.get_connection(self._id) as conn: File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 87, in get_connection c = Client(address, authkey=process.current_process().authkey) File "/usr/lib/python3.8/multiprocessing/connection.py", line 508, in Client answer_challenge(c, authkey) File "/usr/lib/python3.8/multiprocessing/connection.py", line 751, in answer_challenge message = connection.recv_bytes(256) # reject large message File "/usr/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes buf = self._recv_bytes(maxlength) File "/usr/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes buf = self._recv(4) File "/usr/lib/python3.8/multiprocessing/connection.py", line 379, in _recv chunk = read(handle, remaining) ConnectionResetError: [Errno 104] Connection reset by peer Exception in thread Thread-4: Traceback (most recent call last): File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner self.run() File "/usr/lib/python3.8/threading.py", line 870, in run self._target(*self._args, *self._kwargs) File "/home/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 25, in _pin_memory_loop r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL) File 
"/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get return _ForkingPickler.loads(res) File "/home/.local/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd fd = df.detach() File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 57, in detach with _resource_sharer.get_connection(self._id) as conn: File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 87, in get_connection c = Client(address, authkey=process.current_process().authkey) File "/usr/lib/python3.8/multiprocessing/connection.py", line 508, in Client answer_challenge(c, authkey) File "/usr/lib/python3.8/multiprocessing/connection.py", line 751, in answer_challenge message = connection.recv_bytes(256) # reject large message File "/usr/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes buf = self._recv_bytes(maxlength) File "/usr/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes buf = self._recv(4) File "/usr/lib/python3.8/multiprocessing/connection.py", line 383, in _recv raise EOFError EOFError Exception in thread Thread-4: Traceback (most recent call last): File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner self.run() File "/usr/lib/python3.8/threading.py", line 870, in run self._target(self._args, self._kwargs) File "/home/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 25, in _pin_memory_loop r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL) File "/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get return _ForkingPickler.loads(res) File "/home/.local/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd fd = df.detach() File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 57, in detach with _resource_sharer.get_connection(self._id) as conn: File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 87, in get_connection c = Client(address, 
authkey=process.current_process().authkey) File "/usr/lib/python3.8/multiprocessing/connection.py", line 508, in Client answer_challenge(c, authkey) File "/usr/lib/python3.8/multiprocessing/connection.py", line 751, in answer_challenge message = connection.recv_bytes(256) # reject large message File "/usr/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes buf = self._recv_bytes(maxlength) File "/usr/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes buf = self._recv(4) File "/usr/lib/python3.8/multiprocessing/connection.py", line 383, in _recv raise EOFError EOFError Exception in thread Thread-4: Traceback (most recent call last): File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner self.run() File "/usr/lib/python3.8/threading.py", line 870, in run Exception in thread self._target(*self._args, *self._kwargs) File "/home/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 25, in _pin_memory_loop Thread-4: Traceback (most recent call last): File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL) File "/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get return _ForkingPickler.loads(res) File "/home/.local/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd fd = df.detach() File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 57, in detach with _resource_sharer.get_connection(self._id) as conn: File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 87, in get_connection c = Client(address, authkey=process.current_process().authkey) File "/usr/lib/python3.8/multiprocessing/connection.py", line 508, in Client self.run() File "/usr/lib/python3.8/threading.py", line 870, in run answer_challenge(c, authkey) File "/usr/lib/python3.8/multiprocessing/connection.py", line 751, in answer_challenge self._target(self._args, **self._kwargs) File 
"/home/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 25, in _pin_memory_loop message = connection.recv_bytes(256) # reject large message File "/usr/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL) File "/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get buf = self._recv_bytes(maxlength) File "/usr/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes return _ForkingPickler.loads(res) File "/home/.local/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd buf = self._recv(4) File "/usr/lib/python3.8/multiprocessing/connection.py", line 383, in _recv fd = df.detach() File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 57, in detach raise EOFError EOFError with _resource_sharer.get_connection(self._id) as conn: File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 87, in get_connection c = Client(address, authkey=process.current_process().authkey) File "/usr/lib/python3.8/multiprocessing/connection.py", line 508, in Client answer_challenge(c, authkey) File "/usr/lib/python3.8/multiprocessing/connection.py", line 756, in answer_challenge response = connection.recv_bytes(256) # reject large message File "/usr/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes buf = self._recv_bytes(maxlength) File "/usr/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes buf = self._recv(4) File "/usr/lib/python3.8/multiprocessing/connection.py", line 379, in _recv chunk = read(handle, remaining) ConnectionResetError: [Errno 104] Connection reset by peer 2021-07-01 12:25:48 | INFO | fairseq.trainer | Finished saving checkpoint to /path/checkpoint_wav2vec/finetune_libri100h/checkpoint_last.pt 2021-07-01 12:25:48 | INFO | fairseq.checkpoint_utils | Saved checkpoint /path/checkpoint_wav2vec/finetune_libri100h/checkpoint_last.pt (epoch 1 @ 205 
updates, score None) (writing took 5.110959745943546 seconds) 2021-07-01 12:25:48 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below) 2021-07-01 12:25:48 | INFO | train | {"epoch": 1, "train_loss": "1945.22", "train_ntokens": "25226.1", "train_nsentences": "135.902", "train_nll_loss": "10.48", "train_wps": "41139.2", "train_ups": "1.63", "train_wpb": "25226.1", "train_bsz": "135.9", "train_num_updates": "205", "train_lr": "1.06106e-06", "train_gnorm": "2389.16", "train_loss_scale": "4", "train_train_wall": "123", "train_gb_free": "11.8", "train_wall": "131"} 2021-07-01 12:25:48 | INFO | fairseq.trainer | begin training epoch 2 2021-07-01 12:25:48 | INFO | fairseq_cli.train | Start iterating over samples Traceback (most recent call last): File "/home/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 872, in _try_get_data data = self._data_queue.get(timeout=timeout) File "/usr/lib/python3.8/queue.py", line 178, in get raise Empty _queue.Empty

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 14, in <module>
    cli_main()
  File "/path/asr_pretrain/fairseq-master/fairseq_cli/train.py", line 507, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/path/asr_pretrain/fairseq-master/fairseq/distributed/utils.py", line 354, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/path/asr_pretrain/fairseq-master/fairseq/distributed/utils.py", line 328, in distributed_main
    main(cfg, **kwargs)
  File "/path/asr_pretrain/fairseq-master/fairseq_cli/train.py", line 180, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/path/asr_pretrain/fairseq-master/fairseq_cli/train.py", line 287, in train
    for i, samples in enumerate(progress):
  File "/path/asr_pretrain/fairseq-master/fairseq/logging/progress_bar.py", line 191, in __iter__
    for i, obj in enumerate(self.iterable, start=self.n):
  File "/path/asr_pretrain/fairseq-master/fairseq/data/iterators.py", line 56, in __next__
    x = next(self._itr)
  File "/path/asr_pretrain/fairseq-master/fairseq/data/iterators.py", line 509, in _chunk_iterator
    for x in itr:
  File "/path/asr_pretrain/fairseq-master/fairseq/data/iterators.py", line 56, in __next__
    x = next(self._itr)
  File "/path/asr_pretrain/fairseq-master/fairseq/data/iterators.py", line 637, in __next__
    raise item
  File "/path/asr_pretrain/fairseq-master/fairseq/data/iterators.py", line 567, in run
    for item in self._source:
  File "/home/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1068, in _next_data
    idx, data = self._get_data()
  File "/home/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1024, in _get_data
    success, data = self._try_get_data()
  File "/home/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 885, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 5281, 5288, 5295) exited unexpectedly

jubick1337 commented 3 years ago

Getting the same error on validation, but in the pretraining phase.

Jxu-Thu commented 3 years ago

I can run the training script on fairseq 0.10.2, but 0.10.2 does not support inference with KenLM and flashlight (I also tried wav2letter). I tried the master branch and found that inference works there, but training does not.

jubick1337 commented 3 years ago

0.10.2 supports KenLM as well as other LMs.

Jxu-Thu commented 3 years ago

> 0.10.2 supports KenLM as well as other LMs.

It seems that it does not support TransformerLM? Also, version 0.10.2 can only run inference with CUDA <= 10.2, since KenLM and wav2letter have to be installed at version 0.2.0.

alexeib commented 3 years ago

Looks like a driver, CUDA, or PyTorch problem. Have you tried other versions? What if you run validate.py and point it at your checkpoint? Does it work if you train with dataset.num_workers=0?
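One way to try the last suggestion against the command from the report, which passes `--num-workers 4`, is to rerun the same arguments with only that flag changed to 0, so data loading happens in the main process. A minimal sketch, assuming the training arguments are kept in a Python list; `with_num_workers` is a hypothetical helper for illustration, not part of fairseq:

```python
def with_num_workers(argv, n):
    """Return a copy of argv with the --num-workers value set to n.

    Hypothetical helper: it only edits an argument list and does not
    call into fairseq itself.
    """
    out = list(argv)
    if "--num-workers" in out:
        i = out.index("--num-workers")
        if i + 1 < len(out):
            out[i + 1] = str(n)   # overwrite the existing value
        else:
            out.append(str(n))    # flag was last and had no value
    else:
        out += ["--num-workers", str(n)]
    return out


# Example: the reported command used "--num-workers 4".
args = ["train.py", "--fp16", "--num-workers", "4"]
debug_args = with_num_workers(args, 0)
```

If the crash disappears with 0 workers, that points at the worker processes rather than at the model or the checkpoint.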

jubick1337 commented 3 years ago

@Jxu-Thu Check out W2lFairseqLMDecoder and FairseqLM. You can use almost anything that has __call__() and get_normalized_probs() methods.
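To illustrate the duck-typing idea only, here is a toy object exposing those two methods. This is a sketch under assumptions: `UniformLM` and `looks_like_lm` are hypothetical names, the argument list of `get_normalized_probs` is assumed rather than taken from this thread, and plain Python lists stand in for the tensors a real model would return. It has not been run against W2lFairseqLMDecoder.

```python
import math


class UniformLM:
    """Hypothetical toy LM that assigns equal probability to every token."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    def __call__(self, tokens, **kwargs):
        # One row of identical (zero) logits per input token.
        return [[0.0] * self.vocab_size for _ in tokens]

    def get_normalized_probs(self, net_output, log_probs, sample=None):
        # Uniform distribution, optionally in log space.
        p = 1.0 / self.vocab_size
        value = math.log(p) if log_probs else p
        return [[value] * self.vocab_size for _ in net_output]


def looks_like_lm(obj):
    """Check only that the two methods named above are present."""
    return callable(obj) and callable(getattr(obj, "get_normalized_probs", None))
```

A real replacement model would also have to match the tensor shapes and keyword arguments the decoder passes, which this toy does not attempt.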

Jxu-Thu commented 3 years ago

I tried CUDA 10.2 and 11.0, and torch 1.5.1 and 1.7.1. I get the same error.

Jxu-Thu commented 3 years ago

> @Jxu-Thu Check out W2lFairseqLMDecoder and FairseqLM. You can use almost anything that has __call__() and get_normalized_probs() methods.

Thanks. I will try it.

ccyousa commented 3 years ago

Have you solved this problem? I got a similar error in my own model: RuntimeError: DataLoader worker (pid(s) 2854) exited unexpectedly, caused by _queue.Empty. I set num_workers=1 to avoid blocking between processes, but it doesn't help.

Jxu-Thu commented 3 years ago

@HellowDream Hey. I have not solved the problem yet. I switched to the older version of fairseq to avoid it.

ccyousa commented 3 years ago

@Jxu-Thu thank you anyway, I will continue to try to solve this problem.

codecivi commented 2 years ago

> Looks like a driver, CUDA, or PyTorch problem. Have you tried other versions? What if you run validate.py and point it at your checkpoint? Does it work if you train with dataset.num_workers=0?

I have the same problem. With num_workers=0 it works normally. How can I solve it with more workers?

dannigt commented 2 years ago

Hi, I was training a different model but ran into the same error when saving checkpoints. Turns out my disk space is full. :D
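A quick way to rule this cause out is to check free space on the filesystem that holds the checkpoint directory before saving. A minimal sketch using only the standard library; `has_free_space` and the 5 GiB threshold are hypothetical choices for illustration, and `"."` stands in for the real save directory:

```python
import shutil


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least
    `required_bytes` bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


# Example: warn if less than 5 GiB is free where checkpoints are written.
save_dir = "."  # stand-in for the real checkpoint directory
if not has_free_space(save_dir, 5 * 1024**3):
    print("Low disk space in", save_dir)
```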

weiyx16 commented 2 years ago

The same problem occurs for me during validation. Setting num_workers=0 for the validation part works for me, but it's still a strange bug and I don't know why...