1024er / cbert_aug


RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)` (createCublasHandle at /pytorch/aten/src/ATen/cuda/CublasHandlePool.cpp:8) #5

Open lhuang9703 opened 4 years ago

lhuang9703 commented 4 years ago

Hi, when I run your code with `python cbert_finetune.py`, I get the following error:

Traceback (most recent call last):
  File "cbert_finetune.py", line 168, in <module>
    main()
  File "cbert_finetune.py", line 151, in main
    loss.backward()
  File "/home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)` (createCublasHandle at /pytorch/aten/src/ATen/cuda/CublasHandlePool.cpp:8)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f409fe5f536 in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0xf67ee5 (0x7f40a1222ee5 in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #2: at::cuda::getCurrentCUDABlasHandle() + 0x94c (0x7f40a1223ccc in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xf5d5e1 (0x7f40a12185e1 in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x14079bd (0x7f40a16c29bd in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: THCudaTensor_addmm + 0x5c (0x7f40a16cc56c in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0x1053a08 (0x7f40a130ea08 in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xf76dc8 (0x7f40a1231dc8 in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #8: + 0x10c3ec0 (0x7f40dd807ec0 in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #9: + 0x2c9b6fe (0x7f40df3df6fe in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #10: + 0x10c3ec0 (0x7f40dd807ec0 in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #11: at::Tensor::mm(at::Tensor const&) const + 0xf0 (0x7f40dd3cbb70 in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x28e6b6c (0x7f40df02ab6c in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::generated::MmBackward::apply(std::vector<at::Tensor, std::allocator >&&) + 0x151 (0x7f40df02b971 in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #14: + 0x2d89c05 (0x7f40df4cdc05 in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::evaluate_function(std::shared_ptr&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7f40df4caf03 in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::Engine::thread_main(std::shared_ptr const&, bool) + 0x3d2 (0x7f40df4cbce2 in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f40df4c4359 in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f40ebc034d8 in /home1/wxzuo/anaconda3/envs/ccs/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #19: + 0xb8408 (0x7f40ecab7408 in /home1/wxzuo/anaconda3/lib/libstdc++.so.6)
frame #20: + 0x7e25 (0x7f41212c8e25 in /lib64/libpthread.so.0)
frame #21: clone + 0x6d (0x7f41206e1bad in /lib64/libc.so.6)

Here is my environment:

Package         Version
--------------- ------------
certifi         2020.4.5.2
chardet         3.0.4
click           7.1.2
ConfigArgParse  1.2.3
cycler          0.10.0
Cython          3.0a5
dataclasses     0.7
decorator       4.1.2
dgl             0.4.3.post2
filelock        3.0.12
future          0.18.2
idna            2.9
joblib          0.15.1
kiwisolver      1.2.0
matplotlib      3.2.2
networkx        2.1
nltk            3.5
numpy           1.13.3
packaging       20.4
pandas          1.0.4
Pillow          7.1.2
pip             20.1.1
psutil          5.7.0
pycocotools     2.0
pyparsing       2.4.7
python-dateutil 2.8.1
pytz            2020.1
regex           2020.6.8
requests        2.23.0
sacremoses      0.0.43
scikit-learn    0.23.1
scipy           1.4.1
sentencepiece   0.1.91
setuptools      36.4.0
six             1.15.0
sklearn         0.0
stanfordcorenlp 3.9.1.1
threadpoolctl   2.1.0
tokenizers      0.7.0
torch           1.5.0
torchtext       0.6.0
torchvision     0.6.0
tqdm            4.46.1
transformers    2.11.0
urllib3         1.25.9
wheel           0.29.0

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

Could you please tell me how to solve this problem? Thanks.

smolPixel commented 3 years ago

It took a lot of attempts, but you need to use transformers==2.1.1 for it to work.

wangcongcong123 commented 2 years ago

If you want to use the latest transformers, just change line 200 of cbert_utils.py from `original_masked_lm_labels = [-1] * max_seq_length` to `original_masked_lm_labels = [-100] * max_seq_length`. Then you are good to go.
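A possible explanation for why this change helps, offered as an assumption rather than something confirmed in this thread: `torch.nn.CrossEntropyLoss` ignores targets equal to -100 by default, so if the installed transformers version relies on that default, a label of -1 is no longer skipped and instead gets treated as an out-of-range class index. On a GPU, such an index error can surface as an unrelated-looking CUDA failure. The sketch below uses plain Python and hypothetical helper names (`build_masked_lm_labels`, `find_invalid_labels`, neither of which comes from cbert_utils.py) and made-up token ids to show which label positions would be problematic under each convention.

```python
# Sketch only: the helper names and token ids are hypothetical,
# not taken from cbert_utils.py.
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_masked_lm_labels(original_token_ids, masked_positions,
                           max_seq_length, ignore_index=IGNORE_INDEX):
    """Return per-position labels; unmasked positions get ignore_index."""
    labels = [ignore_index] * max_seq_length
    for pos in masked_positions:
        labels[pos] = original_token_ids[pos]
    return labels


def find_invalid_labels(labels, vocab_size, ignore_index=IGNORE_INDEX):
    """List positions whose label is neither ignored nor a valid class id."""
    return [i for i, label in enumerate(labels)
            if label != ignore_index and not 0 <= label < vocab_size]


original_token_ids = [101, 2023, 3185, 2003, 102, 0, 0, 0]  # padded to 8

# Labels built with -100 (the suggested fix) and with -1 (the old line).
new_labels = build_masked_lm_labels(original_token_ids, [2], max_seq_length=8)
old_labels = build_masked_lm_labels(original_token_ids, [2], max_seq_length=8,
                                    ignore_index=-1)

# Check both against a loss that only skips -100.
print(find_invalid_labels(new_labels, vocab_size=30522))
print(find_invalid_labels(old_labels, vocab_size=30522))
```

With a loss that skips only -100, the first list is empty, while the second contains every unmasked position, because each of them carries the label -1. To see the underlying error more directly, running the script on CPU usually produces a clearer index-out-of-range message than the CUDA error does.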