UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

CUDA errors when using models that have been imported from HF and trained with SentenceTransformers #324

Open RobertHua96 opened 4 years ago

RobertHua96 commented 4 years ago

Hi,

Expected behaviour: When I create a SentenceTransformer model by importing a HF model and fine-tuning it with the NLI code example, it should work when encoding text.

Actual behaviour: CUDA errors occur when trying to embed text. [screenshot of the CUDA error]

The pretrained models from the SentenceTransformers package are able to embed this text without errors.

How the model was initialised:

[screenshot of the model initialisation code]
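For reference, a minimal sketch of how a SentenceTransformer is typically assembled from a Hugging Face checkpoint using the sentence-transformers models API; the checkpoint name and sequence length below are placeholders, not the exact code from the screenshot:

from sentence_transformers import SentenceTransformer, models

# Hypothetical HF checkpoint; substitute the model that was actually imported.
word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=128)

# Mean pooling over token embeddings yields a fixed-size sentence vector.
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# After fine-tuning with the NLI example, encoding should work without CUDA errors:
embeddings = model.encode(['This is a test sentence.'])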

This error still occurs even for models trained from scratch without layer freezing.

Could someone let me know what could be going wrong?

nreimers commented 4 years ago

What does your complete code look like?

Do these examples work when you do not change anything?
https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_transformers/training_nli.py
https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_transformers/training_stsbenchmark.py

yuwon commented 4 years ago

I have the same issue.

As mentioned in https://github.com/UKPLab/sentence-transformers#training, I first downloaded NLI and STS data and tried to run training_nli.py.

However, I got the following error:

Iteration:   0%|          | 33/58880 [00:05<2:57:22,  5.53it/s]
Epoch:   0%|          | 0/1 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "/home/user/workspace/embedding_cluster/scripts/train_nli.py", line 73, in <module>
    model.fit(train_objectives=[(train_dataloader, train_loss)],
  File "/home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 407, in fit
    loss_value.backward()
  File "/home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/autograd/__init__.py", line 98, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)` (gemm<float> at /pytorch/aten/src/ATen/cuda/CUDABlas.cpp:165)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f4d33142536 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xf4bc97 (0x7f4d344e9c97 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x13f589d (0x7f4d3499389d in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: THCudaTensor_addmm + 0x5c (0x7f4d3499d44c in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x1041c58 (0x7f4d345dfc58 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xf65018 (0x7f4d34503018 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x10c2780 (0x7f4d70e72780 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x2c9b47e (0x7f4d72a4b47e in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x10c2780 (0x7f4d70e72780 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::Tensor::mm(at::Tensor const&) const + 0xf0 (0x7f4d70a35930 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x28e6b5c (0x7f4d72696b5c in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::generated::MmBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x151 (0x7f4d72697961 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x2d89705 (0x7f4d72b39705 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7f4d72b36a03 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7f4d72b377e2 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f4d72b2fe59 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f4d7f4735f8 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #17: <unknown function> + 0xbd6df (0x7f4d802ff6df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #18: <unknown function> + 0x76db (0x7f4d822886db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #19: clone + 0x3f (0x7f4d81fb1a3f in /lib/x86_64-linux-gnu/libc.so.6)

Environment: CUDA 10.2, Nvidia driver 440.95.01, pytorch 1.5.1, transformers 3.0.2, sentence-transformers 0.3.2

nreimers commented 4 years ago

Hi @yuwon, I tested the script with CUDA 9.2 and CUDA 10.1. For CUDA 10.2, my installed driver is sadly too old. With CUDA 9.2 / CUDA 10.1 it works.

Sadly I don't know where the error comes from. It appears to be some issue with CUDA/PyTorch. Maybe you can try it with a different CUDA / PyTorch version?

Best Nils Reimers

braaannigan commented 4 years ago

Hi @yuwon, I had lots of problems like this. I've moved to developing in a Docker container with an official PyTorch CUDA base image and have never had problems since. Blog post on developing in Docker here: http://braaannigan.github.io/software/2020/07/26/dev_in_docker.html

yuwon commented 4 years ago

Thanks @braaannigan. Yes, I've also tried PyTorch's official Docker image, but that also failed.

olastor commented 2 years ago

I also encountered the same error as @yuwon (CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm...). I have CUDA 11.6 and had torch==1.8.2+cu111 installed. After uninstalling PyTorch and installing the nightly version as suggested in https://github.com/allenai/allennlp/issues/5064#issuecomment-854086948, it now seems to work with CUDA 11.
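Since these failures often come down to a mismatch between the installed driver/toolkit and the CUDA version PyTorch was built against, a quick sanity check like the following (a sketch using standard torch calls, not code from this thread) can help confirm what is actually in use before reinstalling:

import torch

# Version PyTorch itself was compiled against (e.g. '11.1' for a +cu111 wheel).
print('torch:', torch.__version__)
print('torch built with CUDA:', torch.version.cuda)
print('cuDNN:', torch.backends.cudnn.version())

# Whether the installed driver is usable from this build.
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('device:', torch.cuda.get_device_name(0))
    # A tiny matmul on the GPU exercises the same cuBLAS path that fails above.
    a = torch.randn(8, 8, device='cuda')
    print('matmul OK:', (a @ a).sum().item())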