THUDM / ChatGLM-6B

ChatGLM-6B: An Open Bilingual Dialogue Language Model | 开源双语对话语言模型
Apache License 2.0

CUDA error during GPU inference whenever history is not empty #988

Open wwlaoxi opened 1 year ago

wwlaoxi commented 1 year ago

Is there an existing issue for this?

Current Behavior

I am using the /chatGLM-6B-int4 model. Inference on CPU works fine. On GPU, nothing goes wrong as long as history is empty, but as soon as history is non-empty a CUDA error is raised. The full session is below:

```
Python 3.8.0 (default, Nov  6 2019, 21:49:08)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("/home/wwl/chatGLM-6B-int4", trust_remote_code=True, revision="")
>>> model = AutoModel.from_pretrained("/home/wwl/chatGLM-6B-int4", trust_remote_code=True, revision="").half().cuda()
No compiled kernel found.
Compiling kernels : /home/wwl/.cache/huggingface/modules/transformers_modules/chatGLM-6B-int4/quantization_kernels_parallel.c
Compiling gcc -O3 -fPIC -pthread -fopenmp -std=c99 /home/wwl/.cache/huggingface/modules/transformers_modules/chatGLM-6B-int4/quantization_kernels_parallel.c -shared -o /home/wwl/.cache/huggingface/modules/transformers_modules/chatGLM-6B-int4/quantization_kernels_parallel.so
Load kernel : /home/wwl/.cache/huggingface/modules/transformers_modules/chatGLM-6B-int4/quantization_kernels_parallel.so
Setting CPU quantization kernel threads to 6
Using quantization cache
Applying quantization to glm layers
>>> model = model.eval()
>>> response, history = model.chat(tokenizer, "哈哈", history=[])
The dtype of attention mask (torch.int64) is not bool
>>> print(response)
你好,有什么我可以帮助你的吗?
```

------------------- The above works fine; the error appears when inference is run again with a non-empty history -----------------

```
>>> response, history = model.chat(tokenizer, "哈哈", history=history)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/wwl/anaconda3/envs/torch/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/wwl/.cache/huggingface/modules/transformers_modules/chatGLM-6B-int4/modeling_chatglm.py", line 1285, in chat
    outputs = self.generate(**inputs, **gen_kwargs)
  File "/home/wwl/anaconda3/envs/torch/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/wwl/anaconda3/envs/torch/lib/python3.8/site-packages/transformers/generation/utils.py", line 1452, in generate
    return self.sample(
  File "/home/wwl/anaconda3/envs/torch/lib/python3.8/site-packages/transformers/generation/utils.py", line 2468, in sample
    outputs = self(
  File "/home/wwl/anaconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wwl/.cache/huggingface/modules/transformers_modules/chatGLM-6B-int4/modeling_chatglm.py", line 1204, in forward
    lm_logits = self.lm_head(hidden_states).permute(1, 0, 2).contiguous()
  File "/home/wwl/anaconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wwl/anaconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)
```

Location of the failing code: in the model's own quantization.py, inside class W8A16Linear(torch.autograd.Function), the line `output = inp.mm(weight.t())` in forward is where the error is raised, with inp.shape = torch.Size([906, 4096]) and weight.shape = torch.Size([12288, 4096]).
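For reference, a minimal, hypothetical sketch (not taken from the repo) that isolates a half-precision GEMM with the same shapes; if this alone already fails with CUBLAS_STATUS_EXECUTION_FAILED on the same machine, the problem lies in the fp16 matmul on that GPU/CUDA combination rather than in the model code:

```python
import torch

# Hypothetical repro of the failing matmul in W8A16Linear.forward,
# using the shapes reported above.
inp = torch.randn(906, 4096, dtype=torch.half, device="cuda")
weight = torch.randn(12288, 4096, dtype=torch.half, device="cuda")

out = inp.mm(weight.t())   # same call pattern as output = inp.mm(weight.t())
torch.cuda.synchronize()   # force the kernel to finish so any CUDA error surfaces here
print(out.shape)           # expected: torch.Size([906, 12288])
```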

Expected Behavior

No response

Steps To Reproduce

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("/home/wwl/chatGLM-6B-int4", trust_remote_code=True, revision="")
model = AutoModel.from_pretrained("/home/wwl/chatGLM-6B-int4", trust_remote_code=True, revision="").half().cuda()
model = model.eval()

response, history = model.chat(tokenizer, "哈哈", history=[])
response, history = model.chat(tokenizer, "哈哈", history=history)  # the error is raised here
```
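When chasing CUDA errors like this one, it can help to force synchronous kernel launches so the traceback points at the kernel that actually fails rather than at whichever later call happens to notice the error. A hypothetical variant of the reproduction with that flag set:

```python
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("/home/wwl/chatGLM-6B-int4", trust_remote_code=True, revision="")
model = AutoModel.from_pretrained("/home/wwl/chatGLM-6B-int4", trust_remote_code=True, revision="").half().cuda()
model = model.eval()

response, history = model.chat(tokenizer, "哈哈", history=[])
response, history = model.chat(tokenizer, "哈哈", history=history)  # previously failed here
```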

Environment

- OS: Ubuntu 18.04
- Python: 3.8
- Transformers: 4.27.1
- PyTorch: 1.12.0 (py3.8_cuda10.2_cudnn7.6.5_0)
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): True
- CUDA: 10.2.89
- CUDNN: 7.6.5

Anything else?

No response

xiabo0816 commented 1 year ago

It looks like you are saving the history returned by the first call and feeding it into the second chat()? I haven't run into this myself; if nothing else, try printing out the saved history variable and take a look.
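For example, a minimal sketch of that check (assuming the usual ChatGLM-6B convention that chat() returns history as a list of (query, response) tuples):

```python
# Hypothetical inspection sketch: dump the history returned by the first call
# before feeding it back into the second one.
response, history = model.chat(tokenizer, "哈哈", history=[])

print(type(history), len(history))
for i, (query, reply) in enumerate(history):
    print(f"turn {i}: query={query!r} ({len(query)} chars), reply={reply!r} ({len(reply)} chars)")

response, history = model.chat(tokenizer, "哈哈", history=history)  # reportedly fails here on GPU
```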

sevenzard commented 1 year ago

Same question here. During GPU inference with batch-size set to 1, I call chat() in a for loop as shown below. As you can see, history is empty on every iteration, yet after N inferences the same error is raised. My first guess was that this calling pattern accumulates every prompt into one running context, and the error appears once that length exceeds 2048. But the number of samples processed before the error differs from run to run, and sometimes the accumulated context had clearly already exceeded 2048 without failing, so that probably isn't the cause. Hoping someone who knows can point me in the right direction.


```python
for i in range(len(prompts)):
    response, history = self.model.chat(self.tokenizer, prompts[i], history=[])
    print(f"response is {response}")
```
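To narrow that down, here is a hypothetical instrumentation sketch (not from the repo; it reuses the `prompts`, `self.model`, and `self.tokenizer` names from the loop above) that logs the prompt length and the allocated GPU memory on each iteration, so the failing step can be correlated with either:

```python
import torch

# Hypothetical instrumentation of the loop above: log prompt length and GPU
# memory usage per iteration to see what the failing step correlates with.
for i in range(len(prompts)):
    before = torch.cuda.memory_allocated() / 1024**2
    response, history = self.model.chat(self.tokenizer, prompts[i], history=[])
    torch.cuda.synchronize()  # surface any asynchronous CUDA error at this step
    after = torch.cuda.memory_allocated() / 1024**2
    print(f"step {i}: prompt_len={len(prompts[i])}, "
          f"mem {before:.0f} MiB -> {after:.0f} MiB, response_len={len(response)}")
```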