wwlaoxi opened this issue 1 year ago
It looks like you're saving the history returned by the first call and feeding it into the second chat()? I haven't run into anything similar. How about printing out the saved history variable and taking a look?
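For example, a minimal sketch of that kind of check (the model and tokenizer variables stand for however you loaded ChatGLM-6B):

```python
# Hypothetical debugging sketch: inspect the history returned by the first
# call before it is passed into the second one.
response, history = model.chat(tokenizer, "first question", history=[])
print(history)  # exactly what the second call will receive

response, history = model.chat(tokenizer, "follow-up question", history=history)
print(response)
```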
Same question here. During GPU inference with batch-size set to 1, I call the chat() function inside a for loop as shown below, so history is empty on every iteration. After N inferences this error is raised. At first I suspected that this calling pattern caused every prompt sent to the model to be counted toward a single context length, and that the error appeared once that length exceeded 2048. But I later noticed that the number of samples processed before each crash varies, and sometimes the cumulative context fed to the model can clearly be shown to have passed 2048 without triggering it, so that is probably not the cause. Hoping someone who knows can point me in the right direction (please!).
```python
for i in range(len(prompts)):
    response, history = self.model.chat(self.tokenizer, prompts[i], history=[])
    print(f"response is {response}")
```
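One way to test the context-length theory above would be to log per-prompt and cumulative token counts and see whether the crash point tracks them. A rough sketch, assuming the same setup as the snippet above:

```python
# Rough sketch: if history=[] really resets state each iteration, the crash
# point should not correlate with the running token total logged here.
total_tokens = 0
for i, prompt in enumerate(prompts):
    n = len(self.tokenizer(prompt)["input_ids"])
    total_tokens += n
    print(f"sample {i}: {n} tokens, running total {total_tokens}")
    response, history = self.model.chat(self.tokenizer, prompt, history=[])
```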
Is there an existing issue for this?
Current Behavior
I am using the /chatGLM-6B-int4 model. Inference on CPU works fine. On GPU, no error is raised as long as history is empty, but as soon as history is non-empty a CUDA error is thrown. Details:

Python 3.8.0 (default, Nov 6 2019, 21:49:08)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

------------------- The above runs normally; running inference again with a non-empty history raises the error below -------------------
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/wwl/anaconda3/envs/torch/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/wwl/.cache/huggingface/modules/transformers_modules/chatGLM-6B-int4/modeling_chatglm.py", line 1285, in chat
    outputs = self.generate(**inputs, **gen_kwargs)
  File "/home/wwl/anaconda3/envs/torch/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/wwl/anaconda3/envs/torch/lib/python3.8/site-packages/transformers/generation/utils.py", line 1452, in generate
    return self.sample(
  File "/home/wwl/anaconda3/envs/torch/lib/python3.8/site-packages/transformers/generation/utils.py", line 2468, in sample
    outputs = self(
  File "/home/wwl/anaconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wwl/.cache/huggingface/modules/transformers_modules/chatGLM-6B-int4/modeling_chatglm.py", line 1204, in forward
    lm_logits = self.lm_head(hidden_states).permute(1, 0, 2).contiguous()
  File "/home/wwl/anaconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wwl/anaconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
```
Location of the error: in the model's bundled quantization.py, inside class W8A16Linear(torch.autograd.Function), the line output = inp.mm(weight.t()) in forward() fails, with inp.shape being torch.Size([906, 4096]) and weight.shape being torch.Size([12288, 4096]).
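To narrow this down, one could rerun a matmul with the same shapes in isolation on the GPU; if that passes, the failure is more likely in the values produced by the int4 dequantization than in cuBLAS itself. A rough sketch, where only the shapes are taken from the report:

```python
# Rough isolation test for the failing GEMM. CUDA errors are raised
# asynchronously, so synchronize to surface them at this exact call.
import torch

inp = torch.randn(906, 4096, dtype=torch.half, device="cuda")
weight = torch.randn(12288, 4096, dtype=torch.half, device="cuda")
output = inp.mm(weight.t())    # same call as in W8A16Linear.forward
torch.cuda.synchronize()
print(output.shape)            # expected: torch.Size([906, 12288])
```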
Expected Behavior
No response
Steps To Reproduce
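A minimal sketch of the call sequence implied by Current Behavior (the loading recipe is assumed from the standard ChatGLM-6B README; the model path is a placeholder for the local /chatGLM-6B-int4 checkpoint):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).half().cuda()
model = model.eval()

# First turn with empty history: works on both CPU and GPU.
response, history = model.chat(tokenizer, "hello", history=[])

# Second turn reusing the returned history: raises the CUBLAS error on GPU.
response, history = model.chat(tokenizer, "tell me more", history=history)
```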
Environment
Anything else?
No response