OpenBMB / BMInf

Efficient Inference for Big Models
Apache License 2.0

[BUG] RuntimeError: cublas error: CUBLAS_STATUS_NOT_SUPPORTED #22

Closed: mirrorange closed this issue 3 years ago

mirrorange commented 3 years ago

Running on the Google Colab runtime with 12 GB RAM and a Tesla K80 GPU. NVIDIA-SMI 460.32.03, Driver Version: 460.32.03, CUDA Version: 11.2.

The error is as follows:

```
RuntimeError                              Traceback (most recent call last)
in ()
     25 print("Loading model")
     26 cpm2_1 = bminf.models.CPM2()
---> 27 generate(cpm2_1, input_text)

in generate(model, text)
     16         temperature=0.85,
     17         frequency_penalty=0,
---> 18         presence_penalty=0,
     19     )
     20     text += value

/content/BMInf/bminf/models/cpm2.py in generate(self, input_sentence, max_tokens, top_n, top_p, temperature, frequency_penalty, presence_penalty, stop_tokens)
    217     [len(input_sentence)],
    218     max_tokens, top_n, top_p, temperature,
--> 219     frequency_penalty, presence_penalty, 189
    220 )
    221

/content/BMInf/bminf/models/cpm2.py in pre_processing(self, input_sentence, spans_position, max_tokens, top_n, top_p, temperature, frequency_penalty, presence_penalty, start_span_idx)
    101 input_length = len(idx)
    102
--> 103 ctx = self.encode(np.array([idx], dtype=np.int64), [input_length])
    104 self.init_decoder_context(ctx)
    105

/content/BMInf/bminf/arch/t5/model.py in encode(self, input_idx, input_length)
    236     encoder_attn_mask,
    237     x_pos,
--> 238     True
    239 )
    240 with calc_stream:

/content/BMInf/bminf/layers/transformer_block.py in forward(self, allocator, hidden_state, attention_mask, self_attn_position_bias, inplace)
     40
     41 logger.info("Encoder transformer block -- self attention")
---> 42 x = self.self_attention.forward(allocator, x, attention_mask, self_attn_position_bias)
     43 assert x.dtype == cupy.float16
     44 assert x.shape == (batch_size, dim_model, seq_len)

/content/BMInf/bminf/layers/attention.py in forward(self, allocator, hidden_state, attention_mask, self_attn_position_bias)
     61     self.w_project_qkv.value[i:i+1],
     62     False,
---> 63     qkv_i32
     64 )
     65 elementwise_copy_scale(

/content/BMInf/bminf/functions/gemm.py in igemm(allocator, a, aT, b, bT, c)
     84 device = a.device
     85 stream = cupy.cuda.get_current_stream()
---> 86 _igemm(allocator, a, aT, b, bT, c, device, stream)
     87 return c
     88

/content/BMInf/bminf/functions/gemm.py in _igemm(allocator, a, aT, b, bT, c, device, stream)
    263     0,
    264     0,
--> 265     stream.ptr
    266 ))
    267 if c.shape[2] != trans_ldc:

/content/BMInf/bminf/backend/cublaslt.py in checkCublasStatus(cublas_status)
     99     return
    100 if cublas_status in cublas_errors:
--> 101     raise RuntimeError("cublas error: %s" % cublas_errors[cublas_status])
    102 else:
    103     raise RuntimeError("cublas error code: %d" % cublas_status)

RuntimeError: cublas error: CUBLAS_STATUS_NOT_SUPPORTED
```

The full notebook code is as follows:

```python
!git clone https://github.com/OpenBMB/BMInf.git
%cd BMInf
!python setup.py install

import bminf
import sys

def generate(model : bminf.models.CPM2, text):
    print("Input: ", text)
    sys.stdout.write("Output: %s" % text)
    stoped = False
    while not stoped:
        value, stoped = model.generate(
            input_sentence = text[-32:],
            max_tokens=32,
            top_n=5,
            top_p=None,
            temperature=0.85,
            frequency_penalty=0,
            presence_penalty=0,
        )
        text += value
        sys.stdout.write(value)
        sys.stdout.flush()
    sys.stdout.write("\n")

input_text = input("请输入提示内容:")  # prompt: "Please enter the prompt text:"
print("Loading model")
cpm2_1 = bminf.models.CPM2()
generate(cpm2_1, input_text)
```
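The traceback ends inside BMInf's quantized int8 GEMM path (bminf/functions/gemm.py calling into bminf/backend/cublaslt.py), so one plausible cause (an assumption, not confirmed in this thread) is that the Tesla K80's compute capability (3.7) is too old for the cuBLASLt matmul configuration being requested. A minimal sketch to confirm what the Colab runtime reports, using cupy (already a BMInf dependency):

```python
import cupy

# Report what the Colab runtime actually provides. A Tesla K80 reports
# compute capability '37'; this snippet only inspects the device, it does
# not change BMInf's behaviour.
dev = cupy.cuda.Device(0)
print("Compute capability:", dev.compute_capability)
print("CUDA driver version:", cupy.cuda.runtime.driverGetVersion())
print("CUDA runtime version:", cupy.cuda.runtime.runtimeGetVersion())
```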
a710128 commented 3 years ago

similar to #13
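Issue #13 is not quoted here, but if the root cause is the same hardware limitation, a hedged sketch of a pre-flight check (a hypothetical helper, not part of BMInf's API) could fail fast with a readable message before loading the model, instead of hitting CUBLAS_STATUS_NOT_SUPPORTED mid-inference:

```python
import cupy

# Assumption: BMInf's int8 GEMM path needs compute capability 6.1 (DP4A) or newer.
MIN_COMPUTE_CAPABILITY = 61

def check_gpu_supported(device_id=0):
    # compute_capability is a string such as '37' (K80) or '75' (T4).
    cc = int(cupy.cuda.Device(device_id).compute_capability)
    if cc < MIN_COMPUTE_CAPABILITY:
        raise RuntimeError(
            "GPU compute capability %d.%d is likely too old for BMInf's "
            "quantized kernels" % (cc // 10, cc % 10)
        )

check_gpu_supported()  # run before bminf.models.CPM2()
```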