ModelTC / lightllm

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
Apache License 2.0

Triton Error [CUDA]: invalid argument #80

Closed: jemmyshin closed this issue 10 months ago

jemmyshin commented 1 year ago

Issue description:

Got a CUDA error when sending a request to the server.

Steps to reproduce:

python -m lightllm.server.api_server --model_dir ~/.cache/huggingface/hub/models--decapoda-research--llama-7b-hf/snapshots/5f98eefcc80e437ef68d457ad7bf167c2c6a1348 --host 0.0.0.0 --port 8080 --tp 1 --max_total_token_num 120

And sending a request with:

import json
import requests

# PROMPT, args.do_sample, and max_new_tokens are defined elsewhere in the client script
url = 'http://localhost:8080/generate'
headers = {'Content-Type': 'application/json'}
data = {
    'inputs': PROMPT,
    'parameters': {
        'do_sample': args.do_sample,
        'ignore_eos': False,
        'max_new_tokens': max_new_tokens,
    }
}
generated_text = requests.post(url, headers=headers, data=json.dumps(data)).json()

Error logging:

Task exception was never retrieved
future: <Task finished name='Task-5' coro=<RouterManager.loop_for_fwd() done, defined at /home/jina/jemfu/lightllm/lightllm/server/router/manager.py:88> exception=RuntimeError('Triton Error [CUDA]: invalid argument')>
Traceback (most recent call last):
  File "<string>", line 21, in _fwd_kernel
KeyError: ('2-.-0-.-0-7d1eb0d2fed8ff2032dccb99c2cc311a-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, 'fp32', torch.int32, torch.int32, torch.float32, torch.float16, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (128, 128, 128), (True, True, True, (False,), True, True, True, True, (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (False, False), (False, True)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/lightllm/lightllm/server/router/manager.py", line 91, in loop_for_fwd
    await self._step()
  File "/home/lightllm/lightllm/server/router/manager.py", line 112, in _step
    await self._prefill_batch(self.running_batch)
  File "/home/lightllm/lightllm/server/router/manager.py", line 149, in _prefill_batch
    ans = await asyncio.gather(*rets)
  File "/home/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 201, in prefill_batch
    ans = self._prefill_batch(batch_id)
  File "/home/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/home/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 77, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/home/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 128, in forward
    logits = self.model.forward(**kwargs)
  File "/home/miniconda3/envs/jemfu/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/lightllm/lightllm/models/llama/layer_infer/model.py", line 116, in forward
    predict_logics = self._context_forward(input_ids, infer_state)
  File "/home/lightllm/lightllm/models/llama/layer_infer/model.py", line 154, in _context_forward
    input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
  File "/home/lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 117, in context_forward
    self._context_flash_attention(input_embdings,
  File "/home/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
    ans = func(*args, **kwargs)
  File "/home//lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 76, in _context_flash_attention
    context_attention_fwd(q.view(calcu_shape1),
  File "/home/miniconda3/envs/jemfu/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/jemfu/lightllm/lightllm/models/llama/triton_kernel/context_flashattention_nopad.py", line 224, in context_attention_fwd
    _fwd_kernel[grid](
  File "/home/miniconda3/envs/jemfu/lib/python3.10/site-packages/triton/runtime/jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "<string>", line 43, in _fwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument

Environment:


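For reference, a quick way to capture the environment details that matter here (PyTorch, CUDA, and Triton versions, plus the GPU model) is a short snippet like the one below; it relies only on standard torch and triton attributes, nothing lightllm-specific.

import torch
import triton

# Print the versions and GPU details relevant to this issue
print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("triton:", triton.__version__)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))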
llehtahw commented 1 year ago

@jemmyshin Sorry for the late response.

We have not tested Turing cards thoroughly yet, so there may be kernel or Triton issues. See also #34.

@hiworldwzj is there any progress on T4 support?

HelloCard commented 1 year ago

I ran into the same problem. On WSL2, it happens both when installing via requirements.txt and when pulling and running the Docker image directly. The GPU is a 2080 Ti 22G, CUDA version 11.8.

hiworldwzj commented 1 year ago

@jemmyshin T4 is not supported yet.

hiworldwzj commented 1 year ago

> I ran into the same problem. On WSL2, it happens both when installing via requirements.txt and when pulling and running the Docker image directly. The GPU is a 2080 Ti 22G, CUDA version 11.8.

@HelloCard Hi, I don't think the 2080 Ti can be supported well at the moment. The main problem is with Turing-architecture cards: the Triton version we currently use cannot compile the kernels properly for them. Ampere-architecture cards such as the 3090 and 4090 are well supported, though.
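As a quick way to tell whether a card falls into the affected Turing generation (the T4 and 2080 Ti report compute capability 7.5, while the 3090 and newer report 8.x), a check like the following can be used; this is just an illustrative snippet, not part of lightllm.

import torch

# Turing GPUs (T4, RTX 20xx) report compute capability 7.5;
# the 3090 and newer report 8.x and do not hit this limitation.
major, minor = torch.cuda.get_device_capability(0)
if (major, minor) == (7, 5):
    print("Turing GPU detected: the Triton kernels used by lightllm may fail to compile here.")
else:
    print(f"Compute capability {major}.{minor}: should not hit this particular limitation.")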