jason-huang03 opened 2 months ago
Hi @jason-huang03, which version of flashinfer were you using? I suppose the issue should have been fixed in 0.0.9.
I can't reproduce it with the latest version of flashinfer (v0.1.5).
I checked out v0.1.5 and rebuilt using `pip install --no-cache-dir --force-reinstall -e .`. However, the problem persists. The full error message is:
```
CUDA Error: an illegal memory access was encountered (700) /mnt/huanghaofeng/flashinfer/python/include/flashinfer/attention/decode.cuh: line 658 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
Traceback (most recent call last):
  File "/mnt/huanghaofeng/flashinfer/test.py", line 19, in <module>
    o_rope_on_the_fly = flashinfer.single_decode_with_kv_cache(q, k, v, pos_encoding_mode="ROPE_LLAMA") # decode with LLaMA style RoPE on-the-fly
  File "/mnt/huanghaofeng/flashinfer/python/flashinfer/decode.py", line 194, in single_decode_with_kv_cache
    out = _decode.single_decode_with_kv_cache(
RuntimeError: SingleDecodeWithKVCache kernel launch failed, error: an illegal memory access was encountered
```
You can see that the problem is from `cudaFuncSetAttribute`.
I am using CUDA 11.8 and torch 2.2.0, inside a containerized development environment. Could this be the problem?
Also, I find that `device_id` in the function `SinglePrefillWithKVCacheDispatched` in `python/include/flashinfer/attention/prefill.cuh` seems to be 0 regardless of the device_id set in the Python code. When I print with `std::cout`, `device.index()` here comes out empty, but `device` is correct (e.g. cuda:1). I am now trying CUDA 12.4 and torch 2.4 to see whether that solves the problem.
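For context, a small Python-side sanity check (my own snippet, not flashinfer code, with placeholder shapes): the tensors can live on cuda:1 while the current CUDA device is still 0, which would be consistent with the `device_id` coming out as 0. Wrapping the call in `torch.cuda.device` is the workaround I would expect to help if that is the cause.

```python
import torch
import flashinfer

# Placeholder shapes, just for the check.
q = torch.randn(32, 128, dtype=torch.half, device="cuda:1")
k = torch.randn(2048, 32, 128, dtype=torch.half, device="cuda:1")
v = torch.randn(2048, 32, 128, dtype=torch.half, device="cuda:1")

print(q.device.index)               # 1 -> the tensors really are on cuda:1
print(torch.cuda.current_device())  # 0 -> but the current CUDA device is still 0

# Force the current device to match the tensors before calling the kernel.
with torch.cuda.device(q.device):
    o = flashinfer.single_decode_with_kv_cache(q, k, v, pos_encoding_mode="ROPE_LLAMA")
```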
After switching to PyTorch 2.4 and CUDA 12.4, the error disappears. Thanks for your time. It seems the device and device-index API has changed across CUDA or PyTorch versions.
Thanks for reporting, I'll check the behavior on cu118 platforms.
This is from the given example in the repo:
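Roughly the following (a sketch: the tensor shapes are placeholders, and the second call is the one at test.py line 19 in the traceback above):

```python
import torch
import flashinfer

device_id = 1  # device_id = 0 works; device_id = 1 triggers the error
kv_len, num_kv_heads, num_qo_heads, head_dim = 2048, 32, 32, 128  # placeholder sizes

k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device=f"cuda:{device_id}")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device=f"cuda:{device_id}")
q = torch.randn(num_qo_heads, head_dim, dtype=torch.half, device=f"cuda:{device_id}")

o = flashinfer.single_decode_with_kv_cache(q, k, v)  # decode attention without RoPE on-the-fly
o_rope_on_the_fly = flashinfer.single_decode_with_kv_cache(
    q, k, v, pos_encoding_mode="ROPE_LLAMA"
)  # decode with LLaMA style RoPE on-the-fly
```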
When `device_id=0`, everything is fine. However, when `device_id=1`, the illegal memory access error shown above is thrown. I am using an A100 (SM 80). I thought the problem should have been solved in the commit related to #349, but I still hit this weird issue. I want to deploy a 70B model on multiple GPUs, so being able to run the kernel on different GPUs is really important. Can you see why this happens? Thanks a lot!
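Concretely, what I need to work is something like this, with each shard of the model driving the kernel on its own GPU (a simplified sketch with placeholder shapes, not my real deployment code):

```python
import torch
import flashinfer

kv_len, num_kv_heads, num_qo_heads, head_dim = 2048, 32, 32, 128  # placeholder sizes

# One decode call per GPU, each operating on that GPU's own tensors.
for device_id in range(torch.cuda.device_count()):
    dev = f"cuda:{device_id}"
    q = torch.randn(num_qo_heads, head_dim, dtype=torch.half, device=dev)
    k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device=dev)
    v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device=dev)
    o = flashinfer.single_decode_with_kv_cache(q, k, v, pos_encoding_mode="ROPE_LLAMA")
```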