Closed: Tomorrowdawn closed this issue 2 months ago.
The error went away after I set torch_current_stream to nullptr (the default stream):
https://github.com/flashinfer-ai/flashinfer/blob/dc2c76f8577d8695112b61d1fd43ef88569272ef/python/csrc/page.cu#L68
This is because getCurrentCUDAStream() returns the current CUDA stream of the current GPU (device 0), which does not work for tensors that live on GPU 1. We should take the device index of the input tensors into account and select the stream accordingly.
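The mismatch is easy to observe from Python (a minimal sketch using only PyTorch's stream query, not flashinfer code; it assumes a machine with at least two GPUs):

import torch

# An input tensor on the second GPU.
x = torch.randn(4, device='cuda:1')

# With no argument, PyTorch reports the stream of the *current* device,
# which stays cuda:0 unless torch.cuda.set_device() is called.
print(torch.cuda.current_stream())          # stream on cuda:0
# The stream a kernel launch should actually use belongs to the device
# the input tensors live on.
print(torch.cuda.current_stream(x.device))  # stream on cuda:1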
A temporary workaround (before we release v0.0.8) is to set the current GPU to 1 explicitly:

import torch
import flashinfer

DEVICE = 'cuda:1'
torch.cuda.set_device(DEVICE)  # the workaround: make cuda:1 the current device

shape = (8, 2, 128)  # (num_tokens, num_heads, head_dim)
dtype = torch.float16
ragged_keys = torch.randn(shape, dtype=dtype).to(DEVICE)
ragged_values = torch.randn(shape, dtype=dtype).to(DEVICE)
query_indptr = torch.tensor([0, 8], dtype=torch.int32).to(DEVICE)
# Paged KV cache: (num_pages, 2, page_size, num_heads, head_dim) in the NHD layout.
cache = torch.empty((100, 2, 16, 2, 128), dtype=dtype).to(DEVICE)
kv_indices = torch.tensor([0], dtype=torch.int32).to(DEVICE)
kv_last_page_lens = torch.tensor([8], dtype=torch.int32).to(DEVICE)
kv_indptr = torch.tensor([0, 1], dtype=torch.int32).to(DEVICE)
torch.cuda.synchronize()  # everything is fine until here
flashinfer.page.append_paged_kv_cache(
    ragged_keys,
    ragged_values,
    query_indptr,
    cache,
    kv_indices,
    kv_indptr,
    kv_last_page_lens,
)
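If changing the process-wide current device is undesirable, the same workaround can be scoped (a sketch using PyTorch's torch.cuda.device context manager, which should be equivalent and restores the previous device on exit):

# Scoped variant: cuda:1 is the current device only inside the block.
with torch.cuda.device(DEVICE):
    flashinfer.page.append_paged_kv_cache(
        ragged_keys,
        ragged_values,
        query_indptr,
        cache,
        kv_indices,
        kv_indptr,
        kv_last_page_lens,
    )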
Minimal reproduction showcase: the first snippet above, minus the torch.cuda.set_device(DEVICE) workaround line. When DEVICE is set to 'cuda:0', everything works as expected. However, setting it to any other device (e.g., 'cuda:1') results in a runtime error.
After numerous trials, I am confident that this issue is not hardware-dependent.
A temporary workaround is to use CUDA_VISIBLE_DEVICES=1, which allows the code to run correctly (since PyTorch then sees that GPU as cuda:0). However, this approach prevents a single process from using multiple GPUs, which matters because many large models cannot perform inference on a single GPU.
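For completeness, a sketch of that environment-variable workaround (the variable must be set before CUDA initializes, i.e., before the first CUDA call, and most simply before importing torch):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'  # must be set before CUDA initializes

import torch
print(torch.cuda.device_count())    # 1: only the second physical GPU is visible
print(torch.cuda.current_device())  # 0: that GPU is remapped to cuda:0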
This does indeed seem quite strange; I took a cursory look at the CUDA code, yet I'm at a loss as to how such an odd error could occur.