reyoung opened this issue 4 weeks ago (status: Open)
Hi @reyoung, you are right: if the indptr arrays are on the CPU, these operations are not necessary. However, we didn't document that these arrays should be on the CPU, and some of the frameworks that integrate flashinfer have legacy code that passes GPU tensors, so the conversion is mainly for backward compatibility.
If the tensors are already on the CPU, then `indptr_host = indptr.to("cpu")` is a no-op and does not copy:
```python
>>> import torch
>>> x = torch.rand(12)
>>> x.data_ptr()
94804334735872
>>> y = x.to("cpu")
>>> y.data_ptr()  # same pointer: no copy was made
94804334735872
```
However, when `indptr` is a CPU tensor, an unnecessary CPU -> device copy will still be invoked here.
It seems that the current API cannot avoid a cross-device copy. I found that `kv_indptr` should always be on the CPU, while `qo_indptr` is used on both the CPU and CUDA. Maybe it would be better to only add a `cpu_qo_indptr` parameter.
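A minimal sketch of that suggestion, assuming `cpu_qo_indptr` is the new optional parameter (the names and body here are illustrative, not flashinfer's actual implementation):

```python
from typing import Optional

import torch


def plan(
    qo_indptr: torch.Tensor,
    kv_indptr: torch.Tensor,
    cpu_qo_indptr: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    # Sketch only. kv_indptr is assumed to always live on the CPU, so it
    # never needs a device-to-host copy.
    assert kv_indptr.device.type == "cpu"
    # Only materialize a host copy of qo_indptr if the caller did not
    # already provide one; for callers that prepared the data on the CPU,
    # this branch is skipped entirely.
    if cpu_qo_indptr is None:
        cpu_qo_indptr = qo_indptr.to("cpu")
    return cpu_qo_indptr
```

Callers that already hold a CPU copy would pass it in and incur no transfer; legacy callers that only have the device tensor keep today's behavior.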
The current plan methods appear to include unnecessary device-to-CPU copies, as seen in the following links:
https://github.com/flashinfer-ai/flashinfer/blob/d30667b0a23c1cc9135f7557404409ca1a9b9f02/python/flashinfer/decode.py#L722 https://github.com/flashinfer-ai/flashinfer/blob/d30667b0a23c1cc9135f7557404409ca1a9b9f02/python/flashinfer/prefill.py#L1006-L1008
These copies are unnecessary because `qo_indptr` and `append_indptr` are initialized during data preparation and therefore originate on the CPU. The `plan` method could accept an optional host-side indptr parameter; `tensor.to("cpu")` should only be invoked when that parameter is `None`.
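The conditional-copy rule above can be expressed as a small helper (`maybe_to_cpu` is a hypothetical name, not part of flashinfer's API):

```python
from typing import Optional

import torch


def maybe_to_cpu(t: torch.Tensor, host: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Return the caller-supplied host copy if one was given; otherwise
    fall back to a (possibly device-to-CPU) transfer of `t`."""
    return host if host is not None else t.to("cpu")
```

Inside `plan`, each indptr would then be resolved as, e.g., `indptr_host = maybe_to_cpu(indptr, host_indptr)`, so the transfer only happens for callers that do not supply the host tensor.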