reyoung opened this issue 4 weeks ago (status: Open)
Hi @reyoung, you are right: if the indptr arrays are on the CPU, these operations are not necessary. However, we didn't document that these arrays should be on the CPU, and some of the frameworks that integrate flashinfer have legacy code that passes GPU tensors, so the conversion is mainly for backward compatibility.
If the tensors are already on the CPU, then `indptr_host = indptr.to("cpu")` is a no-op and does not copy:
```python
>>> import torch
>>> x = torch.rand(12)
>>> x.data_ptr()
94804334735872
>>> y = x.to("cpu")
>>> y.data_ptr()  # same pointer: no copy was made
94804334735872
```
However, when `indptr` is a CPU tensor, an unnecessary CPU -> device copy will still be invoked here.
It seems that the current API cannot avoid a cross-device copy. I found that `kv_indptr` should always be on the CPU, while `qo_indptr` is used on both the CPU and CUDA. Maybe it would be better to only add a `cpu_qo_indptr` parameter.
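A minimal sketch of that suggestion, assuming `cpu_qo_indptr` is the new optional parameter (the names and body here are illustrative, not flashinfer's actual implementation):

```python
from typing import Optional

import torch


def plan(
    qo_indptr: torch.Tensor,
    kv_indptr: torch.Tensor,
    cpu_qo_indptr: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    # Sketch only. kv_indptr is assumed to always live on the CPU, so it
    # never needs a device-to-host copy.
    assert kv_indptr.device.type == "cpu"
    # Only materialize a host copy of qo_indptr if the caller did not
    # already provide one; for callers that prepared the data on the CPU,
    # this branch is skipped entirely.
    if cpu_qo_indptr is None:
        cpu_qo_indptr = qo_indptr.to("cpu")
    return cpu_qo_indptr
```

Callers that already hold a CPU copy would pass it in and incur no transfer; legacy callers that only have the device tensor keep today's behavior.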
The current plan methods appear to include unnecessary device-to-CPU copies, as seen in the following links:
https://github.com/flashinfer-ai/flashinfer/blob/d30667b0a23c1cc9135f7557404409ca1a9b9f02/python/flashinfer/decode.py#L722 https://github.com/flashinfer-ai/flashinfer/blob/d30667b0a23c1cc9135f7557404409ca1a9b9f02/python/flashinfer/prefill.py#L1006-L1008
These copies are unnecessary because `qo_indptr` and `append_indptr` are initialized during data preparation and therefore originate on the CPU. The `plan` method could accept an optional host-side indptr parameter; `tensor.to("cpu")` should only be invoked when that parameter is `None`.
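The conditional-copy rule above can be expressed as a small helper (`maybe_to_cpu` is a hypothetical name, not part of flashinfer's API):

```python
from typing import Optional

import torch


def maybe_to_cpu(t: torch.Tensor, host: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Return the caller-supplied host copy if one was given; otherwise
    fall back to a (possibly device-to-CPU) transfer of `t`."""
    return host if host is not None else t.to("cpu")
```

Inside `plan`, each indptr would then be resolved as, e.g., `indptr_host = maybe_to_cpu(indptr, host_indptr)`, so the transfer only happens for callers that do not supply the host tensor.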