flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

perf: use page-locked host memory for auxiliary data structure on CPU #253

Closed yzh119 closed 1 month ago

yzh119 commented 1 month ago

We observed that issuing multiple cudaMemcpyAsync calls still incurs non-negligible overhead in the BeginForward functions. This PR accelerates BeginForward by:

  1. Pre-allocating page-locked host memory (pinned memory) for the host-side page data structures.
  2. Issuing a single cudaMemcpyAsync that copies all page data structures at once.
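
The two steps above can be sketched on the host side as follows. This is a minimal illustration, not the code in this PR: the array names (`indptr`, `indices`, `last_page_len`), the `PackedLayout` struct, and the 16-byte alignment are all assumptions made for the example. A plain byte buffer stands in for the pinned allocation so the sketch compiles without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical descriptor: byte offset of each array inside one staging buffer.
struct PackedLayout {
  std::size_t indptr_offset;
  std::size_t indices_offset;
  std::size_t last_page_len_offset;
  std::size_t total_bytes;
};

// Round n up to a multiple of alignment (alignment must be a power of two).
inline std::size_t AlignUp(std::size_t n, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (n + alignment - 1) & ~(alignment - 1);
}

// Lay the three arrays out back to back, each starting on an aligned boundary.
inline PackedLayout ComputeLayout(std::size_t num_indptr, std::size_t num_indices,
                                  std::size_t num_last_page_len,
                                  std::size_t alignment = 16) {
  PackedLayout layout{};
  std::size_t offset = 0;
  layout.indptr_offset = offset;
  offset = AlignUp(offset + num_indptr * sizeof(std::int32_t), alignment);
  layout.indices_offset = offset;
  offset = AlignUp(offset + num_indices * sizeof(std::int32_t), alignment);
  layout.last_page_len_offset = offset;
  offset = AlignUp(offset + num_last_page_len * sizeof(std::int32_t), alignment);
  layout.total_bytes = offset;
  return layout;
}

// Copy all arrays into a pre-allocated staging buffer.
// Assumption: in a CUDA build, `staging` would come from cudaMallocHost
// (allocated once, reused across calls) rather than from ordinary host memory.
inline PackedLayout PackPageData(const std::vector<std::int32_t>& indptr,
                                 const std::vector<std::int32_t>& indices,
                                 const std::vector<std::int32_t>& last_page_len,
                                 std::uint8_t* staging, std::size_t capacity) {
  PackedLayout layout =
      ComputeLayout(indptr.size(), indices.size(), last_page_len.size());
  assert(layout.total_bytes <= capacity);
  auto copy_in = [staging](std::size_t offset, const std::vector<std::int32_t>& src) {
    if (!src.empty()) {
      std::memcpy(staging + offset, src.data(), src.size() * sizeof(std::int32_t));
    }
  };
  copy_in(layout.indptr_offset, indptr);
  copy_in(layout.indices_offset, indices);
  copy_in(layout.last_page_len_offset, last_page_len);
  return layout;
}
```

With a layout like this, one `cudaMemcpyAsync(device_buf, staging, layout.total_bytes, cudaMemcpyHostToDevice, stream)` would replace three separate copies, and each device-side array would be addressed as `device_buf` plus its offset. The copy can only overlap with other work on the stream when `staging` is page-locked, which is why the buffer is allocated up front rather than on every call.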

cc @tqchen