perf: use page-locked host memory for auxiliary data structure on CPU

flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving

https://flashinfer.ai

Apache License 2.0


perf: use page-locked host memory for auxiliary data structure on CPU #253

Closed yzh119 closed 1 month ago

yzh119 commented 1 month ago

We observed that issuing multiple cudaMemcpyAsync calls still incurs non-negligible overhead in the BeginForward functions. This PR accelerates BeginForward by:

Pre-allocating page-locked host memory (pinned memory) for the host-side page data structures.
Issuing a single cudaMemcpyAsync that copies all page data structures.
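
The two steps above can be illustrated with a minimal host-side sketch. It is not the code from this PR: the names `PackedLayout`, `ComputeLayout`, `PackPageData`, the three example arrays (`indptr`, `indices`, `last_page_len`), and the 16-byte alignment are all assumptions made for illustration. The sketch only shows the packing step, in which several small arrays are laid out back to back in one staging buffer so that one copy can replace several. In a CUDA build the staging buffer would be allocated once with `cudaMallocHost` and then transferred with a single `cudaMemcpyAsync`, as noted in the comments.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical description of where each array lives inside the packed buffer.
struct PackedLayout {
  std::size_t indptr_offset;
  std::size_t indices_offset;
  std::size_t last_page_len_offset;
  std::size_t total_bytes;
};

// Round n up to the next multiple of `alignment` (16 bytes is an assumed value).
inline std::size_t AlignUp(std::size_t n, std::size_t alignment = 16) {
  return (n + alignment - 1) / alignment * alignment;
}

// Compute byte offsets for three int32 arrays stored back to back.
inline PackedLayout ComputeLayout(std::size_t n_indptr, std::size_t n_indices,
                                  std::size_t n_last_page_len) {
  PackedLayout layout{};
  layout.indptr_offset = 0;
  layout.indices_offset = AlignUp(n_indptr * sizeof(int32_t));
  layout.last_page_len_offset =
      AlignUp(layout.indices_offset + n_indices * sizeof(int32_t));
  layout.total_bytes =
      AlignUp(layout.last_page_len_offset + n_last_page_len * sizeof(int32_t));
  return layout;
}

// Copy one array into the staging buffer; empty arrays are skipped.
inline void CopyArray(unsigned char* dst, const std::vector<int32_t>& src) {
  if (!src.empty()) {
    std::memcpy(dst, src.data(), src.size() * sizeof(int32_t));
  }
}

// Pack all arrays into `staging`. Returns false if the buffer is too small.
// In a CUDA build, `staging` would be pinned memory obtained once from
// cudaMallocHost, and the caller would follow this with one
// cudaMemcpyAsync(device_buf, staging, layout->total_bytes,
//                 cudaMemcpyHostToDevice, stream).
inline bool PackPageData(const std::vector<int32_t>& indptr,
                         const std::vector<int32_t>& indices,
                         const std::vector<int32_t>& last_page_len,
                         unsigned char* staging, std::size_t capacity,
                         PackedLayout* layout) {
  *layout = ComputeLayout(indptr.size(), indices.size(), last_page_len.size());
  if (layout->total_bytes > capacity) {
    return false;
  }
  CopyArray(staging + layout->indptr_offset, indptr);
  CopyArray(staging + layout->indices_offset, indices);
  CopyArray(staging + layout->last_page_len_offset, last_page_len);
  return true;
}
```

On the device side, each array would then be addressed as the device base pointer plus the matching offset from `PackedLayout`. Pinned memory matters here because `cudaMemcpyAsync` from pageable host memory is staged through an internal buffer and may not overlap with other work.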

cc @tqchen