bd-iaas-us / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Support flashdecoding++ async unified softmax #20

Open chizhang118 opened 3 months ago

chizhang118 commented 3 months ago

flashdecoding++ paper: https://arxiv.org/abs/2311.01282
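For context, the core idea behind the paper's "asynchronized softmax with unified max value" is that each partition exponentiates its qk scores against a fixed, pre-determined max (φ) instead of its own running max, so partial softmax results can be computed independently without the cross-partition max synchronization and rescaling. A minimal numpy sketch of that idea (names like `phi` are illustrative only, not the paper's kernel or any vLLM API):

```python
# Minimal sketch: standard "safe" softmax (needs the true row max) vs.
# FlashDecoding++-style partial softmax against a fixed unified max `phi`.
import numpy as np

def safe_softmax(scores: np.ndarray) -> np.ndarray:
    # Numerically stable softmax: subtract the true global max.
    m = scores.max()
    e = np.exp(scores - m)
    return e / e.sum()

def unified_max_partial_softmax(score_chunks, phi: float):
    # Each chunk is processed independently: all exponentials are taken
    # against the same constant `phi`, so no cross-chunk max exchange is needed.
    exps = [np.exp(chunk - phi) for chunk in score_chunks]
    denom = sum(e.sum() for e in exps)   # one cheap scalar reduction at the end
    return np.concatenate(exps) / denom

scores = np.random.randn(1024).astype(np.float32) * 3.0
chunks = np.split(scores, 8)             # e.g. 8 KV-cache partitions
phi = 6.0                                # chosen from profiled qk statistics
ref = safe_softmax(scores)
out = unified_max_partial_softmax(chunks, phi)
print(np.max(np.abs(ref - out)))         # tiny, as long as exp(scores - phi) stays in range
```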

chizhang118 commented 3 months ago

Primarily related method paged_attention_v2_reduce_kernel: https://github.com/vllm-project/vllm/blob/main/csrc/attention/attention_kernels.cu#L567
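Roughly, that reduce kernel rescales each partition's exp-sums by exp(max_logit_i - global_max) before combining the partial outputs; with a unified qk_max, the per-partition max exchange and the rescaling would drop out. A rough Python rendition of the combine step, as I understand it (shapes and names are illustrative, not the CUDA code):

```python
# Rough rendition of the rescale-and-combine step done across partitions;
# with a unified qk_max the exp() rescaling below would no longer be needed.
import numpy as np

def reduce_partitions(partial_out, max_logits, exp_sums):
    # partial_out: [num_partitions, head_size]  per-partition attention outputs
    # max_logits:  [num_partitions]             per-partition qk max
    # exp_sums:    [num_partitions]             per-partition sum of exp(qk - qk_max)
    global_max = max_logits.max()
    # Bring every partition's exp-sum onto the common (global) max scale.
    rescaled = exp_sums * np.exp(max_logits - global_max)
    weights = rescaled / rescaled.sum()
    # Weighted combination of the partial outputs.
    return (partial_out * weights[:, None]).sum(axis=0)

out = reduce_partitions(np.random.rand(4, 128), np.random.randn(4), np.random.rand(4))
```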

chizhang118 commented 3 months ago

(screenshot attached: 20240802-144852)

chizhang118 commented 3 months ago

(screenshot attached: 20240802-145033)

JackChuang commented 3 months ago

I'm taking this issue!! Please assign this to me. Thanks.

JackChuang commented 3 months ago

Check different block sizes and maximum qk sizes for a 7B model. Try changing the implementation of the qk calculation.

JackChuang commented 2 months ago

Confirm some implementation-related details:

JackChuang commented 2 months ago

- Tried different qk_max values to understand how they affect inference precision.
- Collected profiling data on qk_max, wrote a script to find the range covering 99% of samples, and determined an appropriate value to use (see the sketch below).
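A hypothetical sketch of the kind of script described above (the synthetic samples and the 99% threshold are stand-ins, not the actual profiling data or script):

```python
# Pick a unified qk_max that covers ~99% of profiled qk_max samples.
import numpy as np

# Stand-in for the collected profiling data (one qk_max sample per decode step / head);
# in practice this would be loaded from the profiler's output.
samples = np.random.normal(loc=4.0, scale=1.5, size=100_000)

unified_qk_max = float(np.percentile(samples, 99))
coverage = float((samples <= unified_qk_max).mean())
print(f"unified qk_max = {unified_qk_max:.3f}, covers {coverage:.1%} of samples")
```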

JackChuang commented 2 months ago

Goal:

Confirm some implementation-related details:

JackChuang commented 1 month ago

Goal:
- [-] Compare perf results (tokens/s): vanilla dAttn vs. unified qk_max dAttn
- [O] Implement per-head shifting-window dAttention code (for ongoing patent writing)
- [-] Compare rollback counts: unified qk_max dAttn vs. shifting-window (phi) dAttn (for ongoing patent writing; see the sketch below)
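As a hypothetical illustration of what "rollback counts" could measure: with a fixed unified qk_max (φ), any partition whose true max exceeds φ beyond some tolerance would have to be recomputed with its real max, and counting those events per run gives the rollback count. Names and thresholds below are illustrative, not from the vLLM code base:

```python
# Count partitions whose real qk max exceeds the assumed unified max `phi`;
# such partitions would need a rollback (recompute with the true max).
import numpy as np

def count_rollbacks(qk_scores: np.ndarray, phi: float, tolerance: float = 0.0) -> int:
    # qk_scores: [num_partitions, partition_size] attention logits per partition
    partition_max = qk_scores.max(axis=1)
    # A partition whose real max exceeds the assumed phi would overflow or lose
    # precision under the unified-max softmax and has to be recomputed.
    return int((partition_max > phi + tolerance).sum())

scores = np.random.randn(64, 512) * 3.0
print(count_rollbacks(scores, phi=6.0))
```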

JackChuang commented 1 month ago