bd-iaas-us / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Support flashdecoding++ async unified softmax #20

Open chizhang118 opened 3 months ago

chizhang118 commented 3 months ago

flashdecoding++ paper: https://arxiv.org/abs/2311.01282
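For context, the core idea behind the paper's "asynchronized softmax with unified max value" is that each partition exponentiates its qk scores against a fixed, pre-determined max (φ) instead of its own running max, so partial softmax results can be computed independently without the cross-partition max synchronization and rescaling. A minimal numpy sketch of that idea (names like `phi` are illustrative only, not the paper's kernel or any vLLM API):

```python
# Minimal sketch: standard "safe" softmax (needs the true row max) vs.
# FlashDecoding++-style partial softmax against a fixed unified max `phi`.
import numpy as np

def safe_softmax(scores: np.ndarray) -> np.ndarray:
    # Numerically stable softmax: subtract the true global max.
    m = scores.max()
    e = np.exp(scores - m)
    return e / e.sum()

def unified_max_partial_softmax(score_chunks, phi: float):
    # Each chunk is processed independently: all exponentials are taken
    # against the same constant `phi`, so no cross-chunk max exchange is needed.
    exps = [np.exp(chunk - phi) for chunk in score_chunks]
    denom = sum(e.sum() for e in exps)   # one cheap scalar reduction at the end
    return np.concatenate(exps) / denom

scores = np.random.randn(1024).astype(np.float32) * 3.0
chunks = np.split(scores, 8)             # e.g. 8 KV-cache partitions
phi = 6.0                                # chosen from profiled qk statistics
ref = safe_softmax(scores)
out = unified_max_partial_softmax(chunks, phi)
print(np.max(np.abs(ref - out)))         # tiny, as long as exp(scores - phi) stays in range
```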

chizhang118 commented 3 months ago

Primarily related method paged_attention_v2_reduce_kernel: https://github.com/vllm-project/vllm/blob/main/csrc/attention/attention_kernels.cu#L567
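Roughly, that reduce kernel rescales each partition's exp-sums by exp(max_logit_i - global_max) before combining the partial outputs; with a unified qk_max, the per-partition max exchange and the rescaling would drop out. A rough Python rendition of the combine step, as I understand it (shapes and names are illustrative, not the CUDA code):

```python
# Rough rendition of the rescale-and-combine step done across partitions;
# with a unified qk_max the exp() rescaling below would no longer be needed.
import numpy as np

def reduce_partitions(partial_out, max_logits, exp_sums):
    # partial_out: [num_partitions, head_size]  per-partition attention outputs
    # max_logits:  [num_partitions]             per-partition qk max
    # exp_sums:    [num_partitions]             per-partition sum of exp(qk - qk_max)
    global_max = max_logits.max()
    # Bring every partition's exp-sum onto the common (global) max scale.
    rescaled = exp_sums * np.exp(max_logits - global_max)
    weights = rescaled / rescaled.sum()
    # Weighted combination of the partial outputs.
    return (partial_out * weights[:, None]).sum(axis=0)

out = reduce_partitions(np.random.rand(4, 128), np.random.randn(4), np.random.rand(4))
```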

chizhang118 commented 3 months ago

(screenshot attached: 20240802-144852)

chizhang118 commented 3 months ago

(screenshot attached: 20240802-145033)

JackChuang commented 3 months ago

I'm taking this issue!! Please assign this to me. Thanks.

JackChuang commented 3 months ago

Check different block sizes and maximum qk sizes for a 7B model. Try changing the implementation of the qk calculation.

JackChuang commented 2 months ago

Confirm some implementation-related details:

JackChuang commented 2 months ago

- Tried different qk_max values to understand how they affect inference precision.
- Collected profiling data on qk_max, wrote a script to find the range covering 99% of samples, and determined an appropriate value to use (see the sketch below).
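A hypothetical sketch of the kind of script described above (the synthetic samples and the 99% threshold are stand-ins, not the actual profiling data or script):

```python
# Pick a unified qk_max that covers ~99% of profiled qk_max samples.
import numpy as np

# Stand-in for the collected profiling data (one qk_max sample per decode step / head);
# in practice this would be loaded from the profiler's output.
samples = np.random.normal(loc=4.0, scale=1.5, size=100_000)

unified_qk_max = float(np.percentile(samples, 99))
coverage = float((samples <= unified_qk_max).mean())
print(f"unified qk_max = {unified_qk_max:.3f}, covers {coverage:.1%} of samples")
```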

JackChuang commented 2 months ago

Goal:

Confirm some implementation-related details:

JackChuang commented 1 month ago

Goal:
- [-] Compare perf results (tokens/s): vanilla dAttn vs. unified qk_max dAttn
- [O] Implement per-head shifting-window dAttention code (for ongoing patent writing)
- [-] Compare rollback counts: unified qk_max dAttn vs. shifting-window (phi) dAttn (for ongoing patent writing; see the sketch below)
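As a hypothetical illustration of what "rollback counts" could measure: with a fixed unified qk_max (φ), any partition whose true max exceeds φ beyond some tolerance would have to be recomputed with its real max, and counting those events per run gives the rollback count. Names and thresholds below are illustrative, not from the vLLM code base:

```python
# Count partitions whose real qk max exceeds the assumed unified max `phi`;
# such partitions would need a rollback (recompute with the true max).
import numpy as np

def count_rollbacks(qk_scores: np.ndarray, phi: float, tolerance: float = 0.0) -> int:
    # qk_scores: [num_partitions, partition_size] attention logits per partition
    partition_max = qk_scores.max(axis=1)
    # A partition whose real max exceeds the assumed phi would overflow or lose
    # precision under the unified-max softmax and has to be recomputed.
    return int((partition_max > phi + tolerance).sum())

scores = np.random.randn(64, 512) * 3.0
print(count_rollbacks(scores, phi=6.0))
```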

JackChuang commented 1 month ago