MayDomine / Burst-Attention

Distributed IO-aware Attention algorithm
Apache License 2.0

Can burst-attention be used in Model Inference? #3

Open gitcloneman opened 5 months ago

gitcloneman commented 5 months ago

Thanks for the great work and the promising performance in model training. Are you considering applying and simplifying burst-attention for model inference? What gaps are there compared to ring attention with FSDP?

MayDomine commented 5 months ago

For the first question: yes, burst-attention can be used in the pre-fill stage of inference. For the decoding stage, TP (tensor parallelism) should be a better choice. The switch between Burst and TP can be done easily; we are working on this now.
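To illustrate why TP fits decoding, here is a minimal single-process numpy sketch (not the BurstAttention implementation) of a tensor-parallel decode step: the attention heads are sharded across "ranks", each rank attends over its own head slice of the KV cache, and the per-rank outputs are concatenated (an all-gather in a real multi-GPU setup). The function name and shapes are hypothetical, chosen only for this example.

```python
import numpy as np

def tp_decode_step(q, K_cache, V_cache, world_size):
    """Sketch of tensor-parallel decoding for one new token.

    q:        (n_heads, d)       query for the current token
    K_cache:  (n_heads, seq, d)  cached keys
    V_cache:  (n_heads, seq, d)  cached values
    """
    n_heads, d = q.shape
    per_rank = n_heads // world_size  # heads are sharded evenly
    outs = []
    for r in range(world_size):  # each loop iteration stands in for one rank
        h0, h1 = r * per_rank, (r + 1) * per_rank
        qh, Kh, Vh = q[h0:h1], K_cache[h0:h1], V_cache[h0:h1]
        # per-head scaled dot-product attention over the local head slice
        s = np.einsum('hd,hsd->hs', qh, Kh) / np.sqrt(d)
        p = np.exp(s - s.max(axis=-1, keepdims=True))
        p /= p.sum(axis=-1, keepdims=True)
        outs.append(np.einsum('hs,hsd->hd', p, Vh))
    # concatenating the head slices corresponds to an all-gather across ranks
    return np.concatenate(outs)
```

Because each rank only needs its own head slice of the KV cache, decoding requires no exchange of keys/values between ranks, which is why TP is attractive once the sequence is already cached.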

For the second one: the idea is quite similar to Ring-Flash with FSDP, only with different implementation details and different optimization tricks.
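The shared idea behind both approaches can be sketched in a single-process numpy simulation (an assumption-laden toy, not the actual BurstAttention or Ring-Flash kernels): the sequence is split into chunks, one per "rank"; each rank keeps its query chunk and receives the KV chunks one step at a time as if passed around a ring, combining partial results with the online-softmax rescaling trick so the final output matches full attention exactly.

```python
import numpy as np

def ring_attention(Q, K, V, world_size):
    """Toy single-process simulation of ring-style attention.

    Q, K, V: (seq, d); seq must be divisible by world_size.
    """
    qs = np.split(Q, world_size)
    ks = np.split(K, world_size)
    vs = np.split(V, world_size)
    outs = []
    for r in range(world_size):        # each r stands in for one rank
        q = qs[r]
        m = np.full((q.shape[0], 1), -np.inf)  # running row max
        l = np.zeros((q.shape[0], 1))          # running softmax denominator
        acc = np.zeros_like(q)                 # running weighted-value sum
        for step in range(world_size):
            src = (r + step) % world_size      # KV chunk "arriving" on the ring
            s = q @ ks[src].T / np.sqrt(q.shape[-1])
            m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
            scale = np.exp(m - m_new)          # rescale old partial results
            p = np.exp(s - m_new)
            l = l * scale + p.sum(axis=-1, keepdims=True)
            acc = acc * scale + p @ vs[src]
            m = m_new
        outs.append(acc / l)
    return np.concatenate(outs)
```

In a real distributed setting the inner loop's chunk lookup becomes a send/recv with the neighboring rank, and the differences between implementations come down to how that communication is overlapped with compute.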

gitcloneman commented 5 months ago

Looking forward to a preview of the easy switch between Burst and TP that you mentioned you are working on.