feifeibear / long-context-attention

USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference

GPU memory usage question #56

Closed · realgump closed this issue 5 months ago

realgump commented 5 months ago
[screenshot: the memory-cost table from the paper]

Hello, in the paper's table, why is the activation memory for zero-1 and dp listed as A/N? If that is the case, sp + zero-1 seems to have no advantage over zero-1.

feifeibear commented 5 months ago

ZeRO-DP does not communicate activations; it only communicates weights and gradients.
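
To make the distinction concrete, here is a rough back-of-the-envelope sketch (not code from this repository) contrasting what the two schemes move over the network each step. All symbols and numbers (`P`, `b`, `s`, `h`, `L`, `N`, the example model size, and both helper functions) are illustrative assumptions, not values from the paper:

```python
# Back-of-the-envelope per-step communication volume (illustrative only).
# Assumed symbols:
#   P       - number of model parameters
#   b, s, h - micro-batch size, per-rank sequence length, hidden size
#   L       - number of transformer layers
#   N       - number of GPUs in the parallel group
#   bytes_per_elem - 2 for fp16/bf16

def zero1_dp_comm_bytes(P: int, bytes_per_elem: int = 2) -> int:
    """ZeRO-1/DP communicates weights and grads, never activations:
    roughly one gradient reduce-scatter plus one parameter all-gather
    per step, each moving O(P) elements."""
    return 2 * P * bytes_per_elem

def ring_sp_comm_bytes(b: int, s: int, h: int, L: int, N: int,
                       bytes_per_elem: int = 2) -> int:
    """Ring-style sequence parallel attention passes K and V activation
    blocks around the ring: each of the N-1 ring steps sends a (b, s, h)
    K block and a V block, repeated for every layer."""
    return L * (N - 1) * 2 * b * s * h * bytes_per_elem

if __name__ == "__main__":
    # Example numbers: a 7B-parameter model, 32 layers, hidden size 4096,
    # 8 GPUs, 32K tokens per rank.
    print("ZeRO-1 comm/step: %.1f GB"
          % (zero1_dp_comm_bytes(7_000_000_000) / 2**30))
    print("Ring-SP comm/step: %.1f GB"
          % (ring_sp_comm_bytes(1, 32768, 4096, 32, 8) / 2**30))
```

Under these assumptions, ZeRO-1's traffic scales with the parameter count while sequence parallelism's traffic scales with sequence length, consistent with the reply above: the two techniques shard and communicate different things, so their costs are not directly interchangeable.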