RulinShao / LightSeq

Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers

how to use your code in other models? #5

Closed zxgx closed 8 months ago

zxgx commented 8 months ago

Hi,

Your code seems pretty neat. I'd like to try your solution with other transformer models. Can I directly pass another Hugging Face model name as the ModelArguments?

DachengLi1 commented 8 months ago

Hi @zxgx, thank you very much! The attention part should be generic to other autoregressive models: https://github.com/RulinShao/LightSeq/blob/main/lightseq/lightseq_async_attn.py#L436. You only need to monkey-patch the attention. We haven't tested it with other models, but let us know if you run into any issues!

RulinShao commented 8 months ago

PS: we wrote the LightSeq attention API with an interface similar to FlashAttention's, so one straightforward route is to find where the code patches in FlashAttention and swap FA for the LightSeq attention that @DachengLi1 linked to.
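
For illustration, here is a minimal monkey-patch sketch of that idea, not taken from the LightSeq repo: the function name `lightseq_attn`, the Llama-2 checkpoint path, and the exact forward signature are placeholders, and a PyTorch SDPA call stands in where the LightSeq kernel from `lightseq/lightseq_async_attn.py` would actually be called. It is written against the LLaMA attention module in transformers around v4.34 (attribute names and the rotary-embedding call differ across versions) and assumes full-sequence training, so the KV cache and attention mask are ignored.

```python
import torch
import torch.nn.functional as F
import transformers
from transformers.models.llama.modeling_llama import (
    LlamaAttention,
    apply_rotary_pos_emb,
    repeat_kv,
)


def patched_llama_attn_forward(
    self,
    hidden_states,
    attention_mask=None,
    position_ids=None,
    past_key_value=None,
    output_attentions=False,
    use_cache=False,
    **kwargs,
):
    bsz, q_len, _ = hidden_states.size()

    # Standard LLaMA projections, unchanged from the Hugging Face implementation.
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
    key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
    value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)

    cos, sin = self.rotary_emb(value_states, seq_len=q_len)
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)

    # Expand grouped KV heads so every query head has a matching KV head.
    key_states = repeat_kv(key_states, self.num_key_value_groups)
    value_states = repeat_kv(value_states, self.num_key_value_groups)

    # This is the line to swap: where a FlashAttention patch would call
    # flash_attn_func, call the LightSeq attention instead, e.g.
    #   attn_output = lightseq_attn(query_states, key_states, value_states, causal=True)
    # (placeholder name; the real API may also expect a [bsz, seqlen, heads, head_dim] layout).
    attn_output = F.scaled_dot_product_attention(
        query_states, key_states, value_states, is_causal=True
    )  # stand-in so the sketch runs without LightSeq installed

    attn_output = attn_output.transpose(1, 2).reshape(bsz, q_len, self.hidden_size)
    return self.o_proj(attn_output), None, past_key_value


# Apply the patch before instantiating the model so every decoder layer picks it up.
LlamaAttention.forward = patched_llama_attn_forward
model = transformers.AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
```

The design follows the usual FlashAttention monkey-patch pattern: reassign the attention class's `forward` before the model is built, keep the projections and rotary embedding as in the original module, and change only the kernel call in the middle.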

zxgx commented 8 months ago

Great, thank you for your reply! 😄