RulinShao / LightSeq

Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers

The test results of “fastckpt” are not as expected #10

Open LzhinFdu opened 3 months ago

LzhinFdu commented 3 months ago

https://github.com/RulinShao/FastCkpt/issues/2#issue-2287002399 After running the test script provided by fastckpt, I found that, compared to flash-attn, fastckpt is slower and its results differ. I am using: transformers==4.40.1, flash-attn==2.5.6. What could be the cause?
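For reference, a minimal way to check the speed claim could look like the sketch below. This is not the repo's actual benchmark; `fn` is a placeholder for a forward/backward call of either implementation, and the warm-up/repeat structure is an assumption about how such a comparison would typically be set up.

```python
# Hypothetical timing harness for comparing two implementations
# (e.g. a fastckpt-based step vs. a flash-attn step); `fn` is a
# placeholder callable, not a name from the FastCkpt repository.
import time

def bench(fn, warmup=3, repeats=10):
    """Average wall-clock seconds per call, after a short warm-up.

    The warm-up runs absorb one-time costs (kernel compilation,
    allocator growth) so the averaged timing reflects steady state.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats
```

On GPU, one would additionally need to synchronize the device (e.g. `torch.cuda.synchronize()`) before reading the clock, since kernel launches are asynchronous.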

LzhinFdu commented 3 months ago

After setting repeat=True with different sequence_length values, I got the following results. Are these results as expected? (When seq_len=1024, there is an obvious diff in the grad values; as seq_len increases, fastckpt does not show a clear speed advantage.) [images: timing and diff results for several sequence lengths] (Sorry, 'mean' here actually refers to the maximum element value of the diff.)
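The "maximum element value of the diff" mentioned above can be computed as follows. This is only an illustrative sketch of the metric; the names `grad_fastckpt` and `grad_flash` are hypothetical stand-ins for the two gradient tensors being compared, not identifiers from the test script.

```python
# Hypothetical sketch of the comparison metric: maximum element-wise
# absolute difference between two gradient arrays. Array names are
# illustrative, not taken from the FastCkpt test script.
import numpy as np

def max_abs_diff(a, b):
    """Largest element-wise absolute difference |a - b|."""
    a, b = np.asarray(a, dtype=np.float64), np.asarray(b, dtype=np.float64)
    return float(np.max(np.abs(a - b)))

def max_rel_diff(a, b, eps=1e-8):
    """Largest element-wise relative difference, guarded against zeros."""
    a, b = np.asarray(a, dtype=np.float64), np.asarray(b, dtype=np.float64)
    return float(np.max(np.abs(a - b) / (np.abs(b) + eps)))
```

A relative metric is often more informative than the absolute maximum here, since with half-precision attention kernels an absolute diff that looks "obvious" may still be within the expected rounding error for the given magnitudes.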