Lightning-AI / lightning-thunder

Distributed and Bucketing Performance Improvements #348

Status: Open · opened by parthmannan 6 months ago

parthmannan commented 6 months ago

🐛 Bug

This is a lengthy issue/post detailing my observations on our distributed and bucketing performance. Some of these are actionable items and some are just observations to be aware of.

FSDP ZeRO2 (Bucketing=None)

[Screenshot (2024-05-02): profiler timeline for FSDP ZeRO2 without bucketing]

AllGather operations during the forward pass are launched before any computation begins. This is because the Thunder trace schedules all of the AllGather operations ahead of the computation and also issues their wait operations before any compute starts. The long line of operations in stream22 is all AG kernels. This is bad for performance because the compute kernels cannot start until every AllGather (and its wait) has completed, so none of the forward-pass communication is overlapped with compute and the compute stream sits idle while the AllGathers run.
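To make the scheduling difference concrete, here is a minimal PyTorch sketch of the two schedules. This is not Thunder's actual trace: the `shards`/`full_params`/`layers` arguments are made-up placeholders, and a NCCL process group is assumed to already be initialized (e.g. via `torchrun`).

```python
import torch
import torch.distributed as dist  # assumes dist.init_process_group("nccl") has run


def forward_front_loaded(x, shards, full_params, layers):
    # What the current ZeRO2 trace effectively does: every AllGather is
    # launched *and* waited on before any compute, so the compute stream
    # sits idle until the last gather has finished.
    # Each full_params[i] is preallocated with world_size * shards[i].numel() elements.
    works = [dist.all_gather_into_tensor(full, shard, async_op=True)
             for full, shard in zip(full_params, shards)]
    for work in works:
        work.wait()
    for full, layer in zip(full_params, layers):
        x = layer(x, full)
    return x


def forward_deferred_waits(x, shards, full_params, layers):
    # Intended schedule: still launch the AllGathers early, but only wait on
    # the one a layer actually needs, so later gathers overlap with the
    # compute of earlier layers.
    works = [dist.all_gather_into_tensor(full, shard, async_op=True)
             for full, shard in zip(full_params, shards)]
    for work, full, layer in zip(works, full_params, layers):
        work.wait()
        x = layer(x, full)
    return x
```

The second version still launches all of the AllGathers up front; the difference is purely where the waits sit, which is what determines whether communication can hide behind compute.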

FSDP ZeRO2 (Bucketing=Block)

[Screenshot (2024-05-02): profiler timeline for FSDP ZeRO2 with Bucketing=Block]

Is there a better way to allocate these buffers only in the first iteration and use portions (views) of them for the computation, instead of doing a concat+copy every iteration?

For example, below is the execution timeline for TorchInductor

[Screenshot (2024-05-02): TorchInductor execution timeline for comparison]
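For reference, the per-iteration concat+copy pattern being asked about looks roughly like the sketch below. This is illustrative only (invented helper name, process group assumed initialized), not the code Thunder emits.

```python
import torch
import torch.distributed as dist


def bucket_all_gather_per_iter(shards):
    """Per-iteration concat + copy, roughly what Bucketing=Block pays today.

    shards: this rank's flat shard of each parameter in the bucket.
    Returns one contiguous full (unsharded) flat tensor per parameter.
    """
    world = dist.get_world_size()
    # Copy #1: pack the shards into a freshly allocated flat bucket.
    bucket_in = torch.cat([s.reshape(-1) for s in shards])
    bucket_out = torch.empty(world * bucket_in.numel(),
                             dtype=bucket_in.dtype, device=bucket_in.device)
    dist.all_gather_into_tensor(bucket_out, bucket_in)

    # Copy #2: the gathered bucket is rank-major (rank0's packed shards,
    # then rank1's, ...), so each full parameter has to be re-assembled from
    # strided slices, which costs another copy per parameter every iteration.
    per_rank = bucket_out.view(world, -1)
    fulls, offset = [], 0
    for s in shards:
        n = s.numel()
        fulls.append(per_rank[:, offset:offset + n].reshape(-1))  # copies when non-contiguous
        offset += n
    return fulls
```

Keeping `bucket_in`/`bucket_out` alive across iterations would at least remove the repeated allocations; removing the pack/unpack copies themselves is essentially the view/interleaving question discussed later in this thread.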

FSDP ZeRO3 (Bucketing=None)

[Screenshot (2024-05-02): profiler timeline for FSDP ZeRO3 without bucketing]

When using ZeRO3, the schedule is as expected, with the AG kernels and compute kernels interleaved. However, due to launch overheads and the small message sizes without bucketing, there are many gaps where compute is not overlapped with communication. There is probably room to improve the launch overheads (and perhaps the schedule itself), but there is no fundamental bug here; this is just an observation.
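For clarity, the interleaved ZeRO3 schedule described above corresponds to a one-step-ahead prefetch along these lines (an illustrative sketch with invented names, not Thunder's trace). With unbucketed per-parameter gathers each collective is small, so its fixed launch cost is what shows up as gaps in the timeline.

```python
import torch
import torch.distributed as dist


def forward_zero3_interleaved(x, shards, full_params, layers):
    # Gather layer 0's parameters, then for each layer: wait for its gather,
    # kick off the gather for the next layer, and run the compute. The next
    # layer's communication hides behind the current layer's compute, but
    # every small, unbucketed collective still pays its launch overhead.
    # Each full_params[i] is preallocated with world_size * shards[i].numel() elements.
    work = dist.all_gather_into_tensor(full_params[0], shards[0], async_op=True)
    for i, layer in enumerate(layers):
        work.wait()
        if i + 1 < len(layers):
            work = dist.all_gather_into_tensor(full_params[i + 1],
                                               shards[i + 1], async_op=True)
        x = layer(x, full_params[i])
        # (ZeRO3 would also free / re-shard full_params[i] here after use.)
    return x
```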

FSDP ZeRO3 (Bucketing=Block)

[Screenshots (2024-05-02): profiler timelines for FSDP ZeRO3 with Bucketing=Block]

I am writing all of this here to make it easy to compare all the options tried and to facilitate discussion. Please let me know if some of these need individual issues to track them, and I can create those.

cc @carmocca @awaelchli @crcrpar @IvanYashchuk @mruberry @t-vi @tfogal

IvanYashchuk commented 6 months ago

Thank you, Parth, for this excellent analysis and accompanying screenshots!

AllGather operations during the forward pass are launched before the computation begins.

At some point our sorting broke, and we need to restore the intended functionality. Here's the issue tracking this: https://github.com/Lightning-AI/lightning-thunder/issues/277

Is there a better way of allocating these buffers only in the first iteration and using portion of these buffers for the computation instead of concat+copy every iteration?

Yes, there's a better way: if we used a special interleaving copy, it should be possible to do fewer copies and more views. We don't have an issue tracking this yet, but creating microbenchmarks for bucketing is in our plans.
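To illustrate the "fewer copies and more views" direction: if the flat buckets are allocated once, the local shards can live as views of the input bucket (so the per-iteration concat disappears), and what remains on the gathered side is the rank-major layout that needs either strided views or a special interleaving copy. A rough sketch with invented names, not a claim about how Thunder would implement it:

```python
import torch
import torch.distributed as dist


class PersistentBucket:
    """Allocate the flat buckets once; later iterations only pay for the
    collective plus whatever interleaving unpack the consumers still need."""

    def __init__(self, shard_numels, dtype, device):
        self.world = dist.get_world_size()
        total = sum(shard_numels)
        self.bucket_in = torch.empty(total, dtype=dtype, device=device)
        self.bucket_out = torch.empty(self.world * total, dtype=dtype, device=device)
        # The local shards are views of bucket_in, so writing a shard writes
        # straight into the bucket: no concat on the hot path.
        self.shard_views, off = [], 0
        for n in shard_numels:
            self.shard_views.append(self.bucket_in[off:off + n])
            off += n
        self.shard_numels = shard_numels

    def all_gather_full_params(self):
        dist.all_gather_into_tensor(self.bucket_out, self.bucket_in)
        # Rank-major layout: view as (world, total) and slice per parameter.
        per_rank = self.bucket_out.view(self.world, -1)
        fulls, off = [], 0
        for n in self.shard_numels:
            # Strided (world, n) view of the full parameter; materialize it
            # contiguously only if the downstream kernel requires that.
            fulls.append(per_rank[:, off:off + n])
            off += n
        return fulls
```

Whether the strided `(world, n)` views are directly usable depends on what the downstream kernels accept; where they are not, materializing them contiguously is exactly the interleaving copy mentioned above.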

parthmannan commented 5 months ago

Update: ZeRO2 AllGather overlap issues were fixed in #383 and the performance is looking much better now.