Closed: jjsjann123 closed this issue 1 month ago
This issue blocks https://github.com/Lightning-AI/lightning-thunder/pull/731 and is one of the reasons for the large performance regression from qkv_split_rope.
Looking at the generated kernel, we are missing vectorized loads on the non-padded tensor inputs. The launch params are also slightly different, but that could just be a side effect of the missing vectorization.
So I think the next step here is to take a quick look at the vectorization analysis.
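As a rough mental model (not nvFuser's actual analysis), the vectorization width for a contiguous input is bounded by the vector-load size, the base pointer's alignment, and whether the innermost contiguous extent divides evenly — a padded or sliced input can fail either of the last two conditions:

```python
def max_vectorize_width(ptr: int, extent: int, elem_bytes: int, max_vec_bytes: int = 16) -> int:
    """Largest power-of-two element count v such that v*elem_bytes <= max_vec_bytes,
    the base pointer is (v*elem_bytes)-aligned, and the innermost contiguous
    extent is divisible by v. Simplified model for illustration only."""
    v = max_vec_bytes // elem_bytes
    while v > 1 and (ptr % (v * elem_bytes) != 0 or extent % v != 0):
        v //= 2
    return v

# fp32 tensor, 16-byte aligned, extent divisible by 4 -> full 128-bit loads
assert max_vectorize_width(0, 1024, 4) == 4
# a 4-byte offset (e.g. from a slice) kills vectorization entirely
assert max_vectorize_width(4, 1024, 4) == 1
```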
An orthogonal topic is supporting vectorization on PadOp, which will be required by the presegmentation pass where we will be aggressively pushing out PadOp to avoid kernel segmentation.
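For context, the PadOp pattern here is a zero-pad on the last dimension, roughly numpy's `np.pad` (an illustrative sketch of the op's semantics, not the fuser IR):

```python
import numpy as np

q = np.arange(6, dtype=np.float32).reshape(2, 3)
padded = np.pad(q, ((0, 0), (0, 1)))   # pad last dim 3 -> 4 with zeros
assert padded.shape == (2, 4)
assert padded[0, 3] == 0.0             # out-of-bounds reads become the fill value
```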
How embarrassing: the large regression might actually be coming from allocation order inference.
https://github.com/NVIDIA/Fuser/pull/2630 seems to resolve the larger rope regression locally for me. I'll rerun the benchmark tomorrow. :crossed_fingers:
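For reference, "allocation order" here means the stride order of an intermediate, independent of its logical shape; inferring the wrong one forces strided (non-coalesced) access. A numpy analogue of the same logical tensor under two allocation orders:

```python
import numpy as np

x = np.zeros((128, 256), dtype=np.float32)   # row-major allocation
xf = np.asfortranarray(x)                    # same logical tensor, column-major allocation
assert x.shape == xf.shape                   # logical view is identical
assert x.strides == (256 * 4, 4)             # innermost dim is contiguous
assert xf.strides == (4, 128 * 4)            # outermost dim is contiguous instead
```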
Following up with the benchmark to check #2630's performance impact: https://gist.github.com/jjsjann123/87345938c0dd0c12b83c2b8f4c42fa9c
Looks like it helped with the forward part at least, but there is still quite a bit of regression remaining on the backward pass.
Looks like we are generating lots of pointwise kernels in the backward rope. I suspect those are just alias analysis not aggressively pushing things out. Will try my luck with #2608.
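The intuition for why those pointwise kernels ought to disappear: permute/reshape-style ops are pure metadata changes, so alias analysis can turn them into views instead of launching kernels. A numpy analogue:

```python
import numpy as np

x = np.arange(12, dtype=np.float32).reshape(3, 4)
p = x.transpose(1, 0)                 # permute: no data movement
assert np.shares_memory(p, x)         # an alias candidate, no kernel needed
assert not p.flags['C_CONTIGUOUS']    # only the strides changed
```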
:cry: the backward regression is more than just the alias stuff. I'm seeing kernel performance issues as well, even with #2608 and #2630.
The backward fusion pattern looks very similar, except for some slice at the beginning and the permute at the end. https://gist.github.com/jjsjann123/87345938c0dd0c12b83c2b8f4c42fa9c?permalink_comment_id=5127263#gistcomment-5127263
I suspect it's the permute that's giving us issues with vectorization again. I'll confirm that.
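Here's how a trailing permute can kill vectorization: after the permute, the innermost dimension no longer has unit stride, so contiguous 128-bit loads/stores are off the table. Numpy strides show the effect (illustrative shapes, not the actual rope tensors):

```python
import numpy as np

x = np.zeros((8, 16, 64), dtype=np.float32)
assert x.strides[-1] == x.itemsize        # unit innermost stride: vectorizable
p = x.transpose(0, 2, 1)                  # permute the last two dims
assert p.strides[-1] == 64 * x.itemsize   # innermost stride is now 64 elements
```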
Observed that even after alias analysis removed the leading meta operations from the generated kernel, we still see a performance regression.
There are some slight segmentation differences between the two, but the big kernel is almost identical.
The repro scripts are below.
vs
The second script is segmented as below, so if we discard the leading no-op segment, the computation looks the same as in the first script. Yet the kernel times we get from these two programs are very different.
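To quantify "very different", a crude host-side harness like the one below works for A/B comparisons (the real numbers above come from kernel-level profiling; for GPU work, trust CUDA events or the nvFuser profiler instead, and note `fn` here is any hypothetical callable wrapping one of the repro scripts):

```python
import time

def avg_time(fn, *args, warmup=3, iters=20):
    """Average wall-clock seconds per call after warmup. Coarse: GPU work
    must be synchronized inside fn for this to be meaningful."""
    for _ in range(warmup):
        fn(*args)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters
```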