jjsjann123 opened 7 months ago
Tried to look at the perf again:
With bookend optimization:
No bookend optimization:
Performance regression doesn't seem to have changed. :cry:
But as a sanity check, I looked at case shapes=(2, 16, 16),(16, 16), eq='bij, jk'
~Seems like nvfuser is generating a pointwise kernel on a segmented fusion, which doesn't look right. I suspect I did something wrong in the allocation order inference.~
Looks like something changed in einsum:
```python
def computation(t_1, t_2):
  # t_1: "cuda:0 f32[2, 16, 16]"
  # t_2: "cuda:0 f32[16, 16]"
  [t1, t3] = nvFusion0(t_1, t_2)
    # t2 = prims.transpose(t_1, (2, 1, 0))  # t2: "cuda:0 f32[16, 16, 2]"
    # t3 = prims.reshape(t2, (16, 32))  # t3: "cuda:0 f32[16, 32]"
    # t1 = prims.transpose(t_2, (1, 0))  # t1: "cuda:0 f32[16, 16]"
  del t_1, t_2
  t4 = torch.matmul(t1, t3)  # t4: "cuda:0 f32[16, 32]"
    # t4 = ltorch.matmul(t1, t3)  # t4: "cuda:0 f32[16, 32]"
    # t4 = prims.matmul(t1, t3)  # t4: "cuda:0 f32[16, 32]"
  del t1, t3
  [t6] = nvFusion1(t4)
    # t5 = prims.reshape(t4, (16, 16, 2))  # t5: "cuda:0 f32[16, 16, 2]"
    # t6 = prims.transpose(t5, (2, 1, 0))  # t6: "cuda:0 f32[2, 16, 16]"
  del t4
  return t6
```
So the transpose and reshape now end up being an actual memory copy.
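For reference, a minimal sketch of how a trace like the one above can be produced and inspected; it assumes the standard `thunder.jit` / `thunder.last_traces` entry points rather than the actual repro script:

```python
import torch
import thunder

def fn(a, b):
    return torch.einsum('bij, jk', a, b)

a = torch.randn(2, 16, 16, device='cuda', dtype=torch.float32)
b = torch.randn(16, 16, device='cuda', dtype=torch.float32)

jfn = thunder.jit(fn)
out = jfn(a, b)

# The final trace is the execution trace shown above
# (nvFusion0 -> torch.matmul -> nvFusion1).
print(thunder.last_traces(jfn)[-1])
```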
Can we measure whether this is a wallclock-time (CPU overhead) or a kernel-time regression? An Nsight profile would show the comparison.
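One cheap way to separate the two before pulling up Nsight is to compare CPU wallclock against CUDA-event time; a minimal sketch (the function name, iteration counts, and warm-up are placeholders, not part of the repro):

```python
import time
import torch

def measure(fn, *args, iters=100, warmup=10):
    # Warm up so compilation and caching don't pollute the measurement.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()

    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)

    t0 = time.perf_counter()
    start_evt.record()
    for _ in range(iters):
        fn(*args)
    end_evt.record()
    torch.cuda.synchronize()
    t1 = time.perf_counter()

    wallclock_ms = (t1 - t0) * 1e3 / iters            # includes CPU/host overhead
    gpu_ms = start_evt.elapsed_time(end_evt) / iters  # time spanned on the GPU stream
    print(f"wallclock {wallclock_ms:.3f} ms/iter, gpu {gpu_ms:.3f} ms/iter")
```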
The example above gives the trace here. Ignoring thunder overhead, let's focus only on the impact of disabling the bookend optimization in the nvfuser executor.
It's bloated with some extra latency from the nvtx ranges, but the code surrounding the nvfuser execute call doesn't seem to add much overhead (I added two more markers, on `get_fd` and `execute`, for the section in https://github.com/Lightning-AI/lightning-thunder/blob/bd18f0a1feeb0432a10bb17f45858c28cbd3ba29/thunder/executors/nvfuserex_impl.py#L401-L412).
The extra CPU latency (when bookend is disabled) is added by nvfuser. In terms of actual kernel time, the script runs tiny kernels and I'm not seeing any difference there at all.
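(For context, the markers are plain nvtx ranges; below is a small illustrative sketch with a stand-in workload instead of the actual `get_fd`/`fd.execute` calls from `nvfuserex_impl.py`:)

```python
import torch

a = torch.randn(2, 16, 16, device="cuda")
b = torch.randn(16, 16, device="cuda")

# Stand-in for the fusion-definition lookup that the real "get_fd" range wraps.
torch.cuda.nvtx.range_push("get_fd")
# ... fusion-definition cache lookup happens here in the real code ...
torch.cuda.nvtx.range_pop()

# Stand-in for the fd.execute(...) call that the real "execute" range wraps.
torch.cuda.nvtx.range_push("execute")
out = torch.einsum("bij, jk", a, b)
torch.cuda.nvtx.range_pop()

torch.cuda.synchronize()
```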
Two follow-up items:
1. Even though there's only one element-wise kernel, we do have two nvFusion segments, one of which is a no-op. (I'm wondering if there's any runtime cost we can trim for those.)
2. The einsum script here is just a toy example, which is part of the reason we are getting hit so hard by CPU latency. I'll try this with nano-gpt to see whether a real benchmark shows any observable latency issue.
Note to self: @wujingyue has some benchmark results in #731, which is what I'm trying to evaluate in item 2 above.
🐛 Bug
Repro script from @nikitaved
With bookend optimization
Without bookend optimization
shapes=(2, 16, 16),(16, 16), eq='bij, jk'
This is the case where it regressed the most. The regression here comes from nvfuser host overhead; the bookend optimization had removed everything that went through nvfuser.
shapes=(2, 8, 16, 16),(2, 16, 8, 16), eq='bijk,bklj->bli'
Regression mostly comes from host overhead on this one.
Note that this one doesn't look as bad, because the trace actually does require a memory copy for `t_2->t1->t3`, which is also observed with bookend enabled. The nvfuser-generated kernel has slightly better perf than the eager kernel, while the other branch is handled as a no-op (alias support in codegen).

shapes=(2, 8, 16, 1),(2, 16, 8, 16), eq='bijk,bklj->bli' & shapes=(2, 8, 16, 16),(2, 16, 8, 1), eq='bijk,bklj->bli'
No significant regression observed. These are the positive cases: nvfuser has absorbed the transpose at the end of a reduction kernel, but those are handled via aliases.
shapes=(2, 16, 1),(16, 16), eq='bij, jk' & shapes=(2, 16, 16),(1, 16), eq='bij, jk'
No regression observed: since the entire fusion runs through nvfuser, the bookend optimization doesn't change the trace. (einsum is broken into sum + pointwise mul.)
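For completeness, a hedged sketch of the kind of sweep these cases amount to. The (shapes, equation) pairs are the ones listed above; the timing loop, iteration count, and everything else here are my own placeholders, not @nikitaved's actual repro script:

```python
import torch
import thunder

# The (shapes, equation) pairs discussed above.
CASES = [
    (((2, 16, 16), (16, 16)), "bij, jk"),
    (((2, 8, 16, 16), (2, 16, 8, 16)), "bijk,bklj->bli"),
    (((2, 8, 16, 1), (2, 16, 8, 16)), "bijk,bklj->bli"),
    (((2, 8, 16, 16), (2, 16, 8, 1)), "bijk,bklj->bli"),
    (((2, 16, 1), (16, 16)), "bij, jk"),
    (((2, 16, 16), (1, 16)), "bij, jk"),
]

for (shape_a, shape_b), eq in CASES:
    def fn(a, b, eq=eq):
        return torch.einsum(eq, a, b)

    a = torch.randn(shape_a, device="cuda")
    b = torch.randn(shape_b, device="cuda")
    jfn = thunder.jit(fn)
    jfn(a, b)  # warm-up / compile
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(100):
        jfn(a, b)
    end.record()
    torch.cuda.synchronize()
    print(f"{eq} {shape_a} x {shape_b}: {start.elapsed_time(end) / 100:.3f} ms/iter")
```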