Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source to source compiler for PyTorch. It enables using different hardware executors at once; across one or thousands of GPUs.

einsum performance regression after disabling bookend optimization #193

Open jjsjann123 opened 4 months ago

jjsjann123 commented 4 months ago

🐛 Bug

Repro script from @nikitaved

import itertools
import torch
from torch.utils.benchmark import Timer, Compare
import thunder

#dtypes = [torch.bfloat16, torch.float, torch.double]
# NOTE: we know that it's host latency that's mostly triggering this issue, so not sweeping across dtypes
dtypes = [torch.float]

def f(eq, *inputs):
    return torch.einsum(eq, *inputs)

f_thunder = thunder.jit(f, nv_enable_bookend=False)
f_torchcompile = torch.compile(f)

results = []

# Largest dim
#for n in (16, 128, 512):
# NOTE: we know that it's host latency that's mostly triggering this issue, so not sweeping across `n`
for n in (16,):
    cases = (
        # single contraction dim
        # no broadcasting
        ([(2, n, n), (n, n)], "bij, jk"),
        # left broadcasts
        ([(2, n, 1), (n, n)], "bij, jk"),
        # right broadcasts
        ([(2, n, n), (1, n)], "bij, jk"),
        # multidim contraction
        # no broadcasting
        ([(2, 8, n, n), (2, n, 8, n)], "bijk,bklj->bli"),
        # left broadcasts
        ([(2, 8, n, 1), (2, n, 8, n)], "bijk,bklj->bli"),
        # right broadcasts
        ([(2, 8, n, n), (2, n, 8, 1)], "bijk,bklj->bli"),
    )
    # NOTE: only run thunder since we are only investigating bookend
    #for func, func_name in ((f, "f"), (f_thunder, "f_thunder"), (f_torchcompile, "f_torchcompile")):
    for func, func_name in ((f_thunder, "f_thunder"),):
        for dtype in dtypes:
            for shapes, eq in cases:
                inputs = [torch.rand(*shape, dtype=dtype, device='cuda') / (10 ** len(shape)) for shape in shapes]

                # warm-up
                func(eq, *inputs)

                shape_dscr = ','.join(str(shape) for shape in shapes)
                timer = Timer(
                    f"{func_name}(eq, *inputs)",
                    globals=globals(),
                    label=f"einsum, {n=}",
                    description=f"{func_name}",
                    sub_label=f"{dtype}, shapes={shape_dscr}, {eq=}",
                    #num_threads=torch.get_num_threads()
                )
                results.append(timer.blocked_autorange())

compare = Compare(results)
compare.trim_significant_figures()
compare.print()

With bookend optimization

[-------------------------------------- einsum, n=16 --------------------------------------]
                                                                                |  f_thunder
1 threads: ---------------------------------------------------------------------------------
      torch.float32, shapes=(2, 16, 16),(16, 16), eq='bij, jk'                  |      65
      torch.float32, shapes=(2, 16, 1),(16, 16), eq='bij, jk'                   |      67
      torch.float32, shapes=(2, 16, 16),(1, 16), eq='bij, jk'                   |      84
      torch.float32, shapes=(2, 8, 16, 16),(2, 16, 8, 16), eq='bijk,bklj->bli'  |      68
      torch.float32, shapes=(2, 8, 16, 1),(2, 16, 8, 16), eq='bijk,bklj->bli'   |     115
      torch.float32, shapes=(2, 8, 16, 16),(2, 16, 8, 1), eq='bijk,bklj->bli'   |     119

Times are in microseconds (us).

Without bookend optimization

[-------------------------------------- einsum, n=16 --------------------------------------]
                                                                                |  f_thunder
1 threads: ---------------------------------------------------------------------------------
      torch.float32, shapes=(2, 16, 16),(16, 16), eq='bij, jk'                  |     130
      torch.float32, shapes=(2, 16, 1),(16, 16), eq='bij, jk'                   |      67
      torch.float32, shapes=(2, 16, 16),(1, 16), eq='bij, jk'                   |      84
      torch.float32, shapes=(2, 8, 16, 16),(2, 16, 8, 16), eq='bijk,bklj->bli'  |     112
      torch.float32, shapes=(2, 8, 16, 1),(2, 16, 8, 16), eq='bijk,bklj->bli'   |     112
      torch.float32, shapes=(2, 8, 16, 16),(2, 16, 8, 1), eq='bijk,bklj->bli'   |     112

Times are in microseconds (us).

shapes=(2, 16, 16),(16, 16), eq='bij, jk'

This is the case that regressed the most. The regression comes from nvFuser host overhead: the bookend optimization had removed everything that would otherwise go through nvFuser.
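For reference, the execution trace below can be printed directly from the jitted callable. A minimal sketch, assuming the public thunder.last_traces API and the inputs from the repro above:

import thunder

# Run once so a trace is cached, then print the trace that actually executes
# (the last entry of the trace list).
f_thunder(eq, *inputs)
print(thunder.last_traces(f_thunder)[-1])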

def computation(t_1, t_2):
  # t_1: "cuda:0 f32[2, 16, 16]"
  # t_2: "cuda:0 f32[16, 16]"
  [t1, t3] = nvFusion0(t_1, t_2)
    # t2 = prims.transpose(t_1, (2, 0, 1))  # t2: "cuda:0 f32[16, 2, 16]"
    # t3 = prims.reshape(t2, (16, 32))  # t3: "cuda:0 f32[16, 32]"
    # t1 = prims.transpose(t_2, (1, 0))  # t1: "cuda:0 f32[16, 16]"
  del t_1, t_2
  t4 = torch.matmul(t1, t3)  # t4: "cuda:0 f32[16, 32]"
    # t4 = ltorch.matmul(t1, t3)  # t4: "cuda:0 f32[16, 32]"
      # t4 = prims.matmul(t1, t3)  # t4: "cuda:0 f32[16, 32]"
  del t1, t3
  [t6] = nvFusion1(t4)
    # t5 = prims.reshape(t4, (16, 2, 16))  # t5: "cuda:0 f32[16, 2, 16]"
    # t6 = prims.transpose(t5, (1, 2, 0))  # t6: "cuda:0 f32[2, 16, 16]"
  del t4
  return t6

shapes=(2, 8, 16, 16),(2, 16, 8, 16), eq='bijk,bklj->bli'

Regression mostly comes from host overhead on this one.

def computation(t_1, t_2):
  # t_1: "cuda:0 f32[2, 8, 16, 16]"
  # t_2: "cuda:0 f32[2, 16, 8, 16]"
  [t3, t4] = nvFusion0(t_1, t_2)
    # t2 = prims.transpose(t_1, (0, 2, 3, 1))  # t2: "cuda:0 f32[2, 16, 16, 8]"
    # t4 = prims.reshape(t2, (2, 256, 8))  # t4: "cuda:0 f32[2, 256, 8]"
    # t1 = prims.transpose(t_2, (0, 2, 3, 1))  # t1: "cuda:0 f32[2, 8, 16, 16]"
    # t3 = prims.reshape(t1, (2, 8, 256))  # t3: "cuda:0 f32[2, 8, 256]"
  del t_1, t_2
  t5 = torch.matmul(t3, t4)  # t5: "cuda:0 f32[2, 8, 8]"
    # t5 = ltorch.matmul(t3, t4)  # t5: "cuda:0 f32[2, 8, 8]"
      # t5 = prims.matmul(t3, t4)  # t5: "cuda:0 f32[2, 8, 8]"
  del t3, t4
  return t5

Note that this doesn't look as bad, because this trace actually does require a memory copy for t_2->t1->t3, which is also observed with bookend enabled. The nvFuser-generated kernel has slightly better perf than the eager kernel, while the other branch is handled as a no-op (alias support in codegen).
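The copy is inherent to the layout change rather than something nvFuser adds; a quick check in plain PyTorch with the shapes from this case (a sketch, not the Thunder trace itself):

import torch

t_2 = torch.rand(2, 16, 8, 16, device='cuda')
t1 = t_2.permute(0, 2, 3, 1)             # (2, 8, 16, 16), non-contiguous
t3 = t1.reshape(2, 8, 256)               # merging the last two dims cannot be a view here
print(t3.data_ptr() == t_2.data_ptr())   # False: reshape had to materialize a copy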

Segmented_Fusion Dump: -- fusion segments:
Segmented_Fusion{
groups:
g{2, 4, 6}

g{0, 7, 9}

edges:

group details:
g{(pointwise)
inputs:
T1_g[ iS4{i5}, iS5{i6}, iS6{i7}, iS7{i8} ] float
outputs:
T6_g[ iS30{i5}, iS31{i7}, iS36{( i8 * i6 )}rf ] float

T4_g[ iS19{i5}, iS21{i7}, iS22{i8}, iS20{i6} ]
   = Set.Permute( T1_g[ iS4{i5}, iS5{i6}, iS6{i7}, iS7{i8} ], cache_op=Streaming )
(2)
T6_g[ iS30{i5}, iS31{i7}, iS36{( i8 * i6 )}rf ] = view( T4_g[ iS19{i5}, iS21{i7}, iS22{i8}, iS20{i6} ] )
(6)
i41 = i8 * i6;
(4)
}

g{(no_op)
inputs:
T0_g[ iS0{i0}, iS1{i1}, iS2{i2}, iS3{i3} ] float
outputs:
T7_g[ iS37{i0}, iS43{( i2 * i3 )}rf, iS40{i1} ] float

T2_g[ iS8{i0}, iS10{i2}, iS11{i3}, iS9{i1} ]
   = Set.Permute( T0_g[ iS0{i0}, iS1{i1}, iS2{i2}, iS3{i3} ], cache_op=Streaming )
(0)
T7_g[ iS37{i0}, iS43{( i2 * i3 )}rf, iS40{i1} ] = view( T2_g[ iS8{i0}, iS10{i2}, iS11{i3}, iS9{i1} ] )
(9)
i44 = i2 * i3;
(7)
}

} //Segmented_Fusion

shapes=(2, 8, 16, 1),(2, 16, 8, 16), eq='bijk,bklj->bli' & shapes=(2, 8, 16, 16),(2, 16, 8, 1), eq='bijk,bklj->bli'

No significant regression observed. These are the positive cases: nvFuser has absorbed the transpose at the end of a reduction kernel, whereas those were previously handled via aliases.

shapes=(2, 16, 1),(16, 16), eq='bij, jk' & shapes=(2, 16, 16),(1, 16), eq='bij, jk'

No regression observed: since the entire fusion runs through nvFuser, the bookend optimization doesn't change the trace (einsum is broken into a sum plus a pointwise mul).
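As a rough illustration (not necessarily the exact decomposition Thunder emits), the broadcasting case reduces to a broadcasted pointwise multiply followed by a sum over the contraction index, which is why the whole thing can live in a single nvFuser region:

import torch

a = torch.rand(2, 16, 1, device='cuda')    # 'bij' with the j dim broadcasting
b = torch.rand(16, 16, device='cuda')      # 'jk'

out = torch.einsum('bij, jk', a, b)
# broadcasted multiply, then reduce over the shared index j
ref = (a.unsqueeze(-1) * b).sum(dim=-2)
print(torch.allclose(out, ref))            # True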

jjsjann123 commented 2 months ago

Tried to look at the perf again.

With bookend optimization: [benchmark results screenshot]

No bookend optimization: [benchmark results screenshot]

Performance regression doesn't seem to have changed. :cry:

But as a sanity check, I looked at case shapes=(2, 16, 16),(16, 16), eq='bij, jk' ~seems like nvfuser is generating a pw kernel on a segmented fusion and that doesn't look right. I'm suspecting there's something I did wrong for allocation order inference.~

Looks like something changed in einsum:

def computation(t_1, t_2):
  # t_1: "cuda:0 f32[2, 16, 16]"
  # t_2: "cuda:0 f32[16, 16]"
  [t1, t3] = nvFusion0(t_1, t_2)
    # t2 = prims.transpose(t_1, (2, 1, 0))  # t2: "cuda:0 f32[16, 16, 2]"
    # t3 = prims.reshape(t2, (16, 32))  # t3: "cuda:0 f32[16, 32]"
    # t1 = prims.transpose(t_2, (1, 0))  # t1: "cuda:0 f32[16, 16]"
  del t_1, t_2
  t4 = torch.matmul(t1, t3)  # t4: "cuda:0 f32[16, 32]"
    # t4 = ltorch.matmul(t1, t3)  # t4: "cuda:0 f32[16, 32]"
      # t4 = prims.matmul(t1, t3)  # t4: "cuda:0 f32[16, 32]"
  del t1, t3
  [t6] = nvFusion1(t4)
    # t5 = prims.reshape(t4, (16, 16, 2))  # t5: "cuda:0 f32[16, 16, 2]"
    # t6 = prims.transpose(t5, (2, 1, 0))  # t6: "cuda:0 f32[2, 16, 16]"
  del t4
  return t6

So the transpose and reshape now end up being a memory copy.
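The difference is visible in plain PyTorch: with the previous permutation order the reshape can still be a view, while the new order forces a materialized copy. A quick check (not the Thunder trace itself):

import torch

t_1 = torch.rand(2, 16, 16, device='cuda')

# previous trace: transpose to (16, 2, 16); the trailing (2, 16) dims stay
# mergeable, so reshape(16, 32) is still a view of t_1
old = t_1.permute(2, 0, 1).reshape(16, 32)
print(old.data_ptr() == t_1.data_ptr())    # True: no copy

# new trace: transpose to (16, 16, 2); the trailing (16, 2) dims are no longer
# mergeable, so reshape(16, 32) has to copy
new = t_1.permute(2, 1, 0).reshape(16, 32)
print(new.data_ptr() == t_1.data_ptr())    # False: memory copy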

kevinstephano commented 2 months ago

Can we measure whether this is a wallclock-time (CPU overhead) or a kernel-time regression? An Nsight profile would show the comparison.
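One way to split the two without an Nsight capture is torch.profiler; a sketch using the repro's f_thunder, eq, and inputs, where self CUDA time isolates kernel time and self CPU time shows the host-side overhead:

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        f_thunder(eq, *inputs)
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20))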

jjsjann123 commented 1 month ago

[Nsight timeline screenshot]

The example above gives the trace here. Ignoring Thunder overhead, let's focus only on the impact of disabling the bookend optimization in the nvFuser executor.

It's bloated with some extra latency from the nvtx ranges, but it looks like the code surrounding the nvFuser execute call isn't adding much overhead (I added two more markers on get_fd and execute for the section in https://github.com/Lightning-AI/lightning-thunder/blob/bd18f0a1feeb0432a10bb17f45858c28cbd3ba29/thunder/executors/nvfuserex_impl.py#L401-L412).
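The markers are along the lines of the stock nvtx API; a sketch of that kind of instrumentation (the get_fd/to_descriptors calls are paraphrased from the linked section, not copied verbatim):

import torch

# hypothetical shape of the two extra ranges around the nvFuser executor's
# FusionDefinition lookup and execution
torch.cuda.nvtx.range_push("get_fd")
fd = get_fd(to_descriptors(args))          # cached FusionDefinition lookup (paraphrased)
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("fd.execute")
out = fd.execute(args)                     # nvFuser FusionDefinition.execute
torch.cuda.nvtx.range_pop()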

The extra CPU latency (when bookend is disabled) is added by nvFuser. In terms of actual kernel time, the script runs tiny kernels and I'm not seeing any difference there at all.

Two follow-up items:

  1. Even though there's only one element-wise kernel, we do have two nvFusion segments, one of which is a no-op segment. (I'm wondering whether any runtime can be shaved off for those.)

  2. The einsum script here is just a toy example, which is part of the reason we are getting hit so hard by CPU latency. I'll try this with nanoGPT to see whether any latency issue shows up in a real benchmark.

jjsjann123 commented 1 month ago

Note to myself.

@wujingyue has some benchmark results in #731, which is what I'm trying to evaluate in item 2 above.