Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

Disable bookend optimization in nvfuser #191

Closed: jjsjann123 closed this issue 1 month ago

jjsjann123 commented 5 months ago

🚀 Feature

Bookend optimization was implemented back when nvFuser made unnecessary copies of tensors going through meta operations (view, reshape, transpose) that could instead be resolved as views.

Since nvFuser now supports alias operations, we want to turn bookend optimization off by default.

Motivation

It is beneficial to have meta operations inside the fusion; reshape is a good example here.

  # t_2: "cuda:0 f32[2, 16, 8, 16]"
  t1 = torch.permute(t_2, [0, 2, 1, 3])  # t1: "cuda:0 f32[2, 8, 16, 16]"
    # t1 = ltorch.permute(t_2, [0, 2, 1, 3])  # t1: "cuda:0 f32[2, 8, 16, 16]"
      # t1 = prims.transpose(t_2, (0, 2, 1, 3))  # t1: "cuda:0 f32[2, 8, 16, 16]"
  del t_2
  t3 = torch.reshape(t1, (2, 8, 256))  # t3: "cuda:0 f32[2, 8, 256]"
    # t3 = ltorch.reshape(t1, (2, 8, 256))  # t3: "cuda:0 f32[2, 8, 256]"
      # t3 = prims.reshape(t1, (2, 8, 256))  # t3: "cuda:0 f32[2, 8, 256]"

For example, reshape could end up being a real data copy, so we want to fuse it with other operations. If t3 is an input to a fusion, we can fuse the permute/reshape with other operations; similarly, if t_2 is an output of a fusion, codegen can make a better decision on whether to fuse the reshape with the previous operation or to produce t_2 in a layout where the following permute+reshape doesn't trigger a memory copy.
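As a minimal sketch of the pattern above (not code from this repo; the function, the elementwise op, and the use of thunder.last_traces to inspect the result are illustrative assumptions), the permute+reshape feeding an elementwise op is exactly what we would like nvFuser to see inside one fusion:

  import torch
  import thunder

  def f(x):
      # the meta ops from the trace above
      t1 = torch.permute(x, (0, 2, 1, 3))   # [2, 16, 8, 16] -> [2, 8, 16, 16]
      t3 = torch.reshape(t1, (2, 8, 256))   # [2, 8, 16, 16] -> [2, 8, 256]
      # an elementwise op the reshape could be fused with
      return t3 * 2.0

  jf = thunder.jit(f)  # nvFuser is typically picked up for CUDA inputs when available
  x = torch.randn(2, 16, 8, 16, device="cuda", dtype=torch.float32)
  out = jf(x)

  # With bookending enabled, the permute/reshape tend to be peeled off the
  # nvFusion region; with it disabled, they can stay inside the fusion.
  # Inspecting thunder.last_traces(jf)[-1] shows which case you got.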

Pitch

The promise we (@wujingyue) made earlier:

we'll bring more clarity to CPU latency and disable bookending with blast radius controlled.

Actionable item:

cc @tfogal @apaz-cli

wujingyue commented 5 months ago

For the first action item, https://github.com/Lightning-AI/lightning-thunder/pull/206 triggers the following CI errors.

FAILED thunder/tests/test_examine_memory.py::test_nanogpt_block_nvfuser_cuda_float32 - AssertionError: assert 235985920 == 242277376
 +  where 242277376 = sum(dict_values([6291456, 3072, 9216, 3072, 3072, 12288, 3072, 7077888, 9437184, 2359296, 9437184, 3072, 3072, 12582912, 18874368, 18874368, -18874368, 100663296, 100663296, -100663296, 6291456, 6291456, -6291456, 6291456, 12599296, -6291456, 25165824, 25165824, 6291456, 6291456, -6291456]))
 +    where dict_values([6291456, 3072, 9216, 3072, 3072, 12288, 3072, 7077888, 9437184, 2359296, 9437184, 3072, 3072, 12582912, 18874368, 18874368, -18874368, 100663296, 100663296, -100663296, 6291456, 6291456, -6291456, 6291456, 12599296, -6291456, 25165824, 25165824, 6291456, 6291456, -6291456]) = <built-in method values of dict object at 0x7fc8e07a5600>()
 +      where <built-in method values of dict object at 0x7fc8e07a5600> = {'del t16': -18874368, 'del t29': -100663296, 'del t49': -6291456, 'del t53': -6291456, ...}.values
FAILED thunder/tests/test_examine_memory.py::test_view_ops_nvfuser_cuda_float32 - assert 128 == 144
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_float64 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_float16 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_bool8 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_int64 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_bfloat16 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_int32 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_float32 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_jit_general.py::test_litgpt_variants[cuda-falcon-40b-like] - RuntimeError: inp->definition() && inp->definition()->isA<PadOp>() INTERNAL ASSERT FAILED at "/Fuser/csrc/preseg_passes/remove_empty.cpp":256, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Inputs to CatOp must be outputs of PadOps

Full log: https://dev.azure.com/Lightning-AI/lightning/_build/results?buildId=199110&view=logs&j=40d0d75b-9508-5bf3-1cc0-c16ca248b52e&t=796307ff-0be9-5add-4a05-17da597b1cc3

Go figure...

wujingyue commented 4 months ago

getitem_nvfuser tests failed for a similar reason to https://github.com/Lightning-AI/lightning-thunder/blob/54bb6146ff757905925f8d9ea2197870c4971011/thunder/tests/opinfos.py#L3113-L3115. I can again create a wrapper so slice objects don't get passed to FusionDefinitionWrapper. But I'd love to hear thoughts from @jjsjann123 and @kevinstephano who may have a better fix.

wujingyue commented 4 months ago

FAILED thunder/tests/test_examine_memory.py::test_view_ops_nvfuser_cuda_float32 - assert 128 == 144 is due to golden testing. 128 is less memory than 144, so it's in fact an improvement.

wujingyue commented 4 months ago

The following failure is gone after I rebase:

FAILED thunder/tests/test_jit_general.py::test_litgpt_variants[cuda-falcon-40b-like] - RuntimeError: inp->definition() && inp->definition()->isA<PadOp>() INTERNAL ASSERT FAILED at "/Fuser/csrc/preseg_passes/remove_empty.cpp":256, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Inputs to CatOp must be outputs of PadOps

However, on my workstation, the tolerance 1e-5 seems to be too small for both TOT and this PR (1e-3 seems to be large enough). I'll see what CI says.
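For reference, a self-contained illustration of what relaxing the check to the locally-passing level would look like (illustrative only; the test relies on assert_close's default float32 tolerances, atol=1e-5 and rtol=1.3e-6, which is what fails in the log below):

  import torch
  from torch.testing import assert_close

  actual = torch.tensor([1.0000, 2.0003])    # stand-ins for actual_logits
  expected = torch.tensor([1.0000, 2.0000])  # and expected_logits
  assert_close(actual, expected, atol=1e-3, rtol=1e-3)  # passes
  # assert_close(actual, expected) with the defaults would fail, as in CI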

$ pytest thunder/tests/test_jit_general.py -k test_litgpt_variants[cuda-falcon-40b-like]
========================================================================================================================================================================================================================================= test session starts =========================================================================================================================================================================================================================================
platform linux -- Python 3.10.12, pytest-8.1.1, pluggy-1.5.0
Test order randomisation NOT enabled. Enable with --random-order or --random-order-bucket=<bucket_type>
benchmark: 4.0.0 (defaults: timer=torch.utils.benchmark.utils.timer.timer disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=True warmup_iterations=100000)
rootdir: /opt/pytorch/lightning-thunder
configfile: pyproject.toml
plugins: timestamper-0.0.10, xdist-3.5.0, random-order-1.1.1, cov-4.1.0, benchmark-4.0.0, hypothesis-6.100.0, timeout-2.2.0, anyio-4.3.0, shard-0.1.2
timeout: 900.0s
timeout method: signal
timeout func_only: False
collected 68 items / 67 deselected / 1 selected
Running 1 items in this shard

thunder/tests/test_jit_general.py F                                                                                                                                                                                                                                                                                                                                                                                                                                                             [100%]

============================================================================================================================================================================================================================================== FAILURES ===============================================================================================================================================================================================================================================
_____________________________________________________________________________________________________________________________________________________________________________________________________________________________ test_litgpt_variants[cuda-falcon-40b-like] ______________________________________________________________________________________________________________________________________________________________________________________________________________________________

name = 'falcon-40b-like', device = device(type='cuda')

    @skipif_not_pytorch_2_1
    @pytest.mark.parametrize(
        "name",
        (
            "gpt-neox-like",
            "llama1-like",
            "long-context-like",
            "llama2-like",
            "falcon-7b-like",
            "falcon-40b-like",
            "codellama2-like",
            pytest.param(
                "mixtral-like",
                marks=pytest.mark.xfail(raises=(NotImplementedError, TypeError), reason="topk and where", strict=True),
            ),
        ),
    )
    @pytest.mark.parametrize(
        "device",
        ("cpu", "cuda", "meta"),
    )
    def test_litgpt_variants(name, device):
        if device == "cuda" and not torch.cuda.is_available():
            pytest.skip("CUDA not available")

        device = torch.device(device)

        x = torch.randint(0, 200, (5, 5), device=device)
        config = litgpt_model.Config.from_name(name)

        with device:
            reference = litgpt_model.GPT(config)
        expected_logits = reference(x)

        expected_logits.sum().backward()

        with device:
            model = litgpt_model.GPT(config)
        model.load_state_dict(reference.state_dict())
        tom = thunder.jit(model, executors=nvfuserex if device.type == "cuda" else torchex)
        actual_logits = tom(x)
>       assert_close(actual_logits, expected_logits)
E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 2153 / 12700 (17.0%)
E       Greatest absolute difference: 0.0003186464309692383 at index (3, 3, 33) (up to 1e-05 allowed)
E       Greatest relative difference: 0.4005630612373352 at index (1, 4, 143) (up to 1.3e-06 allowed)

thunder/tests/test_jit_general.py:654: AssertionError
======================================================================================================================================================================================================================================= short test summary info =======================================================================================================================================================================================================================================
FAILED thunder/tests/test_jit_general.py::test_litgpt_variants[cuda-falcon-40b-like] - AssertionError: Tensor-likes are not close!
============================================================================================================================================================================================================================ 1 failed, 67 deselected, 6 warnings in 7.70s =============================================================================================================================================================================================================================

wujingyue commented 4 months ago

Filed another blocker: https://github.com/Lightning-AI/lightning-thunder/issues/549

wujingyue commented 4 months ago

Yet-another blocker: https://github.com/NVIDIA/Fuser/issues/2362

wujingyue commented 4 months ago

These are all blockers that I can tell from the recent CI run. 🤞

wujingyue commented 4 months ago

The previous blockers have all been fixed. However, the most recent CI run failed with new errors -- number mismatches this time...

=========================== short test summary info ============================
FAILED thunder/tests/test_grad.py::test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_nvfuser_cuda_float16 - AssertionError: Tensor-likes are not close!

Mismatched elements: 139 / 114688 (0.1%)
Greatest absolute difference: 0.001220703125 at index (0, 0, 15, 53) (up to 1e-05 allowed)
Greatest relative difference: 0.115966796875 at index (2, 0, 17, 11) (up to 0.001 allowed)
FAILED thunder/tests/test_grad.py::test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_nvfuser_cuda_bfloat16 - AssertionError: Tensor-likes are not close!

Mismatched elements: 210005 / 212992 (98.6%)
Greatest absolute difference: 688.0 at index (0, 0, 48, 81) (up to 1e-05 allowed)
Greatest relative difference: inf at index (7, 0, 110, 32) (up to 0.016 allowed)
FAILED thunder/tests/test_grad.py::test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_torch_cuda_bfloat16 - AssertionError: Tensor-likes are not close!

Mismatched elements: 209468 / 212992 (98.3%)
Greatest absolute difference: 848.0 at index (5, 1, 56, 19) (up to 1e-05 allowed)
Greatest relative difference: inf at index (4, 1, 109, 70) (up to 0.016 allowed)
FAILED thunder/tests/test_grad.py::test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_torch_cuda_float16 - AssertionError: Tensor-likes are not close!

Mismatched elements: 210696 / 212992 (98.9%)
Greatest absolute difference: 704.0 at index (0, 1, 49, 13) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 2, 8) (up to 0.001 allowed)
FAILED thunder/tests/test_cudnn_executor.py::test_vjp_correctness_cudnn_sdpa[bfloat16-never-cat-grad-qkv] - AssertionError: Tensor-likes are not close!

Mismatched elements: 916 / 1310720 (0.1%)
Greatest absolute difference: 1.21875 at index (7, 1, 0, 96) (up to 0.2 allowed)
Greatest relative difference: inf at index (1, 1, 0, 2) (up to 0.02 allowed)
FAILED thunder/tests/test_cudnn_executor.py::test_vjp_correctness_cudnn_sdpa[float16-never-cat-grad-qkv] - AssertionError: Tensor-likes are not close!

Mismatched elements: 1582 / 1310720 (0.1%)
Greatest absolute difference: 1.0703125 at index (5, 0, 1, 39) (up to 0.2 allowed)
Greatest relative difference: inf at index (0, 1, 0, 2) (up to 0.02 allowed)
FAILED thunder/tests/test_cudnn_executor.py::test_vjp_correctness_cudnn_sdpa[bfloat16-may-cat-grad-qkv] - AssertionError: Tensor-likes are not close!

Mismatched elements: 1372 / 1310720 (0.1%)
Greatest absolute difference: 1.0859375 at index (6, 0, 1, 24) (up to 0.2 allowed)
Greatest relative difference: inf at index (1, 1, 0, 0) (up to 0.02 allowed)
FAILED thunder/tests/test_cudnn_executor.py::test_vjp_correctness_cudnn_sdpa[float16-may-cat-grad-qkv] - AssertionError: Tensor-likes are not close!

Mismatched elements: 1474 / 1310720 (0.1%)
Greatest absolute difference: 1.400390625 at index (2, 0, 1, 45) (up to 0.2 allowed)
Greatest relative difference: inf at index (0, 1, 0, 1) (up to 0.02 allowed)
= 8 failed, 4548 passed, 790 skipped, 91 xfailed, 93 xpassed, 110694 warnings in 975.39s (0:16:15) =
/usr/local/lib/python3.10/dist-packages/coverage/control.py:888: CoverageWarning: No data was collected. (no-data-collected)
  self._warn("No data was collected.", slug="no-data-collected")

##[error]Bash exited with code '1'.
Finishing: Testing: regular

wujingyue commented 4 months ago

Good news: these number mismatches no longer show up after I resync.

Bad news: distributed tests start to fail.

One error is https://github.com/NVIDIA/Fuser/issues/2395.

The other error seems to be that https://github.com/Lightning-AI/lightning-thunder/blob/c21533c12a2a826aee84e011c415b216cb6f779d/thunder/tests/distributed/test_ddp.py#L771-L777 expects slices and pads to appear in the top-level trace, whereas they are now fused into an nvFusion. cc @crcrpar

tfogal commented 4 months ago

Thanks for identifying latest status, Jingyue!

Good news: these number mismatches no longer show up after I resync.

🎉

Bad news: distributed tests start to fail.

😢

One error is NVIDIA/Fuser#2395.

The other error seems to be that

https://github.com/Lightning-AI/lightning-thunder/blob/c21533c12a2a826aee84e011c415b216cb6f779d/thunder/tests/distributed/test_ddp.py#L771-L777

expects slices and pads to appear in the top-level trace, whereas they are now fused into an nvFusion. cc @crcrpar

Ahh, yeah that may have made sense before but less so now.

@crcrpar can we beg your help here? Could you update the test to assert that the associated slice/pad ops are either top-level symbols or were taken by a fused executor? From Thunder's point of view, either is acceptable.
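A rough sketch of what such a relaxed check could look like (a hypothetical helper, assuming the executed trace exposes bound_symbols whose entries carry a sym.name and nested subsymbols; the symbol name in the usage comment is a placeholder):

  def op_in_trace(trace, op_name: str) -> bool:
      """True if op_name appears as a top-level bound symbol or anywhere
      inside a fused region (e.g. an nvFusion) of the trace."""

      def contains(bsym) -> bool:
          if bsym.sym.name == op_name:
              return True
          return any(contains(sub) for sub in bsym.subsymbols)

      return any(contains(bsym) for bsym in trace.bound_symbols)

  # usage (illustrative):
  #   assert op_in_trace(execution_trace, "<slice/pad symbol name>")
  # instead of requiring the slice/pad to appear at the top level only.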

wujingyue commented 3 months ago

I'll push more fixes tomorrow to tie up loose ends for CI.

Despite these pending CI failures, I was able to put some perf readings in https://github.com/Lightning-AI/lightning-thunder/pull/731, ~but I've yet to digest them.~ The results don't look good enough at this moment to merge.

wujingyue commented 1 month ago

Closed by #731