Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

Disable bookend optimization in nvfuser #191

Closed: jjsjann123 closed this issue 1 month ago

jjsjann123 commented 5 months ago

🚀 Feature

Bookend optimization was implemented back when nvFuser made unnecessary copies of tensors going through meta operations (view, reshape, transpose) that could instead be resolved as views.

Since nvFuser now supports alias operations, we want to turn bookend optimization off by default.

Motivation

It is beneficial to have meta operations inside the fusion; reshape is a good example here.

  # t_2: "cuda:0 f32[2, 16, 8, 16]"
  t1 = torch.permute(t_2, [0, 2, 1, 3])  # t1: "cuda:0 f32[2, 8, 16, 16]"
    # t1 = ltorch.permute(t_2, [0, 2, 1, 3])  # t1: "cuda:0 f32[2, 8, 16, 16]"
      # t1 = prims.transpose(t_2, (0, 2, 1, 3))  # t1: "cuda:0 f32[2, 8, 16, 16]"
  del t_2
  t3 = torch.reshape(t1, (2, 8, 256))  # t3: "cuda:0 f32[2, 8, 256]"
    # t3 = ltorch.reshape(t1, (2, 8, 256))  # t3: "cuda:0 f32[2, 8, 256]"
      # t3 = prims.reshape(t1, (2, 8, 256))  # t3: "cuda:0 f32[2, 8, 256]"

For example, reshape could end up being a real data copy, so we want to fuse it with other operations. If t3 is an input to a fusion, we can fuse the permute/reshape with other operations; similarly, if t_2 is an output of a fusion, codegen can make a better decision on whether to fuse the reshape with the previous operation or to produce t_2 in a layout where the following permute+reshape doesn't trigger a memory copy.
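As a minimal sketch of the pattern above (not code from this repo; the function, the elementwise op, and the use of thunder.last_traces to inspect the result are illustrative assumptions), the permute+reshape feeding an elementwise op is exactly what we would like nvFuser to see inside one fusion:

  import torch
  import thunder

  def f(x):
      # the meta ops from the trace above
      t1 = torch.permute(x, (0, 2, 1, 3))   # [2, 16, 8, 16] -> [2, 8, 16, 16]
      t3 = torch.reshape(t1, (2, 8, 256))   # [2, 8, 16, 16] -> [2, 8, 256]
      # an elementwise op the reshape could be fused with
      return t3 * 2.0

  jf = thunder.jit(f)  # nvFuser is typically picked up for CUDA inputs when available
  x = torch.randn(2, 16, 8, 16, device="cuda", dtype=torch.float32)
  out = jf(x)

  # With bookending enabled, the permute/reshape tend to be peeled off the
  # nvFusion region; with it disabled, they can stay inside the fusion.
  # Inspecting thunder.last_traces(jf)[-1] shows which case you got.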

Pitch

The promise we (@wujingyue) made earlier:

we'll bring more clarity to CPU latency and disable bookending with blast radius controlled.

Actionable item:

cc @tfogal @apaz-cli

wujingyue commented 5 months ago

For the first action item, https://github.com/Lightning-AI/lightning-thunder/pull/206 triggers the following CI errors.

FAILED thunder/tests/test_examine_memory.py::test_nanogpt_block_nvfuser_cuda_float32 - AssertionError: assert 235985920 == 242277376
 +  where 242277376 = sum(dict_values([6291456, 3072, 9216, 3072, 3072, 12288, 3072, 7077888, 9437184, 2359296, 9437184, 3072, 3072, 12582912, 18874368, 18874368, -18874368, 100663296, 100663296, -100663296, 6291456, 6291456, -6291456, 6291456, 12599296, -6291456, 25165824, 25165824, 6291456, 6291456, -6291456]))
 +    where dict_values([6291456, 3072, 9216, 3072, 3072, 12288, 3072, 7077888, 9437184, 2359296, 9437184, 3072, 3072, 12582912, 18874368, 18874368, -18874368, 100663296, 100663296, -100663296, 6291456, 6291456, -6291456, 6291456, 12599296, -6291456, 25165824, 25165824, 6291456, 6291456, -6291456]) = <built-in method values of dict object at 0x7fc8e07a5600>()
 +      where <built-in method values of dict object at 0x7fc8e07a5600> = {'del t16': -18874368, 'del t29': -100663296, 'del t49': -6291456, 'del t53': -6291456, ...}.values
FAILED thunder/tests/test_examine_memory.py::test_view_ops_nvfuser_cuda_float32 - assert 128 == 144
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_float64 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_float16 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_bool8 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_int64 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_bfloat16 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_int32 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_float32 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_jit_general.py::test_litgpt_variants[cuda-falcon-40b-like] - RuntimeError: inp->definition() && inp->definition()->isA<PadOp>() INTERNAL ASSERT FAILED at "/Fuser/csrc/preseg_passes/remove_empty.cpp":256, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Inputs to CatOp must be outputs of PadOps

Full log: https://dev.azure.com/Lightning-AI/lightning/_build/results?buildId=199110&view=logs&j=40d0d75b-9508-5bf3-1cc0-c16ca248b52e&t=796307ff-0be9-5add-4a05-17da597b1cc3

Go figure...

wujingyue commented 4 months ago

getitem_nvfuser tests failed for a similar reason to https://github.com/Lightning-AI/lightning-thunder/blob/54bb6146ff757905925f8d9ea2197870c4971011/thunder/tests/opinfos.py#L3113-L3115. I can again create a wrapper so slice objects don't get passed to FusionDefinitionWrapper. But I'd love to hear thoughts from @jjsjann123 and @kevinstephano who may have a better fix.

wujingyue commented 4 months ago

FAILED thunder/tests/test_examine_memory.py::test_view_ops_nvfuser_cuda_float32 - assert 128 == 144 is due to golden testing. 128 is less memory than 144, so it's in fact an improvement.

wujingyue commented 4 months ago

The following failure is gone after I rebase:

FAILED thunder/tests/test_jit_general.py::test_litgpt_variants[cuda-falcon-40b-like] - RuntimeError: inp->definition() && inp->definition()->isA<PadOp>() INTERNAL ASSERT FAILED at "/Fuser/csrc/preseg_passes/remove_empty.cpp":256, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Inputs to CatOp must be outputs of PadOps

However, on my workstation, the tolerance 1e-5 seems to be too small for both TOT and this PR (1e-3 seems to be large enough). I'll see what CI says.
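For reference, a self-contained illustration of what relaxing the check to the locally-passing level would look like (illustrative only; the test relies on assert_close's default float32 tolerances, atol=1e-5 and rtol=1.3e-6, which is what fails in the log below):

  import torch
  from torch.testing import assert_close

  actual = torch.tensor([1.0000, 2.0003])    # stand-ins for actual_logits
  expected = torch.tensor([1.0000, 2.0000])  # and expected_logits
  assert_close(actual, expected, atol=1e-3, rtol=1e-3)  # passes
  # assert_close(actual, expected) with the defaults would fail, as in CI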

$ pytest thunder/tests/test_jit_general.py -k test_litgpt_variants[cuda-falcon-40b-like]
========================================================================================================================================================================================================================================= test session starts =========================================================================================================================================================================================================================================
platform linux -- Python 3.10.12, pytest-8.1.1, pluggy-1.5.0
Test order randomisation NOT enabled. Enable with --random-order or --random-order-bucket=<bucket_type>
benchmark: 4.0.0 (defaults: timer=torch.utils.benchmark.utils.timer.timer disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=True warmup_iterations=100000)
rootdir: /opt/pytorch/lightning-thunder
configfile: pyproject.toml
plugins: timestamper-0.0.10, xdist-3.5.0, random-order-1.1.1, cov-4.1.0, benchmark-4.0.0, hypothesis-6.100.0, timeout-2.2.0, anyio-4.3.0, shard-0.1.2
timeout: 900.0s
timeout method: signal
timeout func_only: False
collected 68 items / 67 deselected / 1 selected
Running 1 items in this shard

thunder/tests/test_jit_general.py F                                                                                                                                                                                                                                                                                                                                                                                                                                                             [100%]

============================================================================================================================================================================================================================================== FAILURES ===============================================================================================================================================================================================================================================
_____________________________________________________________________________________________________________________________________________________________________________________________________________________________ test_litgpt_variants[cuda-falcon-40b-like] ______________________________________________________________________________________________________________________________________________________________________________________________________________________________

name = 'falcon-40b-like', device = device(type='cuda')

    @skipif_not_pytorch_2_1
    @pytest.mark.parametrize(
        "name",
        (
            "gpt-neox-like",
            "llama1-like",
            "long-context-like",
            "llama2-like",
            "falcon-7b-like",
            "falcon-40b-like",
            "codellama2-like",
            pytest.param(
                "mixtral-like",
                marks=pytest.mark.xfail(raises=(NotImplementedError, TypeError), reason="topk and where", strict=True),
            ),
        ),
    )
    @pytest.mark.parametrize(
        "device",
        ("cpu", "cuda", "meta"),
    )
    def test_litgpt_variants(name, device):
        if device == "cuda" and not torch.cuda.is_available():
            pytest.skip("CUDA not available")

        device = torch.device(device)

        x = torch.randint(0, 200, (5, 5), device=device)
        config = litgpt_model.Config.from_name(name)

        with device:
            reference = litgpt_model.GPT(config)
        expected_logits = reference(x)

        expected_logits.sum().backward()

        with device:
            model = litgpt_model.GPT(config)
        model.load_state_dict(reference.state_dict())
        tom = thunder.jit(model, executors=nvfuserex if device.type == "cuda" else torchex)
        actual_logits = tom(x)
>       assert_close(actual_logits, expected_logits)
E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 2153 / 12700 (17.0%)
E       Greatest absolute difference: 0.0003186464309692383 at index (3, 3, 33) (up to 1e-05 allowed)
E       Greatest relative difference: 0.4005630612373352 at index (1, 4, 143) (up to 1.3e-06 allowed)

thunder/tests/test_jit_general.py:654: AssertionError
======================================================================================================================================================================================================================================= short test summary info =======================================================================================================================================================================================================================================
FAILED thunder/tests/test_jit_general.py::test_litgpt_variants[cuda-falcon-40b-like] - AssertionError: Tensor-likes are not close!
============================================================================================================================================================================================================================ 1 failed, 67 deselected, 6 warnings in 7.70s =============================================================================================================================================================================================================================

wujingyue commented 4 months ago

Filed another blocker: https://github.com/Lightning-AI/lightning-thunder/issues/549

wujingyue commented 4 months ago

Yet-another blocker: https://github.com/NVIDIA/Fuser/issues/2362

wujingyue commented 4 months ago

These are all blockers that I can tell from the recent CI run. 🤞

wujingyue commented 4 months ago

The previous blockers have all been fixed. However, the most recent CI run failed with new errors -- number mismatches this time...

=========================== short test summary info ============================
FAILED thunder/tests/test_grad.py::test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_nvfuser_cuda_float16 - AssertionError: Tensor-likes are not close!

Mismatched elements: 139 / 114688 (0.1%)
Greatest absolute difference: 0.001220703125 at index (0, 0, 15, 53) (up to 1e-05 allowed)
Greatest relative difference: 0.115966796875 at index (2, 0, 17, 11) (up to 0.001 allowed)
FAILED thunder/tests/test_grad.py::test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_nvfuser_cuda_bfloat16 - AssertionError: Tensor-likes are not close!

Mismatched elements: 210005 / 212992 (98.6%)
Greatest absolute difference: 688.0 at index (0, 0, 48, 81) (up to 1e-05 allowed)
Greatest relative difference: inf at index (7, 0, 110, 32) (up to 0.016 allowed)
FAILED thunder/tests/test_grad.py::test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_torch_cuda_bfloat16 - AssertionError: Tensor-likes are not close!

Mismatched elements: 209468 / 212992 (98.3%)
Greatest absolute difference: 848.0 at index (5, 1, 56, 19) (up to 1e-05 allowed)
Greatest relative difference: inf at index (4, 1, 109, 70) (up to 0.016 allowed)
FAILED thunder/tests/test_grad.py::test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_torch_cuda_float16 - AssertionError: Tensor-likes are not close!

Mismatched elements: 210696 / 212992 (98.9%)
Greatest absolute difference: 704.0 at index (0, 1, 49, 13) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 2, 8) (up to 0.001 allowed)
FAILED thunder/tests/test_cudnn_executor.py::test_vjp_correctness_cudnn_sdpa[bfloat16-never-cat-grad-qkv] - AssertionError: Tensor-likes are not close!

Mismatched elements: 916 / 1310720 (0.1%)
Greatest absolute difference: 1.21875 at index (7, 1, 0, 96) (up to 0.2 allowed)
Greatest relative difference: inf at index (1, 1, 0, 2) (up to 0.02 allowed)
FAILED thunder/tests/test_cudnn_executor.py::test_vjp_correctness_cudnn_sdpa[float16-never-cat-grad-qkv] - AssertionError: Tensor-likes are not close!

Mismatched elements: 1582 / 1310720 (0.1%)
Greatest absolute difference: 1.0703125 at index (5, 0, 1, 39) (up to 0.2 allowed)
Greatest relative difference: inf at index (0, 1, 0, 2) (up to 0.02 allowed)
FAILED thunder/tests/test_cudnn_executor.py::test_vjp_correctness_cudnn_sdpa[bfloat16-may-cat-grad-qkv] - AssertionError: Tensor-likes are not close!

Mismatched elements: 1372 / 1310720 (0.1%)
Greatest absolute difference: 1.0859375 at index (6, 0, 1, 24) (up to 0.2 allowed)
Greatest relative difference: inf at index (1, 1, 0, 0) (up to 0.02 allowed)
FAILED thunder/tests/test_cudnn_executor.py::test_vjp_correctness_cudnn_sdpa[float16-may-cat-grad-qkv] - AssertionError: Tensor-likes are not close!

Mismatched elements: 1474 / 1310720 (0.1%)
Greatest absolute difference: 1.400390625 at index (2, 0, 1, 45) (up to 0.2 allowed)
Greatest relative difference: inf at index (0, 1, 0, 1) (up to 0.02 allowed)
= 8 failed, 4548 passed, 790 skipped, 91 xfailed, 93 xpassed, 110694 warnings in 975.39s (0:16:15) =
/usr/local/lib/python3.10/dist-packages/coverage/control.py:888: CoverageWarning: No data was collected. (no-data-collected)
  self._warn("No data was collected.", slug="no-data-collected")

##[error]Bash exited with code '1'.
Finishing: Testing: regular

wujingyue commented 4 months ago

Good news: these number mismatches no longer show up after I resync.

Bad news: distributed tests start to fail.

One error is https://github.com/NVIDIA/Fuser/issues/2395.

The other error seems to be that https://github.com/Lightning-AI/lightning-thunder/blob/c21533c12a2a826aee84e011c415b216cb6f779d/thunder/tests/distributed/test_ddp.py#L771-L777 expects slices and pads to appear in the top-level trace, whereas they are now fused into an nvFusion. cc @crcrpar

tfogal commented 4 months ago

Thanks for identifying latest status, Jingyue!

Good news: these number mismatches no longer show up after I resync.

🎉

Bad news: distributed tests start to fail.

😢

One error is NVIDIA/Fuser#2395.

The other error seems to be that

https://github.com/Lightning-AI/lightning-thunder/blob/c21533c12a2a826aee84e011c415b216cb6f779d/thunder/tests/distributed/test_ddp.py#L771-L777

expects slices and pads to appear in the top-level trace, whereas they are now fused into an nvFusion. cc @crcrpar

Ahh, yeah that may have made sense before but less so now.

@crcrpar can we beg your help here? Could you update the test to assert that the associated slice/pad ops are either top-level symbols or were taken by a fused executor? From Thunder's point of view, either is acceptable.
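A rough sketch of what such a relaxed check could look like (a hypothetical helper, assuming the executed trace exposes bound_symbols whose entries carry a sym.name and nested subsymbols; the symbol name in the usage comment is a placeholder):

  def op_in_trace(trace, op_name: str) -> bool:
      """True if op_name appears as a top-level bound symbol or anywhere
      inside a fused region (e.g. an nvFusion) of the trace."""

      def contains(bsym) -> bool:
          if bsym.sym.name == op_name:
              return True
          return any(contains(sub) for sub in bsym.subsymbols)

      return any(contains(bsym) for bsym in trace.bound_symbols)

  # usage (illustrative):
  #   assert op_in_trace(execution_trace, "<slice/pad symbol name>")
  # instead of requiring the slice/pad to appear at the top level only.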

wujingyue commented 3 months ago

I'll push more fixes tomorrow to tie up loose ends for CI.

Despite these pending CI failures, I was able to put some perf readings in https://github.com/Lightning-AI/lightning-thunder/pull/731, ~but I've yet to digest them.~ The results don't look good enough at this moment to merge.

wujingyue commented 1 month ago

Closed by #731