For the first action item, https://github.com/Lightning-AI/lightning-thunder/pull/206 triggers the following CI errors.
FAILED thunder/tests/test_examine_memory.py::test_nanogpt_block_nvfuser_cuda_float32 - AssertionError: assert 235985920 == 242277376
+ where 242277376 = sum(dict_values([6291456, 3072, 9216, 3072, 3072, 12288, 3072, 7077888, 9437184, 2359296, 9437184, 3072, 3072, 12582912, 18874368, 18874368, -18874368, 100663296, 100663296, -100663296, 6291456, 6291456, -6291456, 6291456, 12599296, -6291456, 25165824, 25165824, 6291456, 6291456, -6291456]))
+ where dict_values([6291456, 3072, 9216, 3072, 3072, 12288, 3072, 7077888, 9437184, 2359296, 9437184, 3072, 3072, 12582912, 18874368, 18874368, -18874368, 100663296, 100663296, -100663296, 6291456, 6291456, -6291456, 6291456, 12599296, -6291456, 25165824, 25165824, 6291456, 6291456, -6291456]) = <built-in method values of dict object at 0x7fc8e07a5600>()
+ where <built-in method values of dict object at 0x7fc8e07a5600> = {'del t16': -18874368, 'del t29': -100663296, 'del t49': -6291456, 'del t53': -6291456, ...}.values
FAILED thunder/tests/test_examine_memory.py::test_view_ops_nvfuser_cuda_float32 - assert 128 == 144
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_float64 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_float16 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_bool8 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_int64 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_bfloat16 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_int32 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_getitem_nvfuser_cuda_float32 - ValueError: Trying to extract a dtype from object slice(3, 1, None) with unknown type <class 'slice'>!
FAILED thunder/tests/test_jit_general.py::test_litgpt_variants[cuda-falcon-40b-like] - RuntimeError: inp->definition() && inp->definition()->isA<PadOp>() INTERNAL ASSERT FAILED at "/Fuser/csrc/preseg_passes/remove_empty.cpp":256, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Inputs to CatOp must be outputs of PadOps
Go figure...
The getitem_nvfuser tests failed for a reason similar to https://github.com/Lightning-AI/lightning-thunder/blob/54bb6146ff757905925f8d9ea2197870c4971011/thunder/tests/opinfos.py#L3113-L3115. I can again create a wrapper so that slice objects don't get passed to FusionDefinitionWrapper, but I'd love to hear thoughts from @jjsjann123 and @kevinstephano, who may have a better fix.
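For context, here is what `slice(3, 1, None)` does, plus a minimal sketch of the kind of wrapper I have in mind; `normalize_slice` is a hypothetical helper, not thunder's actual API:

```python
import torch

# slice(3, 1, None) selects nothing: start >= stop with an implicit step of 1.
t = torch.arange(6.0)
assert t[slice(3, 1, None)].numel() == 0

# Sketch: turn a slice object into plain ints before it can reach
# FusionDefinitionWrapper, since nvFuser's dtype extraction rejects
# `slice` instances.
def normalize_slice(s: slice, dim_len: int) -> tuple[int, int, int]:
    # slice.indices clamps None and out-of-range values to concrete ints.
    return s.indices(dim_len)

print(normalize_slice(slice(3, 1, None), 6))  # (3, 1, 1)
```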
FAILED thunder/tests/test_examine_memory.py::test_view_ops_nvfuser_cuda_float32 - assert 128 == 144
is due to golden-value testing: 128 is less memory than 144, so it's in fact an improvement.
FAILED thunder/tests/test_jit_general.py::test_litgpt_variants[cuda-falcon-40b-like] - RuntimeError: inp->definition() && inp->definition()->isA<PadOp>() INTERNAL ASSERT FAILED at "/Fuser/csrc/preseg_passes/remove_empty.cpp":256, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Inputs to CatOp must be outputs of PadOps
is gone after I rebase.
However, on my workstation, the tolerance of 1e-5 seems to be too small for both TOT and this PR (1e-3 seems to be large enough; see the sketch after the log below). I'll see what CI says.
$ pytest thunder/tests/test_jit_general.py -k test_litgpt_variants[cuda-falcon-40b-like]
============================= test session starts ==============================
platform linux -- Python 3.10.12, pytest-8.1.1, pluggy-1.5.0
Test order randomisation NOT enabled. Enable with --random-order or --random-order-bucket=<bucket_type>
benchmark: 4.0.0 (defaults: timer=torch.utils.benchmark.utils.timer.timer disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=True warmup_iterations=100000)
rootdir: /opt/pytorch/lightning-thunder
configfile: pyproject.toml
plugins: timestamper-0.0.10, xdist-3.5.0, random-order-1.1.1, cov-4.1.0, benchmark-4.0.0, hypothesis-6.100.0, timeout-2.2.0, anyio-4.3.0, shard-0.1.2
timeout: 900.0s
timeout method: signal
timeout func_only: False
collected 68 items / 67 deselected / 1 selected
Running 1 items in this shard
thunder/tests/test_jit_general.py F [100%]
=================================== FAILURES ===================================
__________________ test_litgpt_variants[cuda-falcon-40b-like] __________________

name = 'falcon-40b-like', device = device(type='cuda')

    @skipif_not_pytorch_2_1
    @pytest.mark.parametrize(
        "name",
        (
            "gpt-neox-like",
            "llama1-like",
            "long-context-like",
            "llama2-like",
            "falcon-7b-like",
            "falcon-40b-like",
            "codellama2-like",
            pytest.param(
                "mixtral-like",
                marks=pytest.mark.xfail(raises=(NotImplementedError, TypeError), reason="topk and where", strict=True),
            ),
        ),
    )
    @pytest.mark.parametrize(
        "device",
        ("cpu", "cuda", "meta"),
    )
    def test_litgpt_variants(name, device):
        if device == "cuda" and not torch.cuda.is_available():
            pytest.skip("CUDA not available")
        device = torch.device(device)
        x = torch.randint(0, 200, (5, 5), device=device)
        config = litgpt_model.Config.from_name(name)
        with device:
            reference = litgpt_model.GPT(config)
        expected_logits = reference(x)
        expected_logits.sum().backward()
        with device:
            model = litgpt_model.GPT(config)
        model.load_state_dict(reference.state_dict())
        tom = thunder.jit(model, executors=nvfuserex if device.type == "cuda" else torchex)
        actual_logits = tom(x)
>       assert_close(actual_logits, expected_logits)
E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 2153 / 12700 (17.0%)
E       Greatest absolute difference: 0.0003186464309692383 at index (3, 3, 33) (up to 1e-05 allowed)
E       Greatest relative difference: 0.4005630612373352 at index (1, 4, 143) (up to 1.3e-06 allowed)

thunder/tests/test_jit_general.py:654: AssertionError
=========================== short test summary info ============================
FAILED thunder/tests/test_jit_general.py::test_litgpt_variants[cuda-falcon-40b-like] - AssertionError: Tensor-likes are not close!
================= 1 failed, 67 deselected, 6 warnings in 7.70s =================
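For reference, a minimal illustration of the tolerance gap, with stand-in tensors rather than the actual logits:

```python
import torch
from torch.testing import assert_close

# torch's default float32 tolerances are rtol=1.3e-6 and atol=1e-5. The
# greatest absolute difference in the log above (~3.2e-4) blows past
# atol=1e-5 but fits comfortably under atol=1e-3.
expected = torch.randn(5, 5, 508)  # 12700 elements, like the logits above
actual = expected + 3.2e-4         # stand-in for the nvFuser-vs-eager gap

assert_close(actual, expected, rtol=1.3e-6, atol=1e-3)  # passes
# assert_close(actual, expected)  # fails: atol defaults to 1e-5
```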
Filed another blocker: https://github.com/Lightning-AI/lightning-thunder/issues/549
Yet-another blocker: https://github.com/NVIDIA/Fuser/issues/2362
These are all the blockers that I can identify from the recent CI run. 🤞
The previous blockers have all been fixed. However, the most recent CI run failed with new errors -- number mismatches this time...
=========================== short test summary info ============================
FAILED thunder/tests/test_grad.py::test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_nvfuser_cuda_float16 - AssertionError: Tensor-likes are not close!
Mismatched elements: 139 / 114688 (0.1%)
Greatest absolute difference: 0.001220703125 at index (0, 0, 15, 53) (up to 1e-05 allowed)
Greatest relative difference: 0.115966796875 at index (2, 0, 17, 11) (up to 0.001 allowed)
FAILED thunder/tests/test_grad.py::test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_nvfuser_cuda_bfloat16 - AssertionError: Tensor-likes are not close!
Mismatched elements: 210005 / 212992 (98.6%)
Greatest absolute difference: 688.0 at index (0, 0, 48, 81) (up to 1e-05 allowed)
Greatest relative difference: inf at index (7, 0, 110, 32) (up to 0.016 allowed)
FAILED thunder/tests/test_grad.py::test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_torch_cuda_bfloat16 - AssertionError: Tensor-likes are not close!
Mismatched elements: 209468 / 212992 (98.3%)
Greatest absolute difference: 848.0 at index (5, 1, 56, 19) (up to 1e-05 allowed)
Greatest relative difference: inf at index (4, 1, 109, 70) (up to 0.016 allowed)
FAILED thunder/tests/test_grad.py::test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_torch_cuda_float16 - AssertionError: Tensor-likes are not close!
Mismatched elements: 210696 / 212992 (98.9%)
Greatest absolute difference: 704.0 at index (0, 1, 49, 13) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 2, 8) (up to 0.001 allowed)
FAILED thunder/tests/test_cudnn_executor.py::test_vjp_correctness_cudnn_sdpa[bfloat16-never-cat-grad-qkv] - AssertionError: Tensor-likes are not close!
Mismatched elements: 916 / 1310720 (0.1%)
Greatest absolute difference: 1.21875 at index (7, 1, 0, 96) (up to 0.2 allowed)
Greatest relative difference: inf at index (1, 1, 0, 2) (up to 0.02 allowed)
FAILED thunder/tests/test_cudnn_executor.py::test_vjp_correctness_cudnn_sdpa[float16-never-cat-grad-qkv] - AssertionError: Tensor-likes are not close!
Mismatched elements: 1582 / 1310720 (0.1%)
Greatest absolute difference: 1.0703125 at index (5, 0, 1, 39) (up to 0.2 allowed)
Greatest relative difference: inf at index (0, 1, 0, 2) (up to 0.02 allowed)
FAILED thunder/tests/test_cudnn_executor.py::test_vjp_correctness_cudnn_sdpa[bfloat16-may-cat-grad-qkv] - AssertionError: Tensor-likes are not close!
Mismatched elements: 1372 / 1310720 (0.1%)
Greatest absolute difference: 1.0859375 at index (6, 0, 1, 24) (up to 0.2 allowed)
Greatest relative difference: inf at index (1, 1, 0, 0) (up to 0.02 allowed)
FAILED thunder/tests/test_cudnn_executor.py::test_vjp_correctness_cudnn_sdpa[float16-may-cat-grad-qkv] - AssertionError: Tensor-likes are not close!
Mismatched elements: 1474 / 1310720 (0.1%)
Greatest absolute difference: 1.400390625 at index (2, 0, 1, 45) (up to 0.2 allowed)
Greatest relative difference: inf at index (0, 1, 0, 1) (up to 0.02 allowed)
= 8 failed, 4548 passed, 790 skipped, 91 xfailed, 93 xpassed, 110694 warnings in 975.39s (0:16:15) =
Good news: these number mismatches no longer show up after I resync.
Bad news: distributed tests start to fail.
One error is https://github.com/NVIDIA/Fuser/issues/2395.
The other error seems to be that https://github.com/Lightning-AI/lightning-thunder/blob/c21533c12a2a826aee84e011c415b216cb6f779d/thunder/tests/distributed/test_ddp.py#L771-L777 expects slices and pads to appear in the top-level trace, whereas they are now fused into an nvFusion. cc @crcrpar
Thanks for identifying the latest status, Jingyue!
> Good news: these number mismatches no longer show up after I resync.

🎉

> Bad news: distributed tests start to fail.

😢

> One error is NVIDIA/Fuser#2395.

> The other error seems to be that https://github.com/Lightning-AI/lightning-thunder/blob/c21533c12a2a826aee84e011c415b216cb6f779d/thunder/tests/distributed/test_ddp.py#L771-L777 expects slices and pads to appear in the top-level trace, whereas they are now fused into an nvFusion. cc @crcrpar
Ahh, yeah that may have made sense before but less so now.
@crcrpar, can we beg for your help here? Could you update the test to assert that the associated slice/pad ops are either top-level symbols or were claimed by a fusion executor? From thunder's point of view, either is acceptable. Something like the sketch below.
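A rough sketch of the relaxed assertion; the attribute names follow thunder's trace/BoundSymbol structure, but the prim names ("slice_prim", "pad") and the `fwd_trace`/`bwd_trace` variables are illustrative:

```python
# Accept an op either as a top-level bound symbol or inside a fusion
# region claimed by a fused executor (e.g. an nvFusion).
def trace_contains(trace, op_name: str) -> bool:
    for bsym in trace.bound_symbols:
        if bsym.sym.name == op_name:
            return True  # the op stayed in the top-level trace
        if any(sub.sym.name == op_name for sub in bsym.subsymbols):
            return True  # the op was fused; subsymbols record the original ops
    return False

assert trace_contains(fwd_trace, "slice_prim")
assert trace_contains(bwd_trace, "pad")
```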
I'll push more fixes tomorrow to tie up loose ends for CI.
Despite these pending CI failures, I was able to put some perf readings in https://github.com/Lightning-AI/lightning-thunder/pull/731, ~~but I've yet to digest them.~~ The results don't look good enough at this moment to merge.
Closed by #731
🚀 Feature
Bookend optimization was introduced back when nvFuser made unnecessary copies of tensors going through meta operations (view, reshape, transpose) that could have been resolved as views.
Since nvFuser now supports alias operations, we want to turn bookend optimization off by default.
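A sketch of what the flipped default means for users. The flag name below is taken from thunder's nvFuser executor options; treat it as an assumption if your version differs:

```python
import torch
import thunder

def f(x):
    # meta ops (transpose + reshape) feeding a compute op
    return torch.relu(x.transpose(1, 2).reshape(x.shape[0], -1))

# Bookend optimization peels leading/trailing meta ops out of the nvFuser
# region; passing nv_enable_bookend=False keeps them inside the fusion.
# This issue proposes making False the default.
jf = thunder.jit(f, nv_enable_bookend=False)
out = jf(torch.randn(2, 3, 4, device="cuda"))
```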
Motivation
It is beneficial to have meta operations inside a fusion. `reshape` is a good example: a reshape can end up being a real data copy, and we would want to fuse that copy with other operations. If `t3` is an input to a fusion, we can fuse the permute/reshape with other operations; similarly, if `t_2` is an output from a fusion, codegen can make a better decision on whether to fuse the reshape with the previous operation or to produce `t_2` in a format such that the following permute+reshape doesn't trigger a memory copy.
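A quick illustration of why the layout decision is better left to codegen (plain PyTorch, no thunder involved):

```python
import torch

x = torch.randn(2, 3, 4)

# On a contiguous tensor, reshape is a zero-copy view.
v = x.reshape(6, 4)
assert v.data_ptr() == x.data_ptr()

# After a permute, the tensor is non-contiguous and the same reshape must
# materialize a real copy -- exactly the kind of op worth fusing so the
# copy happens (or is avoided) together with neighboring compute.
t_2 = x.permute(0, 2, 1).reshape(2, 12)
assert t_2.data_ptr() != x.data_ptr()
```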
Pitch

The promise we (@wujingyue) made earlier:
Actionable item:
cc @tfogal @apaz-cli