Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source to source compiler for PyTorch. It enables using different hardware executors at once; across one or thousands of GPUs.
Apache License 2.0
1.16k stars 77 forks source link

ThunderFX failure: KeyError: 'l_stack0_' #1293

Open IvanYashchuk opened 1 day ago

IvanYashchuk commented 1 day ago

On the latest Thunder (dafc79d21c04769c5e9d1fb737b8cd21d0841e69) I tried the snippet from https://github.com/Lightning-AI/lightning-thunder/issues/1174#issuecomment-2383823134 and got the following error:

File ~/dev/lightning-thunder/thunder/dynamo/compiler.py:81, in ThunderCompiler.__call__(self, gm, sample_args)
     77 recompile_graph(gm)
     79 # The whole graph may not be supported by `thunder`, so we split it in `thunder` supported sections
     80 # and unsupported sections which are passed to `torch.compile(backend='inductor')`
---> 81 split_module, subgraph_info = _splitter(gm, self._thunder_jit, self._torch_compile, sample_args)
     82 self.subgraph_infos.append(subgraph_info)
     83 return split_module

File ~/dev/lightning-thunder/thunder/dynamo/splitter.py:133, in _splitter(gm, thunder_jit, torch_inductor, _unused_sample_args)
    130     return partition_cnt
    132 # `split_module` iterates over nodes and determines the partition to place them based on the callback.
--> 133 split_gm: torch.fx.GraphModule = split_module(
    134     gm, root_m=None, split_callback=callback, keep_original_order=True, keep_original_node_name=True
    135 )
    137 def is_thunder_supported_partition(node: torch.fx.Node) -> bool:
    138     return node.name.startswith("submod") and int(node.name.replace("submod_", "")) in supported_partitions

File ~/dev/pytorch/main/torch/fx/passes/split_module.py:513, in split_module.<locals>.<lambda>(n)
    510 for node in m.graph.nodes:
    511     if node.op == "output":
    512         base_mod_graph.output(
--> 513             torch.fx.graph.map_arg(node.args[0], lambda n: base_mod_env[n.name])
    514         )  # noqa: B950
    516 return torch.fx.graph_module.GraphModule(base_mod_attrs, base_mod_graph)

BackendCompilerFailed: backend='<thunder.dynamo.compiler.ThunderCompiler object at 0x7fd27cb43310>' raised:
KeyError: 'l_stack0_'

My PyTorch version is '2.4.0a0+git3827810'.

IvanYashchuk commented 1 day ago

@kshitij12345, could you please take a look? Is it Thunder's problem or PyTorch FX's problem?

kshitij12345 commented 1 day ago

I am seeing two separate failures in different environment (with more recent PyTorch) both of which seem to occur after the splitting phase. So, I think this particular KeyError is coming from fx.split_module which has been fixed in PyTorch core.

Env 1 - In local setup, PyTorch version 2.5.0a0+git08b5e07, nvFuser version 0.2.13+git84e5d23 and thunder commit 30e4aa1e67005c58219d7f06b46836eedb74b27a, I am seeing

File "/home/kkalambarkar/lightning-thunder/thunder/core/trace_interpreter.py", line 63, in interpret_trace
    prim_func = symbol_mapper(symbol) if symbol_mapper is not None else symbol.sym
  File "/home/kkalambarkar/lightning-thunder/thunder/core/transforms.py", line 2514, in vjp_symbol_mapper
    raise NotImplementedError(f"VJP for {symbol.sym.id} is not implemented")
NotImplementedError: VJP for PrimIDs.COPY_WITH_SETITEM is not implemented

We already have an issue for the same - https://github.com/Lightning-AI/lightning-thunder/issues/1240

Env 2 (with latest versions of PyTorch, nvFuser and thunder) - Using the internal docker image with PyTorch version 2.6.0a0+gita777dea, nvFuser version 0.2.15+git7616b54 and thunder commit 30e4aa1e67005c58219d7f06b46836eedb74b27a, I am seeing error from nvFuser. Will file an issue with nvFuser for the same. EDIT - Issue filed at https://github.com/NVIDIA/Fuser/issues/3176

nvFuser Error ```python An error occurred while executing nvFuser FusionDefinition 32. If you believe this is a bug or need assistance, please file an issue at https://github.com/NVIDIA/Fuser/issues/new Here's a script to reproduce the error: ```python # CUDA devices: # 0: NVIDIA RTX 6000 Ada Generation # 1: NVIDIA RTX 6000 Ada Generation # torch version: 2.6.0a0+gita777dea # cuda version: 12.6 # nvfuser version: 0.2.15+git7616b54 import torch from nvfuser import FusionDefinition, DataType def nvfuser_fusion_id32(fd : FusionDefinition) -> None : T0 = fd.define_tensor(shape=[1, 597, 128], contiguity=[None, True, True], dtype=DataType.BFloat16, is_cpu=False, stride_order=[2, 1, 0]) T1 = fd.define_tensor(shape=[1, 32, 597, 128], contiguity=[None, True, True, True], dtype=DataType.BFloat16, is_cpu=False, stride_order=[3, 1, 2, 0]) T2 = fd.define_tensor(shape=[1, 32, 597, 128], contiguity=[None, True, True, True], dtype=DataType.BFloat16, is_cpu=False, stride_order=[3, 2, 1, 0]) T3 = fd.define_tensor(shape=[1, 32, 597, 128], contiguity=[None, True, True, True], dtype=DataType.BFloat16, is_cpu=False, stride_order=[3, 1, 2, 0]) T4 = fd.define_tensor(shape=[1, 597, 128], contiguity=[None, True, True], dtype=DataType.BFloat16, is_cpu=False, stride_order=[2, 1, 0]) T5 = fd.define_tensor(shape=[1, 32, 597, 128], contiguity=[None, True, True, True], dtype=DataType.BFloat16, is_cpu=False, stride_order=[3, 1, 2, 0]) T6 = fd.define_tensor(shape=[1, 32, 597, 128], contiguity=[None, True, True, True], dtype=DataType.BFloat16, is_cpu=False, stride_order=[3, 2, 1, 0]) S7 = fd.define_scalar(1, dtype=DataType.Int) S8 = fd.define_scalar(1, dtype=DataType.Int) S9 = fd.define_scalar(597, dtype=DataType.Int) S10 = fd.define_scalar(128, dtype=DataType.Int) T12 = fd.ops.broadcast_in_dim(T0, shape=[S7, S8, S9, S10], broadcast_dims=[0, 2, 3]) S13 = fd.define_scalar(1, dtype=DataType.Int) S14 = fd.define_scalar(32, dtype=DataType.Int) S15 = fd.define_scalar(597, dtype=DataType.Int) S16 = fd.define_scalar(128, dtype=DataType.Int) T18 = fd.ops.broadcast_in_dim(T12, shape=[S13, S14, S15, S16], broadcast_dims=[0, 1, 2, 3]) T19 = fd.ops.cast(T1, dtype=DataType.Float) T20 = fd.ops.cast(T2, dtype=DataType.Float) T21 = fd.ops.cast(T3, dtype=DataType.Float) T22 = fd.ops.cast(T18, dtype=DataType.Float) T23 = fd.ops.add(T20, T19) T24 = fd.ops.mul(T22, T21) T25 = fd.ops.mul(T22, T23) T26 = fd.ops.cast(T24, dtype=DataType.BFloat16) T27 = fd.ops.cast(T25, dtype=DataType.BFloat16) T43 = fd.ops.slice(T26, start_indices=[0, 0, 0, 0], end_indices=[1, 32, 597, 64], strides=[1, 1, 1, 1]) T59 = fd.ops.slice(T27, start_indices=[0, 0, 0, 0], end_indices=[1, 32, 597, 64], strides=[1, 1, 1, 1]) T60 = fd.ops.cast(T43, dtype=DataType.Float) T61 = fd.ops.cast(T59, dtype=DataType.Float) T62 = fd.ops.neg(T60) T63 = fd.ops.neg(T61) S64 = fd.define_scalar(1, dtype=DataType.Int) S65 = fd.define_scalar(1, dtype=DataType.Int) S66 = fd.define_scalar(597, dtype=DataType.Int) S67 = fd.define_scalar(128, dtype=DataType.Int) T69 = fd.ops.broadcast_in_dim(T4, shape=[S64, S65, S66, S67], broadcast_dims=[0, 2, 3]) T85 = fd.ops.slice(T26, start_indices=[0, 0, 0, 64], end_indices=[1, 32, 597, 128], strides=[1, 1, 1, 1]) T86 = fd.ops.cast(T62, dtype=DataType.BFloat16) T102 = fd.ops.slice(T27, start_indices=[0, 0, 0, 64], end_indices=[1, 32, 597, 128], strides=[1, 1, 1, 1]) T103 = fd.ops.cast(T63, dtype=DataType.BFloat16) S104 = fd.define_scalar(1, dtype=DataType.Int) S105 = fd.define_scalar(32, dtype=DataType.Int) S106 = fd.define_scalar(597, dtype=DataType.Int) S107 = fd.define_scalar(128, dtype=DataType.Int) T109 = fd.ops.broadcast_in_dim(T69, shape=[S104, S105, S106, S107], broadcast_dims=[0, 1, 2, 3]) S110 = fd.define_scalar(0.00000, dtype=DataType.Double) T111 = fd.ops.pad(T85, [0, 64, 0, 0, 0, 0, 0, 0], S110) S112 = fd.define_scalar(0.00000, dtype=DataType.Double) T113 = fd.ops.pad(T86, [64, 0, 0, 0, 0, 0, 0, 0], S112) S114 = fd.define_scalar(0.00000, dtype=DataType.Double) T115 = fd.ops.pad(T102, [0, 64, 0, 0, 0, 0, 0, 0], S114) S116 = fd.define_scalar(0.00000, dtype=DataType.Double) T117 = fd.ops.pad(T103, [64, 0, 0, 0, 0, 0, 0, 0], S116) T118 = fd.ops.cast(T109, dtype=DataType.Float) T119 = fd.ops.cast(T111, dtype=DataType.Float) T120 = fd.ops.cast(T113, dtype=DataType.Float) T121 = fd.ops.cast(T115, dtype=DataType.Float) T122 = fd.ops.cast(T117, dtype=DataType.Float) T123 = fd.ops.mul(T118, T21) T124 = fd.ops.add(T120, T119) T125 = fd.ops.mul(T118, T23) T126 = fd.ops.add(T122, T121) T127 = fd.ops.cast(T5, dtype=DataType.Float) T128 = fd.ops.cast(T6, dtype=DataType.Float) T129 = fd.ops.add(T124, T123) T130 = fd.ops.add(T126, T125) T131 = fd.ops.add(T128, T127) T132 = fd.ops.cast(T129, dtype=DataType.BFloat16) T133 = fd.ops.cast(T130, dtype=DataType.BFloat16) T134 = fd.ops.cast(T131, dtype=DataType.BFloat16) T135 = fd.ops.permute(T132, dims=[0, 2, 1, 3]) T136 = fd.ops.permute(T133, dims=[0, 2, 1, 3]) T137 = fd.ops.permute(T134, dims=[0, 2, 1, 3]) T142 = fd.ops.reshape(T135, new_shape=[1, 597, 4096]) T147 = fd.ops.reshape(T136, new_shape=[1, 597, 4096]) T152 = fd.ops.reshape(T137, new_shape=[1, 597, 4096]) T156 = fd.ops.reshape(T142, new_shape=[597, 4096]) T160 = fd.ops.reshape(T147, new_shape=[597, 4096]) T164 = fd.ops.reshape(T152, new_shape=[597, 4096]) T165 = fd.ops.permute(T156, dims=[1, 0]) T166 = fd.ops.permute(T160, dims=[1, 0]) T167 = fd.ops.permute(T164, dims=[1, 0]) fd.add_output(T164) fd.add_output(T167) fd.add_output(T160) fd.add_output(T166) fd.add_output(T156) fd.add_output(T165) with FusionDefinition() as fd: nvfuser_fusion_id32(fd) inputs = [ torch.randn(76416, dtype=torch.bfloat16, device='cuda:0').as_strided((1, 597, 128), (76416, 128, 1)), torch.randn(2445312, dtype=torch.bfloat16, device='cuda:0').as_strided((1, 32, 597, 128), (2445312, 128, 4096, 1)), torch.randn(2445312, dtype=torch.bfloat16, device='cuda:0').as_strided((1, 32, 597, 128), (2445312, 76416, 128, 1)), torch.randn(2445312, dtype=torch.bfloat16, device='cuda:0').as_strided((1, 32, 597, 128), (2445312, 128, 4096, 1)), torch.randn(76416, dtype=torch.bfloat16, device='cuda:0').as_strided((1, 597, 128), (76416, 128, 1)), torch.randn(2445312, dtype=torch.bfloat16, device='cuda:0').as_strided((1, 32, 597, 128), (2445312, 128, 4096, 1)), torch.randn(2445312, dtype=torch.bfloat16, device='cuda:0').as_strided((1, 32, 597, 128), (2445312, 76416, 128, 1)), ] fd.execute(inputs) ``` Traceback (most recent call last): File "/opt/pytorch/nvfuser/nvfuser/__init__.py", line 181, in execute results = self._execute( RuntimeError: INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/runtime/fusion_kernel_runtime.cpp":358, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Detected exception while compiling fusion segments in parallel. Error messages from all threads are printed below. Error from segmentation group 10: INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/index_compute.cpp":1965, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Couldn't find allocation mapping for T71_l___bfloat[ iblockIdx.x488{( ceilDiv(( ceilDiv(128, 4) ), blockDim.x) )}, iblockIdx.y491{( ceilDiv(( 1 * 597 ), 1) )}, iUS492{1}, iV487{4}, ithreadIdx.x489{blockDim.x} ] ca_pos( 3 ) dim: 2 id: iS301{128}, loops: iblockIdx.x335{( ceilDiv(( ceilDiv(4096, 4) ), blockDim.x) )} iblockIdx.y435{( ceilDiv(( 1 * 597 ), 1) )} iUS436{1} iV487{4} ithreadIdx.x489{blockDim.x} Exception raised from getNonGlobalConsumerStridedIndices at /opt/pytorch/nvfuser/csrc/index_compute.cpp:1965 (most recent call first): frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string, std::allocator > const&) + 0xf3 (0x7623932bd099 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string, std::allocator > const&) + 0x53 (0x76239366f303 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #2: + 0x5d9518 (0x762393758518 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #3: + 0x5d981f (0x76239375881f in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #4: + 0x5d9e70 (0x762393758e70 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #5: + 0x41df4e (0x76239359cf4e in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #6: + 0x425aeb (0x7623935a4aeb in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #7: + 0x4273bf (0x7623935a63bf in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #8: + 0x4243ef (0x7623935a33ef in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #9: + 0x4243ef (0x7623935a33ef in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #10: + 0x4243ef (0x7623935a33ef in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #11: + 0x4273bf (0x7623935a63bf in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #12: + 0x4243ef (0x7623935a33ef in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #13: + 0x4243ef (0x7623935a33ef in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #14: + 0x41d8bb (0x76239359c8bb in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #15: + 0x3eb0ae (0x76239356a0ae in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #16: nvfuser::GpuLower::run() + 0x239 (0x762393565fe9 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #17: nvfuser::FusionExecutor::compileFusion(nvfuser::Fusion*, nvfuser::KernelArgumentHolder const&, nvfuser::LaunchParams const&, nvfuser::CompileParams, nvfuser::SchedulerType, long, long, long, long) + 0xa8a (0x762393938b1a in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #18: + 0x7f1572 (0x762393970572 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #19: + 0x7f183c (0x76239397083c in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #20: c10::ThreadPool::main_loop(unsigned long) + 0x2bd (0x762478d87fbd in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #21: + 0xdc253 (0x7624bf198253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #22: + 0x94ac3 (0x7624bf37eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #23: + 0x126850 (0x7624bf410850 in /usr/lib/x86_64-linux-gnu/libc.so.6) Use NVFUSER_DISABLE=parallel_compile to simplify error message. Exception raised from compileFusionParallel at /opt/pytorch/nvfuser/csrc/runtime/fusion_kernel_runtime.cpp:358 (most recent call first): frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string, std::allocator > const&) + 0xf3 (0x7623932bd099 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string, std::allocator > const&) + 0x53 (0x76239366f303 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #2: nvfuser::FusionKernelRuntime::compileFusionParallel(nvfuser::KernelArgumentHolder) + 0xf9d (0x76239397213d in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #3: nvfuser::FusionExecutorCache::runFusionWithInputs(c10::ArrayRef const&, std::optional, std::optional) + 0x1cb (0x76239396befb in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #4: nvfuser::python_frontend::FusionDefinition::execute(c10::ArrayRef const&, std::optional, bool, bool, bool) const + 0x796 (0x762393ad9d76 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #5: + 0x1ca17e (0x76239334917e in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #6: + 0x24838f (0x7623933c738f in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #7: + 0x2dd850 (0x76239345c850 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #32: torch::autograd::PyNode::apply(std::vector >&&) + 0x95 (0x7624b21b3c75 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so) frame #33: + 0x4fb908b (0x7624a992708b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so) frame #34: torch::autograd::Engine::evaluate_function(std::shared_ptr&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr const&) + 0xfd6 (0x7624a99214a6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so) frame #35: torch::autograd::Engine::thread_main(std::shared_ptr const&) + 0x56c (0x7624a99224ec in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so) frame #36: torch::autograd::Engine::thread_init(int, std::shared_ptr const&, bool) + 0x2ad (0x7624a991aa7d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so) frame #37: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr const&, bool) + 0x75 (0x7624b21ae685 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so) frame #38: + 0xdc253 (0x7624bf198253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #39: + 0x94ac3 (0x7624bf37eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #40: + 0x126850 (0x7624bf410850 in /usr/lib/x86_64-linux-gnu/libc.so.6) Traceback (most recent call last): File "/opt/pytorch/lightning-thunder/test.py", line 28, in out.loss.backward() File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 624, in backward torch.autograd.backward( File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 347, in backward _engine_run_backward( File "/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py", line 825, in _engine_run_backward return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 307, in apply return user_fn(self, *args) File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 600, in wrapper outputs = fn(ctx, *args) File "/opt/pytorch/lightning-thunder/thunder/executors/torch_autograd.py", line 96, in backward grads = ctx.compiled_backward([saved_tensors_list, ctx.saved_other], args) File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context return func(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast return func(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast return func(*args, **kwargs) File "thunder.backward_fn_140", line 451, in backward_fn File "/opt/pytorch/lightning-thunder/thunder/executors/nvfuserex_impl.py", line 456, in __call__ return fd.execute(args, **kwargs) File "/opt/pytorch/nvfuser/nvfuser/__init__.py", line 181, in execute results = self._execute( RuntimeError: INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/runtime/fusion_kernel_runtime.cpp":358, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Detected exception while compiling fusion segments in parallel. Error messages from all threads are printed below. Error from segmentation group 10: INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/index_compute.cpp":1965, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Couldn't find allocation mapping for T71_l___bfloat[ iblockIdx.x488{( ceilDiv(( ceilDiv(128, 4) ), blockDim.x) )}, iblockIdx.y491{( ceilDiv(( 1 * 597 ), 1) )}, iUS492{1}, iV487{4}, ithreadIdx.x489{blockDim.x} ] ca_pos( 3 ) dim: 2 id: iS301{128}, loops: iblockIdx.x335{( ceilDiv(( ceilDiv(4096, 4) ), blockDim.x) )} iblockIdx.y435{( ceilDiv(( 1 * 597 ), 1) )} iUS436{1} iV487{4} ithreadIdx.x489{blockDim.x} Exception raised from getNonGlobalConsumerStridedIndices at /opt/pytorch/nvfuser/csrc/index_compute.cpp:1965 (most recent call first): frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string, std::allocator > const&) + 0xf3 (0x7623932bd099 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string, std::allocator > const&) + 0x53 (0x76239366f303 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #2: + 0x5d9518 (0x762393758518 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #3: + 0x5d981f (0x76239375881f in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #4: + 0x5d9e70 (0x762393758e70 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #5: + 0x41df4e (0x76239359cf4e in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #6: + 0x425aeb (0x7623935a4aeb in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #7: + 0x4273bf (0x7623935a63bf in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #8: + 0x4243ef (0x7623935a33ef in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #9: + 0x4243ef (0x7623935a33ef in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #10: + 0x4243ef (0x7623935a33ef in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #11: + 0x4273bf (0x7623935a63bf in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #12: + 0x4243ef (0x7623935a33ef in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #13: + 0x4243ef (0x7623935a33ef in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #14: + 0x41d8bb (0x76239359c8bb in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #15: + 0x3eb0ae (0x76239356a0ae in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #16: nvfuser::GpuLower::run() + 0x239 (0x762393565fe9 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #17: nvfuser::FusionExecutor::compileFusion(nvfuser::Fusion*, nvfuser::KernelArgumentHolder const&, nvfuser::LaunchParams const&, nvfuser::CompileParams, nvfuser::SchedulerType, long, long, long, long) + 0xa8a (0x762393938b1a in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #18: + 0x7f1572 (0x762393970572 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #19: + 0x7f183c (0x76239397083c in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #20: c10::ThreadPool::main_loop(unsigned long) + 0x2bd (0x762478d87fbd in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #21: + 0xdc253 (0x7624bf198253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #22: + 0x94ac3 (0x7624bf37eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #23: + 0x126850 (0x7624bf410850 in /usr/lib/x86_64-linux-gnu/libc.so.6) Use NVFUSER_DISABLE=parallel_compile to simplify error message. Exception raised from compileFusionParallel at /opt/pytorch/nvfuser/csrc/runtime/fusion_kernel_runtime.cpp:358 (most recent call first): frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string, std::allocator > const&) + 0xf3 (0x7623932bd099 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string, std::allocator > const&) + 0x53 (0x76239366f303 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #2: nvfuser::FusionKernelRuntime::compileFusionParallel(nvfuser::KernelArgumentHolder) + 0xf9d (0x76239397213d in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #3: nvfuser::FusionExecutorCache::runFusionWithInputs(c10::ArrayRef const&, std::optional, std::optional) + 0x1cb (0x76239396befb in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #4: nvfuser::python_frontend::FusionDefinition::execute(c10::ArrayRef const&, std::optional, bool, bool, bool) const + 0x796 (0x762393ad9d76 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #5: + 0x1ca17e (0x76239334917e in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #6: + 0x24838f (0x7623933c738f in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #7: + 0x2dd850 (0x76239345c850 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #32: torch::autograd::PyNode::apply(std::vector >&&) + 0x95 (0x7624b21b3c75 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so) frame #33: + 0x4fb908b (0x7624a992708b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so) frame #34: torch::autograd::Engine::evaluate_function(std::shared_ptr&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr const&) + 0xfd6 (0x7624a99214a6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so) frame #35: torch::autograd::Engine::thread_main(std::shared_ptr const&) + 0x56c (0x7624a99224ec in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so) frame #36: torch::autograd::Engine::thread_init(int, std::shared_ptr const&, bool) + 0x2ad (0x7624a991aa7d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so) frame #37: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr const&, bool) + 0x75 (0x7624b21ae685 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so) frame #38: + 0xdc253 (0x7624bf198253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #39: + 0x94ac3 (0x7624bf37eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #40: + 0x126850 (0x7624bf410850 in /usr/lib/x86_64-linux-gnu/libc.so.6) ```