NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

RuntimeError: hasCompiledKernel() INTERNAL ASSERT FAILED at "Fuser/csrc/executor.cpp":1864 #2642

Open t-vi opened 1 month ago

t-vi commented 1 month ago

Making this fast might not be much of a priority, but sending a dequant op through Thunder crashes, which it should not.

An error occurred while executing nvFuser FusionDefinition 1.

import torch
from nvfuser import FusionDefinition, DataType

def nvfuser_fusion_id1(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[-1, -1, -1], contiguity=[True, True, None], dtype=DataType.Int, is_cpu=False, stride_order=[2, 1, 0])
    T1 = fd.define_tensor(shape=[-1], contiguity=[True], dtype=DataType.Int, is_cpu=False, stride_order=[0])
    T2 = fd.ops.signbit(T0)
    T3 = fd.ops.signbit(T1)
    S4 = fd.define_scalar(4, dtype=DataType.Int)
    S5 = fd.define_scalar(4, dtype=DataType.Int)
    S6 = fd.define_scalar(2, dtype=DataType.Int)
    V7 = fd.define_vector([S4, S5, S6], dtype=DataType.Int)
    T8 = fd.ops.broadcast_in_dim(T3, shape=V7, broadcast_dims=[2])
    T9 = fd.ops.ne(T2, T8)
    S10 = fd.define_scalar(4, dtype=DataType.Int)
    S11 = fd.define_scalar(4, dtype=DataType.Int)
    S12 = fd.define_scalar(2, dtype=DataType.Int)
    V13 = fd.define_vector([S10, S11, S12], dtype=DataType.Int)
    T14 = fd.ops.broadcast_in_dim(T1, shape=V13, broadcast_dims=[2])
    T15 = fd.ops.fmod(T0, T14)
    S16 = fd.define_scalar(0, dtype=DataType.Int)
    T17 = fd.ops.ne(T15, S16)
    T18 = fd.ops.bitwise_and(T9, T17)
    T19 = fd.ops.reciprocal(T14)        # reciprocal promotes the Int tensor to float
    T20 = fd.ops.mul(T0, T19)           # Int * float -> float
    T21 = fd.ops.cast(T18, dtype=DataType.Int)
    T22 = fd.ops.sub(T20, T21)          # float - Int -> float
    S23 = fd.define_scalar(15, dtype=DataType.Int)
    T24 = fd.ops.bitwise_and(T22, S23)  # bitwise op on a float tensor
    fd.add_output(T24)

with FusionDefinition() as fd:
    nvfuser_fusion_id1(fd)

inputs = [
    torch.randint(0, 10, (16,), dtype=torch.int64, device='cuda:0').as_strided((4, 4, 2), (4, 1, 0)),
    torch.randint(0, 10, (2,), dtype=torch.int64, device='cuda:0').as_strided((2,), (1,)),
]
fd.execute(inputs)

Traceback (most recent call last):

  File "/usr/local/lib/python3.11/dist-packages/nvfuser/__init__.py", line 145, in execute
    result = self._execute(
             ^^^^^^^^^^^^^^
RuntimeError: hasCompiledKernel() INTERNAL ASSERT FAILED at "Fuser/csrc/executor.cpp":1864, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Cannot set dynamic smem size unless kernel is compiled
Exception raised from ensureAvailableDynamicSmemSize at Fuser/csrc/executor.cpp:1864 (most recent call first):
frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xc2 (0x7f88703148c5 in /usr/local/lib/python3.11/dist-packages/nvfuser/_C.cpython-311-x86_64-linux-gnu.so)
frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x3e (0x7f887061926e in /usr/local/lib/python3.11/dist-packages/nvfuser/_C.cpython-311-x86_64-linux-gnu.so)
frame #2: <unknown function> + 0x41d0b6 (0x7f887061d0b6 in /usr/local/lib/python3.11/dist-packages/nvfuser/_C.cpython-311-x86_64-linux-gnu.so)
frame #3: nvfuser::FusionExecutor::runFusion(nvfuser::KernelArgumentHolder&, nvfuser::LaunchParams const&, nvfuser::CompileParams, std::vector<at::Tensor, std::allocator<at::Tensor> >) + 0x24d3 (0x7f887062e8b3 in /usr/local/lib/python3.11/dist-packages/nvfuser/_C.cpython-311-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x601dbc (0x7f8870801dbc in /usr/local/lib/python3.11/dist-packages/nvfuser/_C.cpython-311-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x6080c7 (0x7f88708080c7 in /usr/local/lib/python3.11/dist-packages/nvfuser/_C.cpython-311-x86_64-linux-gnu.so)
frame #6: nvfuser::FusionKernelRuntime::runWithInputs(nvfuser::KernelArgumentHolder&) + 0x96 (0x7f8870808f76 in /usr/local/lib/python3.11/dist-packages/nvfuser/_C.cpython-311-x86_64-linux-gnu.so)
frame #7: nvfuser::FusionExecutorCache::runFusionWithInputs(c10::ArrayRef<c10::IValue> const&, std::optional<nvfuser::PrimDataType>, std::optional<signed char>) + 0x3f6 (0x7f88708144f6 in /usr/local/lib/python3.11/dist-packages/nvfuser/_C.cpython-311-x86_64-linux-gnu.so)
frame #8: nvfuser::python_frontend::FusionDefinition::execute(c10::ArrayRef<c10::IValue> const&, std::optional<signed char>, bool, bool, bool) const + 0x257 (0x7f8870a0e457 in /usr/local/lib/python3.11/dist-packages/nvfuser/_C.cpython-311-x86_64-linux-gnu.so)
frame #9: <unknown function> + 0x16f6fe (0x7f887036f6fe in /usr/local/lib/python3.11/dist-packages/nvfuser/_C.cpython-311-x86_64-linux-gnu.so)
frame #10: <unknown function> + 0x2134ef (0x7f88704134ef in /usr/local/lib/python3.11/dist-packages/nvfuser/_C.cpython-311-x86_64-linux-gnu.so)
frame #11: <unknown function> + 0x268057 (0x7f8870468057 in /usr/local/lib/python3.11/dist-packages/nvfuser/_C.cpython-311-x86_64-linux-gnu.so)
nikitaved commented 1 month ago

It fails for me with a compilation error. Here is the problematic bit of the generated CUDA kernel:

...
    float f3;
    f3 = (float)(15LL);
...
    float T18[1];
    T18[0]
      = T15[0]
      - T17[0];
    float T22[1];
    T22[0]
      = T18[0]
      & f3;

Note that it tries to apply the bitwise AND operator & to float operands in T18[0] & f3, which does not compile.
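
For reference (an illustration of the same restriction, not taken from the issue), eager-mode PyTorch also rejects bitwise ops on floating-point tensors, so the dtype rule being violated here is the standard one:

    import torch

    a = torch.tensor([3.0])      # floating-point tensor
    b = torch.tensor([15])       # integer tensor
    try:
        torch.bitwise_and(a, b)  # bitwise ops are defined only for integral/bool dtypes
    except RuntimeError as e:
        print(e)                 # e.g. "bitwise_and_cpu" not implemented for 'Float'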

EDIT:

Take a look at the following part of the definition:

    T0 = ... integer tensor ...
    T19 = fd.ops.reciprocal(T14)
    T20 = fd.ops.mul(T0, T19)
    T21 = fd.ops.cast(T18, dtype=DataType.Int)
    T22 = fd.ops.sub(T20, T21)
    S23 = fd.define_scalar(15, dtype=DataType.Int)
    T24 = fd.ops.bitwise_and(T22, S23)

The very last bitwise_and operates on T22, which is a floating-point tensor: reciprocal promotes to float, so T20 and therefore T22 are float as well, and bitwise ops are only defined for integral types.
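
One way to make the tail of the definition type-correct (a sketch, assuming the reciprocal/mul/sub sequence is meant to implement an integer division; T22_int is a hypothetical name, not the fix shipped by nvFuser) is to cast back to Int before the bitwise op:

    T22_int = fd.ops.cast(T22, dtype=DataType.Int)  # undo the float promotion from reciprocal/mul
    S23 = fd.define_scalar(15, dtype=DataType.Int)
    T24 = fd.ops.bitwise_and(T22_int, S23)          # both operands now integral
    fd.add_output(T24)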

jacobhinkle commented 1 month ago

As of #2645, this fusion now raises an error during fusion definition, since a float is passed to a bitwise op.
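
With that change the failure should surface while the fusion is being defined rather than at execute time. A rough sketch of what a caller would observe, assuming the post-#2645 frontend raises a RuntimeError from within the definition block:

    from nvfuser import FusionDefinition, DataType

    try:
        with FusionDefinition() as fd:
            T0 = fd.define_tensor(shape=[-1], contiguity=[True], dtype=DataType.Float, is_cpu=False, stride_order=[0])
            S1 = fd.define_scalar(15, dtype=DataType.Int)
            fd.add_output(fd.ops.bitwise_and(T0, S1))  # float operand: expected to be rejected
    except RuntimeError as e:
        print(f"rejected at definition time: {e}")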