csarofeen / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

`var_mean` fails when reduction occurs across all possible tensor dimensions #2486

Closed kevinstephano closed 1 year ago

kevinstephano commented 1 year ago

🐛 Describe the bug

This bug in `var_mean` is specific to the case where the reduction occurs across all dimensions of the input tensor. If you remove any one dimension from the reduction axes, it works fine (see the working variant after the repro below).

import torch
from nvfuser import FusionDefinition

inputs = [
    torch.randn(2, 2, 2, device='cuda'),
]

with FusionDefinition() as fd:
    t0 = fd.from_pytorch(inputs[0])
    # Reduce across all three axes with correction=0 (population variance).
    t1, t2 = fd.ops.var_mean(t0, [0, 1, 2], 0)
    fd.add_output(t1)
    fd.add_output(t2)

out = fd.execute(inputs)
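
For contrast, a minimal sketch of the passing case described above: dropping one axis from the reduction (here axis 2, chosen arbitrarily) avoids the failure. `fd2` is just a fresh definition name; the inputs and imports are reused from the repro.

# Same setup, but reduce over only two of the three axes;
# per the report, this variant executes without the assert.
with FusionDefinition() as fd2:
    t0 = fd2.from_pytorch(inputs[0])
    t1, t2 = fd2.ops.var_mean(t0, [0, 1], 0)  # axis 2 left unreduced
    fd2.add_output(t1)
    fd2.add_output(t2)

out2 = fd2.execute(inputs)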

Error:

[W kernel_ir.cpp:532] Warning: Unroll required but not possible. Register allocation disabled. Loop index: i107 (function isUnrolled)
Traceback (most recent call last):
  File "/workspace/test.py", line 14, in <module>
    out = fd.execute(inputs)
  File "/opt/pytorch/pytorch/nvfuser/__init__.py", line 22, in execute
    return self._execute(inputs)
RuntimeError: false INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch-jit/third_party/nvfuser/csrc/executor.cpp":327, please report a bug to PyTorch. Allocations must be based on constant integers for local memory. However, found: T9_l[ iS61{( ceilDiv(( ceilDiv(( ceilDiv(( ceilDiv(( T0.size[0] * ( T0.size[1] * T0.size[2] ) ), 4) ), blockDim.x) ), 1) ), gridDim.x) )}, iblockIdx.x60{gridDim.x}, ithreadIdx.x57{blockDim.x}, iUS59{1}, iV55{4} ],  have dynamic allocations but are placed in local memory.

Versions

ToT (top of tree)

IvanYashchuk commented 1 year ago

Interesting; this case seems to be covered by a PyTorch OpInfo test: https://github.com/pytorch/pytorch/blob/c6d8d10b3e974019dae7ec91a85c6192c6d511fa/torch/testing/_internal/common_methods_invocations.py#L19823-L19826. Maybe it was working in a previous version of nvFuser.
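
For reference, the eager-mode call those samples exercise looks roughly like this (a sketch assuming a PyTorch build where `torch.var_mean` accepts the `correction` keyword; `correction=0` matches the third argument in the repro):

import torch

t = torch.randn(2, 2, 2, device='cuda')
# Population variance (correction=0) and mean over all dimensions;
# reducing every axis yields 0-dim (scalar) tensors.
var, mean = torch.var_mean(t, dim=[0, 1, 2], correction=0)
print(var.shape, mean.shape)  # torch.Size([]) torch.Size([])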

naoyam commented 1 year ago

Yes, this is a relatively recent regression.