csarofeen / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

broadcast_in_dim: The size of contiguity must equal to the number of non-broadcasting IterDomains #2549

Open IvanYashchuk opened 1 year ago

IvanYashchuk commented 1 year ago

🐛 Describe the bug

from nvfuser import FusionDefinition, DataType
import torch

def nvfuser_fusion_id2(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(symbolic_sizes=[-1, 1], contiguous=[True, True], dtype=DataType.Double, is_cpu=False)
    T1 = fd.define_tensor(symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Double, is_cpu=False)
    T2 = fd.ops.broadcast_in_dim(T0, output_shape=[4, 4], broadcast_dims=[0, 1])
    T3 = fd.ops.div(T1, T2)
    fd.add_output(T3)

a = torch.ones(4, 1, dtype=torch.double, device='cuda')
b = torch.ones(4, 4, dtype=torch.double, device='cuda')

# RuntimeError: The size of contiguity must equal to the number of non-broadcasting IterDomains
with FusionDefinition() as fd:
    nvfuser_fusion_id2(fd)

fd.execute([a, b])

Versions

devel

naoyam commented 1 year ago

This is likely related to the recent contiguity cleanup by @zasdfgbnm.

zasdfgbnm commented 1 year ago

I think it is an issue of the repro.

The definition of contiguity changed a few hours ago. Broadcast dimensions no longer have a contiguity entry, and a non-broadcast dimension is contiguous if and only if its stride equals stride[i] * shape[i], where i is the next non-broadcast dimension. For example, for a tensor with shape [8, 10, 5] and stride [5, 0, 1], the contiguity used to be [False, False, True], but is now [True, True]. The new definition allows larger vectorize sizes (in this example we can now vectorize 8, whereas previously we could not vectorize at all). In this repro, the second dimension of T0 is broadcast, so its contiguity should be [True].
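The new rule can be sketched in plain Python. This is a hypothetical illustration of the behavior described above, not nvfuser's actual compute_contiguity implementation; it assumes a dimension counts as broadcast when its size is 1 or its stride is 0:

```python
def new_contiguity(shape, stride):
    """Sketch of the new contiguity definition: broadcast dims get no
    entry; a non-broadcast dim is contiguous iff its stride equals
    stride[i] * shape[i], where i is the next non-broadcast dim to its
    right (hypothetical helper, not nvfuser's compute_contiguity)."""
    contiguity = []
    expected = 1  # stride the innermost non-broadcast dim must have
    for size, st in reversed(list(zip(shape, stride))):
        if size == 1 or st == 0:  # assumed broadcast marker: no entry
            continue
        contiguity.append(st == expected)
        expected = st * size
    return list(reversed(contiguity))

# The example from the comment above (old contiguity: [False, False, True]):
print(new_contiguity([8, 10, 5], [5, 0, 1]))  # [True, True]
# T0 from the repro, torch.ones(4, 1) with stride (1, 1):
print(new_contiguity([4, 1], [1, 1]))         # [True]
```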

My question for @IvanYashchuk is:

  1. Where does the contiguity of T0 come from? Is it manual, or does it come from some code?
  2. If it comes from code, is it easy to change that code? If not (for example, if there are backward compatibility concerns), I can add a compatibility layer for it.

zasdfgbnm commented 1 year ago

FYI: this is the PR that changes the definition of contiguity: https://github.com/csarofeen/pytorch/pull/2517

IvanYashchuk commented 1 year ago

Where does the contiguity of T0 come from? Is it manual, or does it come from some code? If it comes from code, is it easy to change that code?

It's generated from code, and we can change it. Thank you for clarifying what's going on; changing T0's contiguity to [True] works.

We should

IvanYashchuk commented 1 year ago

Versioning is probably not strictly needed; the code generator should be using the compute_contiguity function, but it probably isn't.

IvanYashchuk commented 1 year ago

Alright, here's the real problem: the code generator actually uses a different overload of define_tensor, with explicit strides=:

from nvfuser import FusionDefinition, DataType
import torch

a = torch.ones(4, 1, dtype=torch.double, device='cuda')
b = torch.ones(4, 4, dtype=torch.double, device='cuda')

def nvfuser_fusion_id2(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(sizes=a.shape, strides=a.stride(), dtype=DataType.Double, is_cpu=False)
    T1 = fd.define_tensor(sizes=b.shape, strides=b.stride(), dtype=DataType.Double, is_cpu=False)
    T2 = fd.ops.broadcast_in_dim(T0, output_shape=[4, 4], broadcast_dims=[0, 1])
    T3 = fd.ops.div(T1, T2)
    fd.add_output(T3)

# RuntimeError: The size of contiguity must equal to the number of non-broadcasting IterDomains
with FusionDefinition() as fd:
    nvfuser_fusion_id2(fd)

print(fd.execute([a, b]))