csarofeen / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

Indexing failure #2560

Closed: naoyam closed this issue 1 year ago

naoyam commented 1 year ago

Indexing fails in a fusion that broadcasts one input, adds it to a second input, and reduces the result over all axes:

TEST_F(NVFuserTest, IndexingFail_CUDA) {
  Fusion fusion;
  FusionGuard fg(&fusion);

  auto tv0 = makeSymbolicTensor(1);
  fusion.addInput(tv0);
  auto tv1 = makeSymbolicTensor(2);
  fusion.addInput(tv1);

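  // tv2: [i0, b1], tv0 broadcast along a new inner axis
  auto tv2 = broadcast(tv0, {false, true});
  // tv3: [i0, i3], elementwise add with tv1
  auto tv3 = add(tv2, tv1);
  // tv4: sum over both axes
  auto tv4 = sum(tv3, {0, 1});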
  fusion.addOutput(tv4);

  fusion.printMath();

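  // Merge the two reduction axes, then split by 4:
  // [r{i0*i3}] -> [r{ceilDiv(i0*i3, 4)}, r{4}]
  tv4->merge(0);
  tv4->split(0, 4);
  // rFactor the inner axis: tv5 reduces each chunk of 4, tv4 reduces the partials
  auto tv5 = tv4->rFactor({1});

  // Propagate tv5's transformations to the rest of the fusion
  MaxRootDomainInfoSpanningTree tree(tv5);
  TransformPropagator tp(tv5);
  tree.traverse(&tp);

  // Inline all tensors at position 1 (the outer split axis), best effort
  inlineAllAt(tv4, 1, true);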

  fusion.printMath();

  fusion.printKernel();
}

Output (the fusion math before and after scheduling, followed by the test failure):

Inputs:
  T0_g[ iS0{i0} ], float
  T1_g[ iS1{i2}, iS2{i3} ], float
Outputs:
  T4_g[ rS7{i0}, rS8{i3} ], float

%kernel_math {
T2_l[ iS3{i0}, bS4{1} ]
   = broadcast( T0_g[ iS0{i0} ] )
T3_l[ iS5{i0}, iS6{i3} ]
   = T2_l[ iS3{i0}, bS4{1} ]
   + T1_g[ iS1{i2}, iS2{i3} ];
T4_g[ rS7{i0}, rS8{i3} ]
   = reduction( T3_l[ iS5{i0}, iS6{i3} ], op = add, initial value = double(0), allreduce = false )
}

Inputs:
  T0_g[ iS27{( ceilDiv(i0, 4) )}, iS28{4} ], float
  T1_g[ iS22{( ceilDiv(( i2 * i3 ), 4) )}, iS23{4} ], float
Outputs:
  T4_g[ rS17{( ceilDiv(( i0 * i3 ), 4) )} ] produce_pos( 1 ), float

%kernel_math {
T2_l[ iS25{( ceilDiv(( i0 * 1 ), 4) )}, iS26{4} ] ca_pos( 1 )
   = broadcast( T0_g[ iS27{( ceilDiv(i0, 4) )}, iS28{4} ] )
T3_l[ iS19{( ceilDiv(( i0 * i3 ), 4) )}, iS20{4} ] ca_pos( 1 ) produce_pos( 1 )
   = T2_l[ iS25{( ceilDiv(( i0 * 1 ), 4) )}, iS26{4} ] ca_pos( 1 )
   + T1_g[ iS22{( ceilDiv(( i2 * i3 ), 4) )}, iS23{4} ];
T5_l[ iS15{( ceilDiv(( i0 * i3 ), 4) )}rf, rS16{4}rf ] ca_pos( 1 ) produce_pos( 1 )
   = reduction( T3_l[ iS19{( ceilDiv(( i0 * i3 ), 4) )}, iS20{4} ] ca_pos( 1 ) produce_pos( 1 ), op = add, initial value = double(0), allreduce = false )
T4_g[ rS17{( ceilDiv(( i0 * i3 ), 4) )} ] produce_pos( 1 )
   = reduction( T5_l[ iS15{( ceilDiv(( i0 * i3 ), 4) )}rf, rS16{4}rf ] ca_pos( 1 ) produce_pos( 1 ), op = add, initial value = double(0), allreduce = false )
}

unknown file: Failure
C++ exception with description "index_map.find(root_dom[i]) != index_map.end() INTERNAL ASSERT FAILED at "/raid/nmaruyama/debug3/third_party/nvfuser/csrc/index_compute.cpp":2047, please report a bug to PyTorch. Couldn't find root mapping for T2_l[ iS25{( ceilDiv(( T0.size[0] * 1 ), 4) )}, iS26{4} ] ca_pos( 1 ) dim: 0 id: iS30{T0.size[0]}, loops:  rS17{( ceilDiv(( T0.size[0] * T1.size[1] ), 4) )} iS26{4}

naoyam commented 1 year ago

The error occurs when indexing T2 as a consumer:

* thread #1, name = 'nvfuser_tests', stop reason = breakpoint 1.1
  * frame #0: 0x00007fffad555610 libstdc++.so.6`__cxxabiv1::__cxa_throw(obj=0x0000000001091220, tinfo=0x0000000000b9bd90, dest=(libc10.so`c10::Error::~Error() at Exception.h:29))(void *)) at eh_throw.cc:77:1
    frame #1: 0x00007fffcadb5a9a libc10.so`c10::detail::torchCheckFail(func="getNonGlobalConsumerStridedIndices", file="/raid/nmaruyama/debug3/third_party/nvfuser/csrc/index_compute.cpp", line=2047, msg=error: summary string parsing error) at Exception.cpp:86:3
    frame #2: 0x00007fffcadb5ced libc10.so`c10::detail::torchInternalAssertFail(func="getNonGlobalConsumerStridedIndices", file="/raid/nmaruyama/debug3/third_party/nvfuser/csrc/index_compute.cpp", line=2047, condMsg="index_map.find(root_dom[i]) != index_map.end() INTERNAL ASSERT FAILED at \"/raid/nmaruyama/debug3/third_party/nvfuser/csrc/index_compute.cpp\":2047, please report a bug to PyTorch. ", userMsg=error: summary string parsing error) at Exception.cpp:114:3
    frame #3: 0x00007ffff7ad515a libnvfuser_codegen.so`nvfuser::Index::getNonGlobalConsumerStridedIndices(consumer_tv=0x0000000003d02f10, loops=size=2, rotated_loops=0x00007fffffffbeb0, override_index=0x00007fffffffb230) at index_compute.cpp:2038:5
    frame #4: 0x00007ffff7ad66cd libnvfuser_codegen.so`nvfuser::Index::getConsumerStridedIndices(consumer=0x0000000003d02f10, loops=size=2, rotated_loops=0x00007fffffffbeb0, override_index=0x00007fffffffb550, generate_pointer=false) at index_compute.cpp:2236:9