Open jjsjann123 opened 2 years ago
Will get a repro once I've verified the issue on the devel branch.
Currently the issue comes from the microbenchmarks: https://github.com/pytorch/pytorch/issues/75282
autogen-15 fails. TorchScript IR:
with prim::CudaFusionGroup_0 = graph(%10 : Float(768, strides=[1], requires_grad=0, device=cuda:0),
      %13 : Float(512, 768, strides=[768, 1], requires_grad=0, device=cuda:0),
      %11 : int):
  %7 : int[] = prim::Constant[value=[512, 768]]()
  %14 : int[] = prim::Constant[value=[1, 512, 768]]()
  %15 : Float(1, 512, 768, strides=[393216, 768, 1], requires_grad=0, device=cuda:0) = prim::reshape_copy(%13, %14)
  %12 : Float(1, 512, 768, strides=[393216, 768, 1], requires_grad=0, device=cuda:0) = aten::add(%15, %10, %11)
  %5 : Float(1, 512, 768, strides=[393216, 768, 1], requires_grad=0, device=cuda:0) = prim::reshape_copy(%12, %14)
  %2 : Float(512, 768, strides=[768, 1], requires_grad=0, device=cuda:0) = prim::reshape_copy(%12, %7)
  return (%2, %5)
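For reference, a minimal Python sketch of what this fusion group computes (the shapes, the alpha argument, and the two reshape consumers are read off the IR above; the use of torch.jit.script, the warm-up loop, and the CUDA device are assumptions, not a verified standalone repro):

import torch

# Sketch reconstructed from the fusion group above; not a verified repro.
@torch.jit.script
def fused(bias, x, alpha: int):
    y = x.reshape(1, 512, 768)           # prim::reshape_copy(%13, [1, 512, 768])
    z = torch.add(y, bias, alpha=alpha)  # aten::add(%15, %10, %11)
    # two reshape_copy consumers of the same add result, as in the fused graph
    return z.reshape(512, 768), z.reshape(1, 512, 768)

bias = torch.randn(768, device="cuda")
x = torch.randn(512, 768, device="cuda")
for _ in range(3):  # run a few times so the profiling executor can actually fuse
    fused(bias, x, 1)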
Fusion IR and error:
Inputs:
  T0_g[ iS0{i1} ], float
  T1_g[ iS1{i2}, iS2{i3} ], float
  i4, int64_t
Outputs:
  T6_g[ rS15{1}, iS16{i2}, iS17{i3} ], float
  T5_g[ bS12{1}, iS13{i2}, iS14{i3} ], float

%kernel_math {
  T2_l[ bS3{1}, iS4{i2}, iS5{i3} ] = broadcast( T1_g[ iS1{i2}, iS2{i3} ] )
  T3_l[ bS6{1}, bS7{1}, iS8{i1} ] = broadcast( T0_g[ iS0{i1} ] )
  d6 = (double)(i4);
  T4_l[ bS9{1}, bS10{1}, iS11{i1} ] = T3_l[ bS6{1}, bS7{1}, iS8{i1} ] * d6;
  T5_g[ bS12{1}, iS13{i2}, iS14{i3} ] = T2_l[ bS3{1}, iS4{i2}, iS5{i3} ] + T4_l[ bS9{1}, bS10{1}, iS11{i1} ];
  T6_g[ rS15{1}, iS16{i2}, iS17{i3} ] = reduction( T5_g[ bS12{1}, iS13{i2}, iS14{i3} ], op = add, initial value = double(0), fused = 0 )
}

Traceback (most recent call last):
  File "run_microbenchmarks.py", line 24, in <module>
    run()
  File "run_microbenchmarks.py", line 19, in run
    microbenchmark.run(bm_args)
  File "/raid/playground/nick/benchmark/torchbenchmark/microbenchmarks/nvfuser/__init__.py", line 150, in run
    run_nvfuser_microbenchmarks(extra_args=args)
  File "/raid/playground/nick/benchmark/torchbenchmark/microbenchmarks/nvfuser/__init__.py", line 146, in run_nvfuser_microbenchmarks
    outputs.append((fuser, b.run_test(inputs, fuser)))
  File "/raid/playground/nick/benchmark/torchbenchmark/microbenchmarks/nvfuser/__init__.py", line 125, in run_test
    return run_test(self.ir, inputs, warmup_runs=self.warmup_runs, test_runs=self.test_runs)
  File "/raid/playground/nick/benchmark/torchbenchmark/microbenchmarks/nvfuser/__init__.py", line 87, in run_test
    graph(*inputs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: rhs_i >= 0 && lhs_i >= 0 INTERNAL ASSERT FAILED at "/raid/pytorch/torch/csrc/jit/codegen/cuda/scheduler/pointwise.cpp":668, please report a bug to PyTorch.
Repro'd this on the master branch with the microbenchmark in David's repo. I'll extract a repro on our devel branch and raise this issue with the team.
So this is a reshape issue, which falls under our view issues. I'm disabling reshape fusion on upstream for now to avoid the codegen error: https://github.com/pytorch/pytorch/pull/75539
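Until that lands, a possible user-side workaround is to fall back to the non-fused path; a short sketch assuming the PyTorch 1.x TorchScript API (this is the generic nvFuser toggle, not anything specific to this issue):

import torch

# Turn nvFuser off for scripted/traced code so the failing fusion group is never generated.
torch._C._jit_set_nvfuser_enabled(False)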