It looks like we are not generating correct predicates when unswitching double-buffered loops. The code generated for DoubleBufferingTest.DoubleBuffer5 seems incorrect. Here's a simplified repro based on that test:
TEST_F(DoubleBufferingTest, UnswitchRepro) {
  Fusion fusion;
  FusionGuard fg(&fusion);

  auto tv0 = makeContigTensor(1);
  fusion.addInput(tv0);

  auto tv1 = set(tv0);
  auto tv2 = add(tv1, IrBuilder::create<Val>(1.0));
  fusion.addOutput(tv2);

  tv2->split(-1, 8);
  tv2->split(0, 1, false);

  TransformPropagatorWithCheck propagator(tv2);
  MaxRootDomainInfoSpanningTree(tv2).traverse(&propagator);

  tv1->inlineAt(2);

  tv2->axis(0)->parallelize(ParallelType::Unswitch);
  scheduler_utils::parallelizeAllLike(tv2);

  tv1->doubleBuffer();

  auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
  auto t0 = at::randn({16}, options);

  FusionExecutor fe;
  fe.compileFusion(&fusion, {t0});
  auto cg_outputs = fe.runFusion({t0});

  auto ref = t0 + 1;

  testValidate(&fusion, cg_outputs, {t0}, {ref}, __LINE__, __FILE__);
}
Here's the unswitched branch of the generated code:
The test doesn't fail, but the read from T0 can overrun the buffer. Indeed, running this test with compute-sanitizer does fail at that point.

I haven't checked if this also applies to circular buffering, but I suspect that's highly likely.
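To make the suspected overrun concrete, here is a minimal sketch of the index arithmetic. It is a simplified model, not nvFuser code and not the generated kernel. It assumes the main loop of a double-buffered tensor also loads the tile for the next iteration, so the unswitch predicate would need to cover that prefetch offset as well. The names `kTile`, `maxReadIndex` and `unswitchPredicate` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the loop structure in the repro: a 1D tensor of
// size n is split by 8, giving an outer loop of ceilDiv(n, 8) iterations
// and an inner loop of 8 elements.
constexpr int64_t kTile = 8;

constexpr int64_t ceilDiv(int64_t a, int64_t b) {
  return (a + b - 1) / b;
}

// Largest element index read by the main loop, assuming iteration i also
// loads the tile for iteration i + prefetch (prefetch = 1 models double
// buffering, prefetch = 0 models a loop without prefetching).
constexpr int64_t maxReadIndex(int64_t n, int64_t prefetch) {
  const int64_t outer = ceilDiv(n, kTile);
  return (outer - 1 + prefetch) * kTile + (kTile - 1);
}

// An unswitch predicate guards the unpredicated branch, so it is only
// safe if the largest index touched anywhere in that branch is in bounds.
constexpr bool unswitchPredicate(int64_t n, int64_t prefetch) {
  return maxReadIndex(n, prefetch) < n;
}

// With n = 16, a predicate that ignores the prefetch passes (15 < 16) ...
static_assert(unswitchPredicate(16, 0), "main-loop indices are in bounds");
// ... while one that accounts for the prefetched tile fails (23 >= 16).
static_assert(!unswitchPredicate(16, 1), "prefetch reads past the end");
```

Under these assumptions, a predicate derived only from the compute indices would select the unswitched branch for the 16-element input even though the prefetch would touch indices 16 through 23, which would be consistent with the test passing numerically while compute-sanitizer reports an out-of-bounds read.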