csarofeen / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

Fix canOmitStopPredicate #2504

Closed zasdfgbnm closed 1 year ago

zasdfgbnm commented 1 year ago

I found this issue by manually reading a generated kernel while working on the loop rotation pass. I don't know how to test this. I tried with

TEST_F(NVFuserTest, FusionPredicateSize1Loop_CUDA) {
  Fusion fusion;
  FusionGuard fg(&fusion);

  auto tv0 = makeConcreteTensor({3});
  fusion.addInput(tv0);
  auto tv1 = set(tv0);
  auto tv2 = set(tv1);
  auto tv3 = set(tv2);
  auto tv4 = set(tv3);
  fusion.addOutput(tv4);

  for (auto tv : {tv0, tv1, tv2, tv3, tv4}) {
    tv->split(0, 5);
  }

  inlineAllAt(tv4, 1);

  auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
  auto t0 = at::randn({3}, options);

  FusionExecutor fe;
  fe.compileFusion(&fusion, {t0});
  auto cg_outputs = fe.runFusion({t0});

  testValidate(&fusion, cg_outputs, {t0}, {t0}, __LINE__, __FILE__);
}

which is similar to the test I was looking at for the loop rotation PR, but this doesn't reproduce the failure.
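
For context, the index arithmetic in the test above can be sketched in plain C++. This is only an illustration, not nvFuser code: `ceilDiv`, `SplitStats`, and `countAccesses` are hypothetical names, and the thread does not say that this is the exact condition `canOmitStopPredicate` got wrong. The sketch assumes that splitting an extent of 3 by a factor of 5 yields an outer loop of extent `ceilDiv(3, 5)` and an inner loop of extent 5, so some inner iterations fall past the end of the tensor unless a stop predicate guards them.

```cpp
// Hypothetical helper (not part of nvFuser): ceiling division,
// used here to compute the outer extent produced by a split.
int ceilDiv(int a, int b) {
  return (a + b - 1) / b;
}

// Summary of one pass over the split loop nest.
struct SplitStats {
  int outer_extent;   // number of outer-loop iterations
  int in_bounds;      // iterations whose flat index is < extent
  int out_of_bounds;  // iterations that would access past the end
};

// Walks the loop nest that results from splitting `extent` by `factor`
// and classifies every iteration with the stop predicate `idx < extent`.
SplitStats countAccesses(int extent, int factor) {
  SplitStats stats{ceilDiv(extent, factor), 0, 0};
  for (int outer = 0; outer < stats.outer_extent; ++outer) {
    for (int inner = 0; inner < factor; ++inner) {
      const int idx = outer * factor + inner;
      if (idx < extent) {
        ++stats.in_bounds;      // safe to read or write
      } else {
        ++stats.out_of_bounds;  // must be masked by the stop predicate
      }
    }
  }
  return stats;
}
```

With the sizes used in the test, `countAccesses(3, 5)` gives an outer extent of 1, 3 in-bounds iterations, and 2 iterations that the stop predicate has to mask. Omitting the predicate in such a loop would let those 2 iterations touch memory outside the tensor.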

liqiangxl commented 1 year ago

I noticed a performance drop from 860 GB/s to 760 GB/s on A100-80G for case NvFuserScheduler_BatchNorm_fp32/512/32/64 after this commit.

naoyam commented 1 year ago

This was a bug fix for a memory violation, so if the benchmark performance is affected, it could mean the previously generated code had a memory violation. Could you run the benchmark with and without this commit and check whether there is any memory violation? For global and shared memory, compute-sanitizer generally works, but it usually doesn't report anything about registers.