zasdfgbnm closed this 1 year ago.
I noticed a performance drop from 860 GB/s to 760 GB/s on A100-80G for the case NvFuserScheduler_BatchNorm_fp32/512/32/64 after this commit.
This commit fixed a memory violation, so if the benchmark performance is affected, the previously generated code may have contained a memory violation. Could you run the benchmark with and without this commit and check whether there is any memory violation? For global and shared memory, compute-sanitizer generally works, but it usually says nothing about registers.
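One way to do the with/without comparison is to save each run with Google Benchmark's JSON output (`--benchmark_out=<file> --benchmark_out_format=json`) and diff the throughput. This is a minimal sketch that assumes the benchmark binary uses Google Benchmark and reports throughput in the `bytes_per_second` field. The script and file names are hypothetical.

```python
import json
import sys


def throughput_by_name(report: dict) -> dict:
    """Map benchmark name -> bytes_per_second from a Google Benchmark JSON report.

    Assumption: throughput is reported via the `bytes_per_second` field.
    """
    return {
        b["name"]: b["bytes_per_second"]
        for b in report.get("benchmarks", [])
        if "bytes_per_second" in b
    }


def compare(before: dict, after: dict) -> dict:
    """Relative throughput change (after vs. before) for benchmarks present in both runs."""
    old = throughput_by_name(before)
    new = throughput_by_name(after)
    return {name: (new[name] - old[name]) / old[name] for name in old.keys() & new.keys()}


def load(path: str) -> dict:
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python compare_bench.py before.json after.json
    for name, delta in sorted(compare(load(sys.argv[1]), load(sys.argv[2])).items()):
        print(f"{name}: {delta:+.1%}")
```

With the numbers reported above (860 GB/s before, 760 GB/s after), this prints a change of -11.6% for that case.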
I found this issue by manually reading a kernel while working on the loop rotation pass. I don't know how to write a test for it. I tried with
which is similar to the test I was looking at for the loop rotation PR, but this doesn't reproduce the failure.