csarofeen / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

Debugging helper: assert when accessing global tensor out of bound #2493

Closed zasdfgbnm closed 1 year ago

zasdfgbnm commented 1 year ago

This PR adds a new debugging option, PYTORCH_NVFUSER_DUMP="assert_memory_violation". When enabled, all global memory accesses are checked, and any out-of-bound access triggers a device-side assert failure. I also intended to enable this globally in our unit tests, but unfortunately, some tests fail.... The first failing test is NVFuserTest.FusionReduction2_CUDA. I don't have time to investigate the unit test failures, but adding this infrastructure does help with debugging.

naoyam commented 1 year ago

Are vectorized loads and stores as well as cp.async out of scope?

Is this better than just running the tests with compute-sanitizer?

Enabling this on CI would be great. One concern is that the asserts can significantly increase register usage, so the scheduling heuristics wouldn't be the same; e.g., the tests would be less likely to exercise the persistent scheduler. The added boundary-check instructions could also perturb register usage. The sanitizer shouldn't have these concerns.

naoyam commented 1 year ago

One concern is that the asserts can significantly increase register usage, so the scheduling heuristics wouldn't be the same; e.g., the tests would be less likely to exercise the persistent scheduler.

Sorry, this shouldn't be the case, as the scheduler doesn't account for the increase in register usage. It should just result in lower performance (potentially with a lot of register spilling).

zasdfgbnm commented 1 year ago

I don't think compute-sanitizer will catch all bugs: because of the CUDACachingAllocator, an out-of-bound access could still fall into an allocated memory region from CUDA's perspective.

zasdfgbnm commented 1 year ago

One concern is that the asserts can significantly increase register usage, so the scheduling heuristics wouldn't be the same; e.g., the tests would be less likely to exercise the persistent scheduler.

Sorry, this shouldn't be the case, as the scheduler doesn't account for the increase in register usage. It should just result in lower performance (potentially with a lot of register spilling).

Yeah, and it could be inconvenient if we dump the effective bandwidth in a unit test to check performance. If we forget that the asserts are enabled, we would waste development time debugging why it is so slow.

naoyam commented 1 year ago

I don't think compute-sanitizer will catch all bugs: because of the CUDACachingAllocator, an out-of-bound access could still fall into an allocated memory region from CUDA's perspective.

Doesn't PYTORCH_NO_CUDA_MEMORY_CACHING disable caching and allocate the exact size? I use it pretty frequently to run the sanitizer with the C++ tests.

naoyam commented 1 year ago

Not that compute-sanitizer is perfect (it doesn't work well for shared memory and registers), but for gmem, I thought it should work just fine.

zasdfgbnm commented 1 year ago

PYTORCH_NO_CUDA_MEMORY_CACHING

Oh, I didn't know that. Indeed, I requested this feature a looooong time ago, but a dev at that time (I don't remember who; no blame intended) rejected my proposal.

zasdfgbnm commented 1 year ago

But even with compute-sanitizer, an out-of-bound access from one tensor can still land inside another tensor, and compute-sanitizer cannot detect that.

naoyam commented 1 year ago

But even with compute-sanitizer, an out-of-bound access from one tensor can still land inside another tensor, and compute-sanitizer cannot detect that.

Yeah, there's a --padding option, so that can be used to reduce the number of missed detections, but not eliminate them completely.
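Putting the two suggestions together, a sanitizer run over the C++ tests could look roughly like this (the test binary name and padding size are hypothetical; --padding inserts guard bytes between allocations so small overruns no longer land inside a neighboring allocation):

```shell
# Disable the caching allocator so each tensor gets its own allocation,
# then run memcheck with padding between allocations to catch
# out-of-bound accesses that would otherwise hit an adjacent buffer.
PYTORCH_NO_CUDA_MEMORY_CACHING=1 \
  compute-sanitizer --tool memcheck --padding 32 ./nvfuser_tests
```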

naoyam commented 1 year ago

Are vectorized loads and stores as well as cp.async out of scope?

Are you going to support them?

zasdfgbnm commented 1 year ago

Are vectorized loads and stores as well as cp.async out of scope?

Are you going to support them?

Is it easy to support them? https://github.com/csarofeen/pytorch/blob/167718b6d06558395f86b6d25a68352168b86da2/third_party/nvfuser/runtime/array.cu#L155-L157 Those functions take raw pointers as arguments, and I don't see an easy way to pass in boundaries.

It is doable, but I consider this PR a lightweight feature that adds more checks to our unit tests with a few lines of code. A more feature-complete boundary check would also be fine, but it is out of the scope of this PR.

naoyam commented 1 year ago

Are vectorized loads and stores as well as cp.async out of scope?

Are you going to support them?

Is it easy to support them?

https://github.com/csarofeen/pytorch/blob/167718b6d06558395f86b6d25a68352168b86da2/third_party/nvfuser/runtime/array.cu#L155-L157

Those functions take raw pointers as arguments, and I don't see an easy way to pass in boundaries. It is doable, but I consider this PR a lightweight feature that adds more checks to our unit tests with a few lines of code. A more feature-complete boundary check would also be fine, but it is out of the scope of this PR.

I don't know whether they are easy to support, but given how hard we try to vectorize gmem accesses, not supporting them seems like a major limitation. Running the tests with this switch on is definitely better than no error checking, but we should keep the limitation in mind.

zasdfgbnm commented 1 year ago

I don't know whether they are easy to support, but given how hard we try to vectorize gmem accesses, not supporting them seems like a major limitation. Running the tests with this switch on is definitely better than no error checking, but we should keep the limitation in mind.

Agreed. With this limitation in mind, I think this feature is mostly useful for test cases, which contain many artificial fusions not scheduled for performance. It would be less useful for debugging a real-world example. Thanks for spelling this out; I had never considered it this deeply.