Closed zasdfgbnm closed 1 year ago
Are vectorized loads and stores as well as cp.async out of scope?
Is this better than just running the tests with compute-sanitizer?
Enabling this on CI would be great. One concern is that assert can significantly increase register usage, so the scheduling heuristics wouldn't be the same; e.g., the tests would be less likely to exercise the persistent scheduler. The added boundary-check instructions could also perturb register usage. The sanitizer shouldn't have these concerns.
> One concern is assert can significantly increase register usage, so the scheduling heuristics wouldn't be the same, e.g., the tests would be less likely to exercise the persistent scheduler.

Sorry, this shouldn't be the case, as the scheduler doesn't account for the increase in register usage. It should just result in lower performance (potentially with lots of spilling).
I don't think compute-sanitizer will catch all bugs: due to the CUDACachingAllocator, an out-of-bound access could still fall into an allocated memory region from CUDA's perspective.
> Sorry, this shouldn't be the case, as the scheduler doesn't account for the increase in register usage. It should just result in lower performance (potentially with lots of spilling).

Yeah, and it could be inconvenient if we dump effective bandwidth in a unit test for checking perf. And if we forget that this option is enabled, we would waste development time debugging why it is so slow.
> I don't think compute-sanitizer will catch all bugs: due to the CUDACachingAllocator, an out-of-bound access could still fall into an allocated memory region from CUDA's perspective.
Doesn't PYTORCH_NO_CUDA_MEMORY_CACHING disable caching and allocate the exact size? I'm using it pretty frequently to run the sanitizer with the C++ tests.

Not that compute-sanitizer is perfect (it doesn't work well for shared memory and registers), but for gmem, I thought it should work just fine.
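For reference, the workflow described here would look something like the following. The test binary path and gtest filter are placeholders; adjust them to your build. The environment variable makes every tensor get its own exact-size cudaMalloc, so out-of-bound accesses cross a real allocation boundary and become visible to the sanitizer.

```shell
# Disable the CUDACachingAllocator, then run the C++ tests under
# compute-sanitizer's memcheck tool (binary name is a placeholder).
PYTORCH_NO_CUDA_MEMORY_CACHING=1 \
  compute-sanitizer --tool memcheck \
  ./build/bin/nvfuser_tests --gtest_filter='NVFuserTest.*'
```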
> PYTORCH_NO_CUDA_MEMORY_CACHING

Oh, I didn't know that. Indeed, I requested this feature a looooong time ago, but some dev at the time (I don't remember who, and that's not a problem; I'm not blaming anyone) rejected my proposal.

But even with the compute sanitizer, an out-of-bound access of one tensor can still fall into another tensor, which is not detectable by compute-sanitizer.
> But even with the compute sanitizer, an out-of-bound access of one tensor can still fall into another tensor, which is not detectable by compute-sanitizer.

Yeah, there's a --padding option, so that can be used to reduce false negatives, but not completely.
> Are vectorized loads and stores as well as cp.async out of scope?

Are you going to support them?
> Are you going to support them?
Is it easy to support them? https://github.com/csarofeen/pytorch/blob/167718b6d06558395f86b6d25a68352168b86da2/third_party/nvfuser/runtime/array.cu#L155-L157

They take pointers as arguments, and I don't see any easy way to pass in boundaries. It is doable, but I consider this PR a lightweight feature that uses a few lines of code to add more checks to our unit tests. If we want a more feature-complete boundary check, that is also OK, but it is out of the scope of this PR.
> It is doable, but I consider this PR a lightweight feature that uses a few lines of code to add more checks to our unit tests. If we want a more feature-complete boundary check, it is also OK, but it is out of the scope of this PR.
I don't know if they are easy to support, but given that we try hard to vectorize gmem accesses, not supporting them seems like a major limitation. Running tests with this switch on is definitely better than no error checking, but we should keep the limitation in mind.
> not supporting them seems like a major limitation

Agreed. With this limitation in mind, I think this feature is mostly useful for test cases, which contain many artificial fusions not scheduled for performance. It would be less useful for debugging a real-world example. Thanks for spelling this out; I had never considered it this deeply.
This PR adds a new debugging option, PYTORCH_NVFUSER_DUMP="assert_memory_violation". When enabled, all global memory accesses are checked, and an out-of-bound access triggers a device-side assert failure. I also intend to enable this globally in our unit tests, but unfortunately, it does fail.... The first failing test is NVFuserTest.FusionReduction2_CUDA. I don't have time to investigate the failures in the unit tests, but adding this infrastructure does help debugging.