Closed mvpatel2000 closed 2 years ago
ctypes treats arguments as ints by default. I suspect what's happening is that your 64-bit addresses from tensor.data_ptr() are overflowing the 32-bit integer, causing the kernel to get invalid addresses.
I would try passing your pointers like this:
group_norm_cu.groupnorm_211(
    ctypes.c_void_p(out.data_ptr()),
    ctypes.c_void_p(x.data_ptr()),
    ctypes.c_void_p(weight.data_ptr()),
    ctypes.c_void_p(bias.data_ptr()),
    N,
    ctypes.c_float(eps),
    max_smem_size,
    ctypes.c_void_p(stream_ptr)
)
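The truncation is easy to reproduce without any CUDA code. Per the ctypes documentation, a plain Python int passed to a foreign function is masked to fit the platform's C int, so the upper bits of a 64-bit pointer are silently dropped, while c_void_p preserves the full width. A minimal sketch, using libc's abs() as a stand-in for the kernel launcher (the fake_ptr value is made up for illustration):

```python
import ctypes

# Load the C runtime; abs() takes a 32-bit C int, so it stands in for any
# entry point where ctypes applies its default int conversion (Linux/macOS).
libc = ctypes.CDLL(None)

fake_ptr = (1 << 32) + 5  # a 64-bit value, like tensor.data_ptr() returns

# Without argtypes, ctypes masks the Python int down to a C int:
# the high 32 bits vanish and abs() only ever sees 5.
print(libc.abs(fake_ptr))  # prints 5

# c_void_p keeps all 64 bits intact, which is what the kernel needs.
print(ctypes.c_void_p(fake_ptr).value)  # prints 4294967301
```

The same masking happens for every pointer argument in the call above, which is consistent with the illegal-memory-access errors: the kernel receives a truncated, invalid device address rather than Python raising an error.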
ctypes is tricky; you need to check that every dtype matches between C++ and Python. A quicker way might be to put your kernel into this template: https://github.com/facebookincubator/AITemplate/tree/main/examples/06_how_to_add_an_op
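One way to make ctypes enforce that match is to declare argtypes and restype once on each entry point; a mismatched argument then raises ctypes.ArgumentError at the Python boundary instead of corrupting device memory. A sketch of the pattern, using libc's strlen as a stand-in since the generated groupnorm .so isn't available here:

```python
import ctypes

# Stand-in for ctypes.CDLL("path/to/generated_kernel.so")
libc = ctypes.CDLL(None)

strlen = libc.strlen
strlen.argtypes = [ctypes.c_char_p]  # declare once, next to the load site
strlen.restype = ctypes.c_size_t

print(strlen(b"cutlass"))  # prints 7

# A wrong argument type now fails loudly instead of silently misbehaving:
try:
    strlen(12345)
except ctypes.ArgumentError as exc:
    print("rejected:", exc)
```

For the generated kernel, the equivalent declaration would list c_void_p for each pointer, c_int for N, c_float for eps, and so on, mirroring the C signature.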
@mikeiovine that did the trick, thanks!
@antinucleon thank you for the reference. I will take a look and try to move future kernels I test to that framework instead; it looks much safer. Tracking down weird CUDA errors is hard...
I'm interested in benchmarking some of the cutlass code against various custom triton kernels I've written. I'm trying to directly invoke functions from the generated cuda kernels, but I'm hitting some strange CUDA issues with illegal memory accesses. I assume I have some obvious data preparation step I'm missing before calling through ctypes... would love some pointers on whether there's something special about the generated kernels I'm missing.

Benchmark script (sprayed with contiguous, cuda, and half calls to be safe until I get it working):

The cutlass file is one generated by the unit test for groupnorm. The only diff is adding extern "C" for ctypes.