facebookincubator / AITemplate

AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code. It is specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
Apache License 2.0

Calling kernels outside of AITemplate Wrappers #85

Closed: mvpatel2000 closed this issue 2 years ago

mvpatel2000 commented 2 years ago

I'm interested in benchmarking some of the CUTLASS code against various custom Triton kernels I've written. I'm trying to invoke functions from the generated CUDA kernels directly, but I'm hitting strange CUDA errors about illegal memory accesses. I assume there's some obvious data-preparation step I'm missing before calling through ctypes... I'd love some pointers on whether there's something special about the generated kernels that I'm missing.

Benchmark script (sprayed with contiguous, cuda, and half calls to be safe until I get it working):

import ctypes
import copy

import torch

import triton

group_norm_cu = ctypes.cdll.LoadLibrary("./group_norm_test.so")

def group_norm(x, G, weight, bias, eps):
    N, C, H, W = x.shape

    x = copy.deepcopy(x).permute(0, 2, 3, 1).contiguous().cuda().half() # Move channels last

    out = torch.empty_like(x).cuda().half()
    weight, bias = weight.contiguous().cuda().half(), bias.contiguous().cuda().half()
    max_smem_size = 0 # Force fallback to Welford group norm
    stream_ptr = ctypes.c_void_p()
    group_norm_cu.groupnorm_211(out.data_ptr(), x.data_ptr(), weight.data_ptr(), bias.data_ptr(), N, ctypes.c_float(eps), max_smem_size, stream_ptr)

    out = out.permute(0, 3, 1, 2).contiguous() # Undo channels last

    return out

def test_group_norm(x_shape, num_groups, dtype, eps=1e-5, device='cuda'):
    torch.manual_seed(0)
    # create data
    C = x_shape[1]
    w_shape = (C, )
    weight = torch.rand(w_shape, dtype=dtype, device=device)
    bias = torch.rand(w_shape, dtype=dtype, device=device)
    x = -2.3 + 0.5 * torch.randn(x_shape, dtype=dtype, device=device)
    # forward pass
    y_ref = torch.nn.functional.group_norm(x, num_groups, weight, bias, eps).to(dtype)
    y_tri = group_norm(x, num_groups, weight, bias, eps)
    print(y_tri)
    # compare
    # triton.testing.assert_almost_equal(y_tri, y_ref)

test_group_norm((4, 1024, 14, 14), 32, torch.float16)

The CUTLASS file is the one generated by the unit test for groupnorm. The only diff is adding extern "C" so ctypes can look up the symbol.

...

extern "C" {

cudaError_t groupnorm_211(half* output,
                          half* input,
                          half* gamma,
                          half* beta,
                          int N,
                          const float eps,
                          const int max_smem_size,
                          cudaStream_t stream)

{
    return invokeGroupNorm<false, 14, 14, 1024, 32>(
            output,
            input,
            gamma,
            beta,
            N,
            eps,
            max_smem_size,
            stream);
}

}
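
A quick sanity check that the extern "C" export worked, i.e. that the symbol resolves unmangled (just a sketch, reusing the library path from the script above):

import ctypes

lib = ctypes.cdll.LoadLibrary("./group_norm_test.so")
# Attribute lookup raises AttributeError if the symbol is missing or mangled.
print(hasattr(lib, "groupnorm_211"))  # expect True
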
mikeiovine commented 2 years ago

ctypes treats arguments as C ints by default. I suspect what's happening is that the 64-bit addresses from tensor.data_ptr() are overflowing the 32-bit integer, causing the kernel to get invalid addresses.

I would try passing your pointers like this:

group_norm_cu.groupnorm_211(
    ctypes.c_void_p(out.data_ptr()), 
    ctypes.c_void_p(x.data_ptr()), 
    ctypes.c_void_p(weight.data_ptr()), 
    ctypes.c_void_p(bias.data_ptr()), 
    N, 
    ctypes.c_float(eps), 
    max_smem_size, 
    stream_ptr  # already a ctypes.c_void_p in the script above
)
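
If you'd rather not wrap at every call site, you can also declare the signature once with argtypes/restype and let ctypes convert and type-check each call. A minimal sketch (the C++-to-ctypes mapping below follows the wrapper signature posted above, and reuses the tensor names from the benchmark script; this is standard ctypes, nothing AITemplate-specific):

import ctypes

group_norm_cu = ctypes.cdll.LoadLibrary("./group_norm_test.so")

# half* and cudaStream_t are opaque pointers/handles -> c_void_p;
# int -> c_int, const float -> c_float; cudaError_t is an enum -> c_int.
group_norm_cu.groupnorm_211.argtypes = [
    ctypes.c_void_p,  # output
    ctypes.c_void_p,  # input
    ctypes.c_void_p,  # gamma
    ctypes.c_void_p,  # beta
    ctypes.c_int,     # N
    ctypes.c_float,   # eps
    ctypes.c_int,     # max_smem_size
    ctypes.c_void_p,  # stream (None == default stream)
]
group_norm_cu.groupnorm_211.restype = ctypes.c_int  # cudaError_t; 0 == cudaSuccess

err = group_norm_cu.groupnorm_211(
    out.data_ptr(), x.data_ptr(), weight.data_ptr(), bias.data_ptr(),
    N, eps, 0, None)
assert err == 0, f"groupnorm_211 returned cudaError_t {err}"

With argtypes declared, the raw ints from data_ptr() are widened to full 64-bit pointers automatically, and passing a value of the wrong type raises a ctypes.ArgumentError instead of silently corrupting memory.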
antinucleon commented 2 years ago

ctypes is tricky; you need to check that every dtype matches between C++ and Python. A quicker way might be to put your kernel into this template: https://github.com/facebookincubator/AITemplate/tree/main/examples/06_how_to_add_an_op

mvpatel2000 commented 2 years ago

@mikeiovine that did the trick, thanks!

@antinucleon thank you for the reference. I'll take a look and try to move future kernels I test into that framework instead; it looks much safer. Tracking down weird CUDA errors is hard...