cupy.cuda.driver.CUDADriverError: CUDA_ERROR_INVALID_PTX: a PTX JIT compilation failed #124

sandeepnmenon closed this issue 3 years ago

sandeepnmenon commented 3 years ago

Issue: cupy.cuda.driver.CUDADriverError: CUDA_ERROR_INVALID_PTX: a PTX JIT compilation failed

I installed cupy from the conda-forge channel with: conda install -c conda-forge cupy

The error occurs in the Module.load function:

line 22, in get_kernel_func
  File "cupy/cuda/function.pyx", line 241, in cupy.cuda.function.Module.load
  File "cupy/cuda/function.pyx", line 243, in cupy.cuda.function.Module.load
  File "cupy_backends/cuda/api/driver.pyx", line 246, in cupy_backends.cuda.api.driver.moduleLoadData
  File "cupy_backends/cuda/api/driver.pyx", line 124, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_INVALID_PTX: a PTX JIT compilation failed

Reproducible script

import torch
import cupy.cuda
from pynvrtc.compiler import Program
from collections import namedtuple
import numpy as np

modules = {}

def get_kernel_func(kname, ksrc, dtype):
    if kname+dtype not in modules:
        ksrc = ksrc.replace('DTYPE', dtype)
        #prog = Program(ksrc.encode('utf-8'), (kname+dtype+'.cu').encode('utf-8'))
        #uncomment the line above and comment the line below if it causes the following error: AttributeError: 'Program' object has no attribute '_program'
        prog = Program(ksrc, kname+dtype+'.cu')        
        ptx = prog.compile()
        log = prog._interface.nvrtcGetProgramLog(prog._program)
        if len(log.strip()) > 0: print(log)
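        module = cupy.cuda.function.Module()
        module.load(bytes(ptx.encode()))  # load the compiled PTX; this is the call that raises CUDA_ERROR_INVALID_PTX
        modules[kname+dtype] = module
    else:
        module = modules[kname+dtype]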

    Stream = namedtuple('Stream', ['ptr'])
    s = Stream(ptr=torch.cuda.current_stream().cuda_stream)        

    return module.get_function(kname), s

def conv_aggregate_fw_kernel_v2(**kwargs):
    kernel = r'''
extern "C"
__global__ void conv_aggregate_fw_kernel_v2(DTYPE* dest, const DTYPE* src, const long long* lengths, const long long* cslengths, int width, int N, int dest_stridex, int src_stridex, int blockDimy) {

    int x = blockIdx.x * blockDim.x + threadIdx.x; //one thread per feature channel, runs over all nodes
    if (x >= width) return;

    int i = blockIdx.y * blockDimy;
    int imax = min(N, i + blockDimy);
    dest += dest_stridex * i + x;
    src += src_stridex * (cslengths[i] - lengths[i]) + x;

    for (; i<imax; ++i) {
        int len = lengths[i];
        if (len > 0) {
            DTYPE sum = 0;
            for (int j=0; j<len; j++, src += src_stridex) {
                sum += *src;
            }
            *dest = sum / len;
        }
        else {
            *dest = 0;
        }
        dest += dest_stridex;
    }
}
'''
    return kernel

def get_dtype(t):
    if isinstance(t, torch.cuda.FloatTensor):
        return 'float'
    elif isinstance(t, torch.cuda.DoubleTensor):
        return 'double'

starte = 0
nume = 10  # assumed value: nume is not defined in the original post (must be <= len(idxn))
idxn = torch.from_numpy(np.random.permutation(10))
input = torch.from_numpy(np.random.permutation(10))
src = torch.index_select(input, 0, idxn.narrow(0,starte,nume)).type(torch.cuda.FloatTensor)

function, stream = get_kernel_func('conv_aggregate_fw_kernel_v2', conv_aggregate_fw_kernel_v2(), get_dtype(src))

leofang commented 3 years ago

Try downgrading your cudatoolkit to 11.0. I think your driver version doesn't match the cudatoolkit version.
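
A quick way to check for this kind of mismatch is to compare the CUDA version the driver supports with the NVRTC version that comes from cudatoolkit. The sketch below is only an illustration: the helper names (decode_cuda_version, driver_supports, report) are made up for this example, and it assumes the NVRTC library that CuPy loads is the same one pynvrtc picks up, which may not hold in every environment.

```python
def decode_cuda_version(v):
    """Split CUDA's integer version (1000*major + 10*minor) into (major, minor)."""
    return v // 1000, (v % 1000) // 10


def driver_supports(driver_version, toolkit_version):
    """True if the driver's CUDA version is at least the toolkit's.

    Both arguments are (major, minor) tuples.
    """
    return tuple(driver_version) >= tuple(toolkit_version)


def report():
    # Imported lazily so the helpers above can be used without a GPU.
    import cupy

    driver = decode_cuda_version(cupy.cuda.runtime.driverGetVersion())
    nvrtc = cupy.cuda.nvrtc.getVersion()  # (major, minor) of the NVRTC library CuPy loaded
    print('driver supports up to CUDA %d.%d' % driver)
    print('NVRTC (from cudatoolkit) is %d.%d' % nvrtc)
    if not driver_supports(driver, nvrtc):
        print('toolkit is newer than the driver; PTX JIT may fail with CUDA_ERROR_INVALID_PTX')
```

Calling report() in the affected environment prints both versions. If the toolkit turns out to be newer than the driver, either installing an older cudatoolkit or updating the driver should bring them back in line.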

leofang commented 3 years ago

btw as an aside: your get_kernel_func looks a bit nasty 😄 Maybe you would like to consider using cupy.RawModule which does the same thing for you (but is a lot cleaner)?

(cupy.cuda.function.Module() is internal API and we don't guarantee it's stable across versions.)

sandeepnmenon commented 3 years ago

@leofang Thank you. Matching the toolkit version with the driver worked. Also, thank you for the suggestion. I tried it out and it works the same. Will make the change.

leofang commented 3 years ago

Glad to know, @sandeepnmenon!