Rust-GPU / Rust-CUDA

Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.
Apache License 2.0

Error: a PTX JIT compilation failed #34

Closed thedodd closed 2 years ago

thedodd commented 2 years ago

Platform: Jetson Nano 2Gi
Arch: aarch64/arm64
OS: Linux Ubuntu 18.04 / Tegra

# Same output with -sass, -elf, and pretty much any of the other opts/flags for cuobjdump.
cuobjdump -ptx `which cns-rt`
cuobjdump info    : File '/usr/local/bin/cns-rt' does not contain device code
cuda-memcheck --report-api-errors all cns-rt
========= CUDA-MEMCHECK
========= Program hit CUDA_ERROR_INVALID_PTX (error 218) due to "a PTX JIT compilation failed" on CUDA API call to cuModuleLoadData.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 (cuModuleLoadData + 0x114) [0x1d449c]
=========     Host Frame:cns-rt [0x90dc]
=========     Host Frame:cns-rt [0x8d50]
=========     Host Frame:cns-rt [0x7e38]
=========     Host Frame:cns-rt [0x8e04]
=========     Host Frame:cns-rt [0x8e1c]
=========     Host Frame:cns-rt [0x1ea98]
=========     Host Frame:cns-rt [0x82ec]
=========     Host Frame:/lib/aarch64-linux-gnu/libc.so.6 (__libc_start_main + 0xe0) [0x20720]
=========     Host Frame:cns-rt [0x7afc]
=========
Error: "a PTX JIT compilation failed"
========= ERROR SUMMARY: 1 error

EDIT (added the PTX):

//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-30521435
// Cuda compilation tools, release 11.4, V11.4.152
// Based on NVVM 7.0.1
//

.version 7.4
.target sm_61
.address_size 64

    // .globl   add

.visible .entry add(
    .param .u64 add_param_0,
    .param .u64 add_param_1,
    .param .u64 add_param_2,
    .param .u64 add_param_3,
    .param .u64 add_param_4
)
{
    .reg .pred  %p<3>;
    .reg .f32   %f<4>;
    .reg .b32   %r<5>;
    .reg .b64   %rd<14>;

    ld.param.u64    %rd2, [add_param_0];
    ld.param.u64    %rd6, [add_param_1];
    ld.param.u64    %rd3, [add_param_2];
    ld.param.u64    %rd4, [add_param_3];
    ld.param.u64    %rd5, [add_param_4];
    mov.u32     %r1, %ntid.x;
    mov.u32     %r2, %ctaid.x;
    mov.u32     %r3, %tid.x;
    mad.lo.s32  %r4, %r1, %r2, %r3;
    cvt.u64.u32     %rd1, %r4;
    setp.ge.u64     %p1, %rd1, %rd6;
    @%p1 bra    $L__BB0_4;

    setp.lt.u64     %p2, %rd1, %rd4;
    @%p2 bra    $L__BB0_3;
    bra.uni     $L__BB0_2;

$L__BB0_3:
    cvta.to.global.u64  %rd7, %rd5;
    shl.b64     %rd8, %rd1, 2;
    add.s64     %rd9, %rd7, %rd8;
    cvta.to.global.u64  %rd10, %rd3;
    add.s64     %rd11, %rd10, %rd8;
    ld.global.nc.f32    %f1, [%rd11];
    cvta.to.global.u64  %rd12, %rd2;
    add.s64     %rd13, %rd12, %rd8;
    ld.global.nc.f32    %f2, [%rd13];
    add.f32     %f3, %f2, %f1;
    st.global.f32   [%rd9], %f3;

$L__BB0_4:
    ret;

$L__BB0_2:
    trap;

}
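For readers less used to PTX, the control flow of the kernel above can be modelled on the CPU. The following is only a sketch based on reading the PTX: it assumes the five parameters are `(a_ptr, a_len, b_ptr, b_len, c_ptr)`, as two Rust slices plus an output pointer would lower to, and the name `add_thread` and the slice-based output are hypothetical stand-ins for illustration.

```rust
/// CPU model of a single GPU thread executing the `add` kernel.
/// `ntid_x`, `ctaid_x`, `tid_x` mirror the PTX special registers
/// %ntid.x, %ctaid.x and %tid.x.
/// Returns Err where the PTX would execute `trap`.
fn add_thread(
    ntid_x: u32,
    ctaid_x: u32,
    tid_x: u32,
    a: &[f32],
    b: &[f32],
    c: &mut [f32],
) -> Result<(), &'static str> {
    // mad.lo.s32 keeps the low 32 bits, so model it with wrapping arithmetic;
    // cvt.u64.u32 then zero-extends the result.
    let idx = ntid_x.wrapping_mul(ctaid_x).wrapping_add(tid_x) as usize;

    // setp.ge.u64 %p1, idx, a_len -> branch to $L__BB0_4 (plain return).
    if idx >= a.len() {
        return Ok(());
    }
    // setp.lt.u64 %p2, idx, b_len -> otherwise fall through to $L__BB0_2 (trap).
    if idx >= b.len() {
        return Err("trap: index out of bounds for second input");
    }
    // $L__BB0_3: load both inputs, add, store.
    // The PTX does no bounds check on the output pointer; this model
    // assumes `c` is at least as long as `a`.
    c[idx] = a[idx] + b[idx];
    Ok(())
}
```

Calling `add_thread` once per `tid_x` in a block reproduces an element-wise sum of `a` and `b`. Nothing in this logic is unusual, which is consistent with the load failure coming from how the module is built or targeted and not from the kernel body.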

An important note: all of this is compiled in an Ubuntu 18.04 arm64 container with CUDA 11.4, but the binary is then moved to the L4T-runtime container (which is needed for the Jetson device), and that container only supports CUDA 10.2. The docs in the Getting Started section of this repo seem to indicate that such a setup should be fine, though I may have misinterpreted that statement.
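One thing worth checking in a build-on-11.4, run-on-10.2 setup is the PTX header itself. The sketch below reads the `.version` and `.target` directives out of a PTX string and compares the ISA version against the newest one a driver accepts. The function names are hypothetical, and the limit has to be supplied by the caller: to my knowledge CUDA 10.2 corresponds to PTX ISA 6.5 and the Jetson Nano GPU is compute capability 5.3 (`sm_53`), but both figures are assumptions here and should be confirmed against NVIDIA's PTX ISA release notes and device documentation.

```rust
/// PTX ISA version as (major, minor), e.g. `.version 7.4` -> (7, 4).
type PtxVersion = (u32, u32);

/// Extracts the `.version` and `.target` directives from PTX text.
/// Either value is `None` if the directive is missing or malformed.
fn parse_ptx_header(ptx: &str) -> (Option<PtxVersion>, Option<String>) {
    let mut version = None;
    let mut target = None;
    for line in ptx.lines() {
        let line = line.trim();
        if let Some(rest) = line.strip_prefix(".version") {
            // Expect "<major>.<minor>" after the directive name.
            let mut parts = rest.trim().split('.');
            if let (Some(major), Some(minor)) = (parts.next(), parts.next()) {
                if let (Ok(major), Ok(minor)) =
                    (major.trim().parse::<u32>(), minor.trim().parse::<u32>())
                {
                    version = Some((major, minor));
                }
            }
        } else if let Some(rest) = line.strip_prefix(".target") {
            // `.target` may list several comma-separated options; keep the first.
            target = rest.trim().split(',').next().map(|s| s.trim().to_string());
        }
    }
    (version, target)
}

/// True when the PTX ISA version is no newer than the driver's maximum.
/// `driver_max` is caller-supplied; (6, 5) for CUDA 10.2 is an assumption.
fn driver_can_jit(ptx: PtxVersion, driver_max: PtxVersion) -> bool {
    ptx <= driver_max
}
```

For the header shown above, `parse_ptx_header` yields `(7, 4)` and `sm_61`. If the 10.2 driver really tops out at ISA 6.5, `driver_can_jit((7, 4), (6, 5))` is false, which would be one plausible explanation for `CUDA_ERROR_INVALID_PTX`. The `sm_61` target would be a second candidate if the device is in fact `sm_53`.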

Any ideas on what is causing this issue?

RDambrosio016 commented 2 years ago

Unless the codegen is producing invalid PTX (please post the PTX if you have it so I can check), this is not an issue I can help much with, especially since I've never used Jetson GPUs or CUDA on aarch64. You are probably better off making a test case in CUDA C++ or another tool and posting it to the NVIDIA forums.

thedodd commented 2 years ago

Updated the above with the PTX. Yeah, I was going to try compiling the code directly on the device before building a C++ test case, but the device only has CUDA 10.2, so I don't think that will actually work (according to the Getting Started guide, anyway).

Thanks boss.

RDambrosio016 commented 2 years ago

The PTX looks correct, so this is a bit outside the project's scope. I would suggest making a C++ test case and opening a forum post on the NVIDIA website; the people there are very helpful :)