beehive-lab / TornadoVM

TornadoVM: A practical and efficient heterogeneous programming framework for managed languages
https://www.tornadovm.org
Apache License 2.0

Bitwise NOT operator causing PTX compile error #339

Open Benco11-developement opened 6 months ago

Benco11-developement commented 6 months ago

Describe the bug

PTX compilation fails when the bitwise NOT operator "~" is applied to an int. The generated PTX and the resulting error are:

.version 7.6
.target sm_61
.address_size 64

.visible .entry s0_t0_invert_arrays_intarray_arrays_intarray_4096(.param .u64 .ptr .global .align 8 kernel_context, .param .u64 .ptr .global .align 8 a, .param .u64 .ptr .global .align 8 b, .param .align 8 .u64 size) {
        .reg .s64 rsd<3>;
        .reg .u32 rui<5>;
        .reg .u64 rud<7>;
        .reg .pred rpb<2>;
        .reg .s32 rsi<8>;

BLOCK_0:
        ld.param.u64    rud0, [kernel_context];
        ld.param.u64    rud1, [a];
        ld.param.u64    rud2, [b];
        mov.u32 rui0, %nctaid.x;
        mov.u32 rui1, %ntid.x;
        mul.wide.u32    rud3, rui0, rui1;
        cvt.s32.u64     rsi0, rud3;
        mov.u32 rui2, %tid.x;
        mov.u32 rui3, %ctaid.x;
        mad.lo.s32      rsi1, rui3, rui1, rui2;

BLOCK_1:
        mov.s32 rsi2, rsi1;
LOOP_COND_1:
        setp.lt.s32     rpb0, rsi2, 4096;
        @!rpb0 bra      BLOCK_3;

BLOCK_2:
        add.s32 rsi3, rsi2, 6;
        cvt.s64.s32     rsd0, rsi3;
        shl.b64 rsd1, rsd0, 2;
        add.u64 rud4, rud1, rsd1;
        ld.global.s32   rsi4, [rud4];
        add.u64 rud5, rud2, rsd1;
        not.rn.b32      rsi5, rsi4;
        st.global.s32   [rud5], rsi5;
        add.s32 rsi6, rsi0, rsi2;
        mov.s32 rsi2, rsi6;
        bra.uni LOOP_COND_1;

BLOCK_3:
        ret;
}

[TornadoVM-PTX-JNI] ERROR : cuModuleLoadData -> Returned: 218
PTX to cubin JIT compilation failed! (218)
PTX JIT compilation failed!
[Bailout] Running the sequential implementation. Enable --debug to see the reason.

How To Reproduce

Run the following code:

import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;
import uk.ac.manchester.tornado.api.types.arrays.IntArray;

public class Main {

    public static void invert(IntArray a, IntArray b, int size) {
        for (@Parallel int i = 0; i < size; i++) {
            b.set(i, ~a.get(i));
        }
    }

    public static void main(String[] args) {
        int size = 4096;

        IntArray a = new IntArray(size);
        IntArray b = new IntArray(size);

        a.init(1);
        b.init(0);

        TaskGraph graph = new TaskGraph("s0")
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, a, b)
                .task("t0", Main::invert, a, b, size)
                .transferToHost(DataTransferMode.EVERY_EXECUTION, b);

        ImmutableTaskGraph immutableTaskGraph = graph.snapshot();
        TornadoExecutionPlan executionPlan = new TornadoExecutionPlan(immutableTaskGraph);

        executionPlan.execute();
    }
}

Expected behavior

It should compile and run normally.
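
For reference, here is a minimal plain-Java sketch of the result I would expect from the kernel. It does not use TornadoVM at all: the class name InvertReference and the use of int[] instead of IntArray are assumptions made purely for illustration. With every input element set to 1, as in the reproducer, every output element should be -2.

```java
import java.util.Arrays;

// Hypothetical sequential reference, not part of TornadoVM.
public class InvertReference {

    // Applies bitwise NOT to every element of a plain int array.
    static int[] invert(int[] a) {
        int[] b = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            b[i] = ~a[i];
        }
        return b;
    }

    public static void main(String[] args) {
        int size = 4096;
        int[] a = new int[size];
        Arrays.fill(a, 1); // mirrors a.init(1) in the reproducer

        int[] b = invert(a);
        System.out.println(b[0]); // prints -2, since ~1 == -2
    }
}
```

Comparing the contents of b after executionPlan.execute() against this reference would confirm whether the device result is correct once the kernel compiles.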

Computing system setup (please complete the following information):

Additional context

This works fine with the OpenCL backend.


jjfumero commented 6 months ago

Hi @Benco11-developement, thank you for the report. We will take a look.