beehive-lab / TornadoVM

TornadoVM: A practical and efficient heterogeneous programming framework for managed languages
https://www.tornadovm.org
Apache License 2.0

Randomized failure of PTX JIT Compilation #528

Open PolyRocketMatt opened 2 months ago

PolyRocketMatt commented 2 months ago

Describe the bug

When I run my application on the latest TornadoVM build, it occasionally throws an error indicating that PTX JIT compilation failed:

Unable to compile task 300502d1-daec-4e34-b335-8fde2503eb00.mxm - addFloat
The internal error is: [Error During the Task Compilation]

How To Reproduce

My Main.java simply runs the following:

public class Main {

    public static void main(String[] args) {
        new Main().run();
    }

    public void run() {
        FloatArray nativeBuffer = getFrom...();
        FloatArray nativeResultBuffer = getFrom...();
        int size = 1024;

        // Create a task-graph whose task points to an existing Java method
        // that can be accelerated on a GPU/FPGA
        TaskGraph taskGraph = new TaskGraph(UUID.randomUUID().toString())
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, nativeBuffer, nativeResultBuffer) // Transfer data from host to device only in the first execution
                .task("mxm", AdditionTask::addFloat, nativeBuffer, nativeResultBuffer, 1.0f, size)             // Each task points to an existing Java method
                .transferToHost(DataTransferMode.EVERY_EXECUTION, nativeResultBuffer);     // Transfer data from device to host

        // Create an immutable task-graph
        ImmutableTaskGraph immutableTaskGraph = taskGraph.snapshot();

        // Create an execution plan from an immutable task-graph
        try (TornadoExecutionPlan executionPlan = new TornadoExecutionPlan(immutableTaskGraph)) {

            // Run the execution plan on the default device
            TornadoExecutionResult executionResult = executionPlan.execute();

            if (executionResult.isReady()) {
               ...
            }
        } catch (TornadoExecutionPlanException ex) {
            ex.printStackTrace();
        }
    }
}

The AdditionTask class looks like this:

public class AdditionTask implements BufferTask {

    public static void addFloat(@NotNull FloatArray input, @NotNull FloatArray output,
                                float value, int size) {
        for (@Parallel int i = 0; i < size; i++)
            output.set(i, 1.0f);
    }
}

In this case, BufferTask is just an empty interface.
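For completeness, that interface is essentially just:

public interface BufferTask {
    // Intentionally empty; only used as a marker type.
}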

Expected behavior

I expect the code to run without throwing any compilation errors. It does succeed in some runs, but not in all of them.

Computing system setup (please complete the following information):

Additional context

The attached log was generated using the --debug flag and comes from my original program; the minimal reproducible example above should still be a valid proxy.

debug.txt


jjfumero commented 2 months ago

Hi @PolyRocketMatt , thanks for the report.

To see the generated PTX code, you can use the --printKernel option of the tornado command.

Having a quick look at the issue, I am not sure the @NotNull annotation is supported. It might be that the IR is not clean (i.e., not ready to be consumed by the PTX code generator) because the annotation introduces extra nodes. Can you check without it?
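For clarity, the check would just be your addFloat with the annotations removed, for example:

public static void addFloat(FloatArray input, FloatArray output, float value, int size) {
    for (@Parallel int i = 0; i < size; i++) {
        output.set(i, 1.0f);
    }
}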

PolyRocketMatt commented 2 months ago

Hi, thanks for the fast response.

I have tried compiling and running the code without the @NotNull annotation, but it still only works occasionally. This is the generated PTX code, which uses PTX ISA version 7.6:

.version 7.6
.target sm_86
.address_size 64

.visible .entry 90375540_e9cf_461e_b9a2_d7dc8d46e67e_mxm_addfloat_arrays_floatarray_arrays_floatarray_1_0_1024(.param .u64 .ptr .global .align 8 kernel_context, .param .u64 .ptr .global .align 8 input, .param .u64 .ptr .global .align 8 output, .param .align 8 .u64 value, .param .align 8 .u64 size) {
        .reg .s64 rsd<3>;
        .reg .u64 rud<5>;
        .reg .s32 rsi<6>;
        .reg .pred rpb<2>;
        .reg .u32 rui<5>;

BLOCK_0:
        ld.param.u64    rud0, [kernel_context];
        ld.param.u64    rud1, [output];
        mov.u32 rui0, %nctaid.x;
        mov.u32 rui1, %ntid.x;
        mul.wide.u32    rud2, rui0, rui1;
        cvt.s32.u64     rsi0, rud2;
        mov.u32 rui2, %tid.x;
        mov.u32 rui3, %ctaid.x;
        mad.lo.s32      rsi1, rui3, rui1, rui2;

BLOCK_1:
        mov.s32 rsi2, rsi1;
LOOP_COND_1:
        setp.lt.s32     rpb0, rsi2, 1024;
        @!rpb0 bra      BLOCK_3;

BLOCK_2:
        add.s32 rsi3, rsi2, 6;
        cvt.s64.s32     rsd0, rsi3;
        shl.b64 rsd1, rsd0, 2;
        add.u64 rud3, rud1, rsd1;
        st.global.f32   [rud3], 0F3F800000;
        add.s32 rsi4, rsi0, rsi2;
        mov.s32 rsi2, rsi4;
        bra.uni LOOP_COND_1;

BLOCK_3:
        ret;
}
PolyRocketMatt commented 2 months ago

I'm not sure if this is the solution, but after removing the UUID from the name of the task graph (and consequently from the generated PTX kernel name), the application seems to behave stably. I'll do some further experimentation to see whether this continues to fix the errors I was getting.
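Concretely, the change is only in the task-graph constructor, e.g. using a fixed name such as "s0" instead of the random UUID:

// Before: the graph (and hence the generated PTX kernel) name contained a random UUID
// TaskGraph taskGraph = new TaskGraph(UUID.randomUUID().toString())

// After: a fixed, identifier-friendly name
TaskGraph taskGraph = new TaskGraph("s0")
        .transferToDevice(DataTransferMode.FIRST_EXECUTION, nativeBuffer, nativeResultBuffer)
        .task("mxm", AdditionTask::addFloat, nativeBuffer, nativeResultBuffer, 1.0f, size)
        .transferToHost(DataTransferMode.EVERY_EXECUTION, nativeResultBuffer);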