ROCm / triton

Development repository for the Triton language and compiler
MIT License

[Issue]: Unified memory tensors aren't seen as accessible to Triton #516

Closed: joerowell closed this issue 1 month ago

joerowell commented 4 months ago

Problem Description

I have some code that returns a PyTorch tensor from C++ that's backed by unified memory (e.g. via hipMallocManaged). On CUDA platforms I can pass the resulting pointer directly to Triton and the compiler doesn't complain. On HIP-based systems, however, I get this error:

Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)

It looks like this is because the following prints Device pointer! on CUDA platforms:

#include <cuda.h>
#include <cuda_runtime_api.h>
#include <iostream>

int main() {
        int *ptr;
        // Allocate 100 ints in unified (managed) memory.
        cudaError_t err = cudaMallocManaged(&ptr, sizeof(int) * 100, cudaMemAttachGlobal);
        if (err != cudaSuccess) {
                std::cerr << "Err: could not malloc in unified memory!" << std::endl;
                return 1;
        }

        // Ask the driver whether the allocation is reachable as a device pointer.
        CUdeviceptr dev_ptr;
        CUresult status = cuPointerGetAttribute(&dev_ptr, CU_POINTER_ATTRIBUTE_DEVICE_POINTER, (CUdeviceptr)ptr);
        if (status == CUDA_ERROR_INVALID_VALUE) {
                std::cerr << "Err: not device pointer!" << std::endl;
        } else {
                std::cerr << "Device pointer!" << std::endl;
        }

        cudaFree(ptr);
}

I can't test it at the moment, but my gut feeling is that the same is not true for HIP?

Operating System

Ubuntu 22.04

CPU

-

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.0.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

zhanglx13 commented 4 months ago

@jataylo Do you know who might have experience with this issue?

joerowell commented 4 months ago

I've spent some time today looking more into this. Here's the HIP analogue of the snippet above:

#include <hip/hip_runtime_api.h>
#include <iostream>

int main() {
        int *ptr;
        // Allocate 100 ints in unified (managed) memory.
        hipError_t err = hipMallocManaged(&ptr, sizeof(int) * 100, hipMemAttachGlobal);
        if (err != hipSuccess) {
                std::cerr << "Err: could not malloc in unified memory!" << std::endl;
                return 1;
        }

        // Mirror the CUDA check: is the managed allocation visible as a device pointer?
        hipDeviceptr_t dev_ptr;
        hipError_t status = hipPointerGetAttribute(&dev_ptr, HIP_POINTER_ATTRIBUTE_DEVICE_POINTER, (hipDeviceptr_t)ptr);
        if (status == hipErrorInvalidValue) {
                std::cerr << "Err: not device pointer!" << std::endl;
        } else {
                std::cerr << "Device pointer!" << std::endl;
        }

        hipFree(ptr);
}

I'll keep digging.

joerowell commented 4 months ago

So I've dug more into this.

It looks like the issue is that whenever a Triton kernel is loaded for the first time, the HSA stack is reinitialised. This doesn't happen when e.g. PyTorch launches a HIP kernel, so I find this behaviour odd.

As evidence, here are some trimmed logs. When the program loads, we get an initialisation output:

----------------------------- Captured stderr call -----------------------------
:3:rocdevice.cpp            :445 : 1930722022926 us: [pid:1803375 tid:0x7ff4d12eb740] Initializing HSA stack.
:3:comgrctx.cpp             :33  : 1930722121088 us: [pid:1803375 tid:0x7ff4d12eb740] Loading COMGR library.
:3:rocdevice.cpp            :213 : 1930722121152 us: [pid:1803375 tid:0x7ff4d12eb740] Numa selects cpu agent[0]=0x563c0ecf4e10(fine=0x563c0ebb8b90,coarse=0x563c0ed4dc60) for gpu agent=0x563c0edbe790 CPU<->GPU XGMI=0
:3:rocdevice.cpp            :1716: 1930722122316 us: [pid:1803375 tid:0x7ff4d12eb740] Gfx Major/Minor/Stepping: 9/4/2
:3:rocdevice.cpp            :1718: 1930722122322 us: [pid:1803375 tid:0x7ff4d12eb740] HMM support: 1, XNACK: 0, Direct host access: 0
:3:rocdevice.cpp            :1720: 1930722122324 us: [pid:1803375 tid:0x7ff4d12eb740] Max SDMA Read Mask: 0xffff, Max SDMA Write Mask: 0xffff
:4:rocdevice.cpp            :2099: 1930722122389 us: [pid:1803375 tid:0x7ff4d12eb740] Allocate hsa host memory 0x7ff4cc4d6000, size 0x38
:4:rocdevice.cpp            :2099: 1930722122692 us: [pid:1803375 tid:0x7ff4d12eb740] Allocate hsa host memory 0x7fefed800000, size 0x101000
:4:rocdevice.cpp            :2099: 1930722123089 us: [pid:1803375 tid:0x7ff4d12eb740] Allocate hsa host memory 0x7fefed600000, size 0x101000
:4:runtime.cpp              :83  : 1930722123122 us: [pid:1803375 tid:0x7ff4d12eb740] init
:3:hip_context.cpp          :48  : 1930722123125 us: [pid:1803375 tid:0x7ff4d12eb740] Direct Dispatch: 1

Then later the unified memory allocation takes place:

:3:hip_hmm.cpp              :64  : 1930722812629 us: [pid:1803375 tid:0x7ff4d12eb740] hipMallocManaged ( 0x7fff8f396988, 40960000, 2 ) 
:3:hip_hmm.cpp              :304 : 1930722819403 us: [pid:1803375 tid:0x7ff4d12eb740] ihipMallocManaged ptr=0x7feaa38f0000
:3:hip_hmm.cpp              :71  : 1930722819409 us: [pid:1803375 tid:0x7ff4d12eb740] hipMallocManaged: Returned hipSuccess : 0x7feaa38f0000

And when the Triton kernel loads, the HSA is reinitialised:

:3:rocdevice.cpp            :442 : 1930722824481 us: [pid:1803375 tid:0x7ff4d12eb740] Initializing HSA stack.
:3:comgrctx.cpp             :33  : 1930722824534 us: [pid:1803375 tid:0x7ff4d12eb740] Loading COMGR library.
:3:rocdevice.cpp            :208 : 1930722824572 us: [pid:1803375 tid:0x7ff4d12eb740] Numa selects cpu agent[0]=0x563c0ecf4e10(fine=0x563c0ebb8b90,coarse=0x563c0ed4dc60) for gpu agent=0x563c0edbe790 CPU<->GPU XGMI=0
:3:rocdevice.cpp            :1680: 1930722825173 us: [pid:1803375 tid:0x7ff4d12eb740] Gfx Major/Minor/Stepping: 9/4/2
:3:rocdevice.cpp            :1682: 1930722825177 us: [pid:1803375 tid:0x7ff4d12eb740] HMM support: 1, XNACK: 0, Direct host access: 0
:3:rocdevice.cpp            :1684: 1930722825179 us: [pid:1803375 tid:0x7ff4d12eb740] Max SDMA Read Mask: 0x7c460a0a, Max SDMA Write Mask: 0x89907e12
:4:rocdevice.cpp            :2063: 1930722825234 us: [pid:1803375 tid:0x7ff4d12eb740] Allocate hsa host memory 0x7ff4cac16000, size 0x38
:4:rocdevice.cpp            :2063: 1930722825543 us: [pid:1803375 tid:0x7ff4d12eb740] Allocate hsa host memory 0x7feaa1d00000, size 0x101000
:4:rocdevice.cpp            :2063: 1930722825876 us: [pid:1803375 tid:0x7ff4d12eb740] Allocate hsa host memory 0x7feaa1b00000, size 0x101000
:4:runtime.cpp              :83  : 1930722825912 us: [pid:1803375 tid:0x7ff4d12eb740] init
:3:hip_context.cpp          :48  : 1930722825915 us: [pid:1803375 tid:0x7ff4d12eb740] Direct Dispatch: 1
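(For anyone trying to reproduce this: traces like the above come from ROCclr's debug logging, enabled via the AMD_LOG_LEVEL environment variable; repro.py below is a placeholder for the failing script.)

```shell
# AMD_LOG_LEVEL=4 turns on ROCclr's most verbose logging, which is where the
# ":3:"/":4:" trace lines above come from; stderr is redirected to a file.
AMD_LOG_LEVEL=4 python repro.py 2> rocclr.log

# Counting runtime initialisations: a count of 2 means the HSA stack
# was brought up twice in the same process.
grep -c "Initializing HSA stack" rocclr.log
```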

I'm not sure whether this should be filed as a ROCclr bug, or whether it's down to how this version of Triton initialises the runtime. Do you have any thoughts?

joerowell commented 3 months ago

I've investigated this further.

The issue is that the ROCm version of Triton uses the libamdhip64.so that comes with the system-wide ROCm installation. However, pytorch-rocm also ships its own copy of libamdhip64.so, which is what PyTorch itself uses. That means two separate runtimes get initialised in the same process, so there's very little guarantee that memory allocated under one runtime can be used by the other.
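If that theory is right, it should be visible in the process's memory maps. A hedged sketch (`<pid>` is the id of the running Python process, and the exact paths will differ per install):

```shell
# List every distinct libamdhip64.so image mapped into the process.
# Seeing both a system copy (e.g. under /opt/rocm) and the wheel's bundled
# copy (under site-packages/torch/lib) would confirm two live HIP runtimes.
grep libamdhip64 /proc/<pid>/maps | awk '{print $6}' | sort -u
```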

jerryyin commented 1 month ago

This is already fixed by upstream Triton. Let us know if you still bump into this.