Closed joerowell closed 1 month ago
@jataylo Do you know who might have experience with this issue?
I've spent some time today looking more into this.
Whether I pass `hipMemAttachGlobal` or `hipMemAttachHost`, the behaviour is the same:

```cpp
#include <hip/hip_runtime_api.h>
#include <iostream>

int main() {
  int* ptr;
  // Allocate unified (managed) memory, visible to both host and device.
  auto err = hipMallocManaged(&ptr, sizeof(int) * 100, hipMemAttachGlobal);
  if (err != hipSuccess) {
    std::cerr << "Err: could not malloc in unified memory!" << std::endl;
    return 1;
  }
  // Ask the runtime for the device pointer backing this allocation.
  hipDeviceptr_t dev_ptr;
  hipError_t status = hipPointerGetAttribute(
      &dev_ptr, HIP_POINTER_ATTRIBUTE_DEVICE_POINTER, (hipDeviceptr_t)ptr);
  if (status == hipErrorInvalidValue) {
    std::cerr << "Err: not device pointer!" << std::endl;
  } else {
    std::cerr << "Device pointer!" << std::endl;
  }
  hipFree(ptr);
}
```
I'll keep digging.
So I've dug more into this.
It looks like the issue is that whenever a Triton kernel is loaded for the first time, the HSA stack is reinitialised. This doesn't happen when, e.g., PyTorch launches a HIP kernel, which makes the behaviour odd.
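For reference, the traces below come from the ROCm runtime's built-in logging. A minimal sketch of enabling it from Python (using the standard rocclr logging variables `AMD_LOG_LEVEL` and `AMD_LOG_MASK`; the `import torch` line is commented out so the sketch runs without a ROCm install):

```python
import os

# The ROCm runtime (rocclr) reads AMD_LOG_LEVEL when libamdhip64.so is
# loaded; level 4 produces the verbose ":3:"/":4:" traces shown below.
# It must be set before the first `import torch` / `import triton`,
# because that import is what loads the HIP runtime.
os.environ["AMD_LOG_LEVEL"] = "4"
os.environ["AMD_LOG_MASK"] = "0xFFFFFFFF"  # enable all log categories

# import torch  # would now emit "Initializing HSA stack." etc. on stderr
print(os.environ["AMD_LOG_LEVEL"])
```

Running the failing test with these set is how the "Captured stderr call" output below was obtained.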
As evidence, here's some trimmed logs. When the program loads, we get an initialisation output:
----------------------------- Captured stderr call -----------------------------
:3:rocdevice.cpp :445 : 1930722022926 us: [pid:1803375 tid:0x7ff4d12eb740] Initializing HSA stack.
:3:comgrctx.cpp :33 : 1930722121088 us: [pid:1803375 tid:0x7ff4d12eb740] Loading COMGR library.
:3:rocdevice.cpp :213 : 1930722121152 us: [pid:1803375 tid:0x7ff4d12eb740] Numa selects cpu agent[0]=0x563c0ecf4e10(fine=0x563c0ebb8b90,coarse=0x563c0ed4dc60) for gpu agent=0x563c0edbe790 CPU<->GPU XGMI=0
:3:rocdevice.cpp :1716: 1930722122316 us: [pid:1803375 tid:0x7ff4d12eb740] Gfx Major/Minor/Stepping: 9/4/2
:3:rocdevice.cpp :1718: 1930722122322 us: [pid:1803375 tid:0x7ff4d12eb740] HMM support: 1, XNACK: 0, Direct host access: 0
:3:rocdevice.cpp :1720: 1930722122324 us: [pid:1803375 tid:0x7ff4d12eb740] Max SDMA Read Mask: 0xffff, Max SDMA Write Mask: 0xffff
:4:rocdevice.cpp :2099: 1930722122389 us: [pid:1803375 tid:0x7ff4d12eb740] Allocate hsa host memory 0x7ff4cc4d6000, size 0x38
:4:rocdevice.cpp :2099: 1930722122692 us: [pid:1803375 tid:0x7ff4d12eb740] Allocate hsa host memory 0x7fefed800000, size 0x101000
:4:rocdevice.cpp :2099: 1930722123089 us: [pid:1803375 tid:0x7ff4d12eb740] Allocate hsa host memory 0x7fefed600000, size 0x101000
:4:runtime.cpp :83 : 1930722123122 us: [pid:1803375 tid:0x7ff4d12eb740] init
:3:hip_context.cpp :48 : 1930722123125 us: [pid:1803375 tid:0x7ff4d12eb740] Direct Dispatch: 1
Then later the unified memory allocation takes place:
:3:hip_hmm.cpp :64 : 1930722812629 us: [pid:1803375 tid:0x7ff4d12eb740] hipMallocManaged ( 0x7fff8f396988, 40960000, 2 )
:3:hip_hmm.cpp :304 : 1930722819403 us: [pid:1803375 tid:0x7ff4d12eb740] ihipMallocManaged ptr=0x7feaa38f0000
:3:hip_hmm.cpp :71 : 1930722819409 us: [pid:1803375 tid:0x7ff4d12eb740] hipMallocManaged: Returned hipSuccess : 0x7feaa38f0000
And when the Triton kernel loads, the HSA is reinitialised:
:3:rocdevice.cpp :442 : 1930722824481 us: [pid:1803375 tid:0x7ff4d12eb740] Initializing HSA stack.
:3:comgrctx.cpp :33 : 1930722824534 us: [pid:1803375 tid:0x7ff4d12eb740] Loading COMGR library.
:3:rocdevice.cpp :208 : 1930722824572 us: [pid:1803375 tid:0x7ff4d12eb740] Numa selects cpu agent[0]=0x563c0ecf4e10(fine=0x563c0ebb8b90,coarse=0x563c0ed4dc60) for gpu agent=0x563c0edbe790 CPU<->GPU XGMI=0
:3:rocdevice.cpp :1680: 1930722825173 us: [pid:1803375 tid:0x7ff4d12eb740] Gfx Major/Minor/Stepping: 9/4/2
:3:rocdevice.cpp :1682: 1930722825177 us: [pid:1803375 tid:0x7ff4d12eb740] HMM support: 1, XNACK: 0, Direct host access: 0
:3:rocdevice.cpp :1684: 1930722825179 us: [pid:1803375 tid:0x7ff4d12eb740] Max SDMA Read Mask: 0x7c460a0a, Max SDMA Write Mask: 0x89907e12
:4:rocdevice.cpp :2063: 1930722825234 us: [pid:1803375 tid:0x7ff4d12eb740] Allocate hsa host memory 0x7ff4cac16000, size 0x38
:4:rocdevice.cpp :2063: 1930722825543 us: [pid:1803375 tid:0x7ff4d12eb740] Allocate hsa host memory 0x7feaa1d00000, size 0x101000
:4:rocdevice.cpp :2063: 1930722825876 us: [pid:1803375 tid:0x7ff4d12eb740] Allocate hsa host memory 0x7feaa1b00000, size 0x101000
:4:runtime.cpp :83 : 1930722825912 us: [pid:1803375 tid:0x7ff4d12eb740] init
:3:hip_context.cpp :48 : 1930722825915 us: [pid:1803375 tid:0x7ff4d12eb740] Direct Dispatch: 1
I'm not sure if this should be filed as a rocclr bug, or if it's due to how this version of Triton initialises the runtime. Do you have any thoughts?
I've investigated this further.
The issue is that the ROCm version of Triton uses the `libamdhip64.so` that comes with the system-wide ROCm installation, while pytorch-rocm ships with (and uses) its own copy of `libamdhip64.so`. As a result, two separate HIP runtimes are initialised in the same process, and there is very little guarantee that memory allocated through one runtime can be used through the other.
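A quick way to confirm the two-runtime theory from inside the Python process is to look for multiple distinct mappings of `libamdhip64.so` in `/proc/self/maps` after importing both torch and triton. A sketch, assuming Linux; `loaded_copies` is a hypothetical helper and the sample paths are illustrative:

```python
def loaded_copies(maps_text: str, basename: str) -> list:
    """Return the distinct file paths in a /proc/<pid>/maps dump whose
    final path component starts with `basename`."""
    paths = set()
    for line in maps_text.splitlines():
        # The pathname, if present, is the last whitespace-separated field.
        parts = line.split()
        if parts and parts[-1].startswith("/"):
            path = parts[-1]
            if path.rsplit("/", 1)[-1].startswith(basename):
                paths.add(path)
    return sorted(paths)

# Illustrative maps excerpt: one HIP runtime from the system ROCm install
# and a second one bundled inside the pytorch-rocm wheel.
sample = """\
7f00000-7f10000 r-xp 0 08:01 1 /opt/rocm/lib/libamdhip64.so.6
7f20000-7f30000 r-xp 0 08:01 2 /opt/venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
7f40000-7f50000 r-xp 0 08:01 3 /usr/lib/x86_64-linux-gnu/libc.so.6
"""
copies = loaded_copies(sample, "libamdhip64.so")
print(len(copies))  # 2: two separate HIP runtimes in one process
```

In a live process, passing `open("/proc/self/maps").read()` instead of the sample text shows whether the process really has two copies of the runtime mapped.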
This is already fixed by upstream Triton. Let us know if you still bump into this.
Problem Description
I have some code that returns a PyTorch tensor from C++ that's backed by unified memory (e.g. via `hipMallocManaged`). On CUDA platforms, I can pass the resulting pointer directly to Triton and the compiler doesn't complain. On HIP-based systems, however, I get an error. It looks like this is because the pointer-attribute check above prints
Device pointer!
on CUDA platforms. I can't test it at the moment, but my gut feeling is that the same is not true for HIP?
Operating System
Ubuntu 22.04
CPU
-
GPU
AMD Instinct MI300X
ROCm Version
ROCm 6.0.0
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response