Device function malloc Throws Hardware Exception

matinraayai commented 3 years ago

Hello, After reading the HIP programming documentation, I was under the impression that calling malloc inside a __global__ function is supported; However, the following code throws the following exception when compiled with hipcc. Compiling with nvcc works as intended.

#include <hip/hip_runtime.h>
#include <iostream>

#if defined(__CUDACC__)
    #define SIZE 32
#elif defined(__HIP__)
    #define SIZE 64
#endif

__device__ int my_global_memory[SIZE];

__global__
void k_print_vector() {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int* my_ptr;
    my_ptr = (int*) malloc(sizeof(int));
    *my_ptr = i;
    my_global_memory[i] = i;
#if defined(__CUDACC__)
    printf("CUDA printf: element idx [%d]: %d, %d\n", i, my_global_memory[i], *my_ptr);
    free(my_ptr);
#endif
#if defined(__HIP__)
    printf("HIP printf: element idx [%d]: %d, %d\n", i, my_global_memory[i], *my_ptr);
#endif
}

int main() {
    hipLaunchKernelGGL(k_print_vector, 1, SIZE, 0, 0);
    auto err = hipDeviceSynchronize();
    if (err != hipSuccess) {
        std::cerr << "Kernel Launch failed! error code: " << err << std::endl;
    }
    return 0;
}

AMD output:

:0:rocdevice.cpp            :2533: 9757626138808 us: Device::callbackQueue aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016

CUDA output:

CUDA printf: element idx [0]: 0, 0
CUDA printf: element idx [1]: 1, 1
CUDA printf: element idx [2]: 2, 2
CUDA printf: element idx [3]: 3, 3
CUDA printf: element idx [4]: 4, 4
CUDA printf: element idx [5]: 5, 5
CUDA printf: element idx [6]: 6, 6
CUDA printf: element idx [7]: 7, 7
CUDA printf: element idx [8]: 8, 8
CUDA printf: element idx [9]: 9, 9
CUDA printf: element idx [10]: 10, 10
CUDA printf: element idx [11]: 11, 11
CUDA printf: element idx [12]: 12, 12
CUDA printf: element idx [13]: 13, 13
CUDA printf: element idx [14]: 14, 14
CUDA printf: element idx [15]: 15, 15
CUDA printf: element idx [16]: 16, 16
CUDA printf: element idx [17]: 17, 17
CUDA printf: element idx [18]: 18, 18
CUDA printf: element idx [19]: 19, 19
CUDA printf: element idx [20]: 20, 20
CUDA printf: element idx [21]: 21, 21
CUDA printf: element idx [22]: 22, 22
CUDA printf: element idx [23]: 23, 23
CUDA printf: element idx [24]: 24, 24
CUDA printf: element idx [25]: 25, 25
CUDA printf: element idx [26]: 26, 26
CUDA printf: element idx [27]: 27, 27
CUDA printf: element idx [28]: 28, 28
CUDA printf: element idx [29]: 29, 29
CUDA printf: element idx [30]: 30, 30
CUDA printf: element idx [31]: 31, 31

Could you please clarify the development status of this feature? We are teaching a course on HIP so it would help us get the correct information across.

Thanks, Matin

b-sumner commented 1 year ago

@matinraayai at the time this was opened, the malloc implementation was unfortunately not ready for use and the documentation should have said so. The implementation available in ROCm 5.4 and later should work as you expect.

psychocoderHPC commented 1 year ago

For all following this issues there is an malloc test/example in the code base: https://github.com/ROCm-Developer-Tools/HIP/blob/78aaa848a4470eb78c5e25f615856d51462b6ed6/tests/src/deviceLib/hipDeviceMalloc.cpp

psychocoderHPC commented 1 year ago

@b-sumner Is it required to set an allocation pool size for the on-device malloc operation equal to cudaDeviceSetLimit on NVIDIA cards? I know this was in the early version of ROCm a compile-time definition. I found in the documentation hipDeviceSetMemPool, ... but it is not clear what kind of mem-pool this function influences and how to use it correctly. I do not find an example of how to use hipDeviceSetMemPool and other related parts.

b-sumner commented 1 year ago

It is not. This implementation allows the "heap" to grow as large as needed. Note that this differs from cuda and applications using this feature may have trouble when running elsewhere.

matinraayai commented 1 year ago

@b-sumner thanks for the update.

ROCm / ROCm-Device-Libs

Device function malloc Throws Hardware Exception #76