ROCm / rocm_smi_lib

ROCm SMI LIB
https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/
MIT License

`rsmi_init` fails during OMPT initialization when target offloading is used #129

Closed Thyre closed 2 months ago

Thyre commented 10 months ago

While testing how our HIP adapter in Score-P interacts with OpenMP target regions, I've encountered the following issue preventing me from testing it.

In Score-P, adapters are divided into several subsystems. Upon startup, one subsystem might initialize all the others. In the case of OMPT, its subsystem will probably be the first one to run and will initialize all the others during ompt_start_tool. It's exactly here that we run into an issue.

Looking at the following source code, we can see what's happening:

```c
#include <stdio.h>
#include <omp-tools.h>
#include <rocm_smi/rocm_smi.h>

#define PRINT_RSMI_ERR(RET) { \
  if (RET != RSMI_STATUS_SUCCESS) { \
    printf("[ERROR] RSMI call returned %d at line %d\n", (RET), __LINE__); \
    const char* error_string; \
    rsmi_status_string( (RET), &error_string ); \
    printf("[ERROR MESSAGE] %s\n", error_string); \
  } \
}

static int
initialize_tool( ompt_function_lookup_t lookup,
                 int                    initialDeviceNum,
                 ompt_data_t*           toolData )
{
    return 1; /* non-zero indicates success */
}

static void
finalize_tool( ompt_data_t* toolData )
{}

ompt_start_tool_result_t*
ompt_start_tool( unsigned int omp_version, /* == _OPENMP */
                 const char*  runtime_version )
{
    static ompt_start_tool_result_t tool = { &initialize_tool,
                                             &finalize_tool,
                                             ompt_data_none };
    rsmi_status_t ret;

    ret = rsmi_init(0);
    PRINT_RSMI_ERR(ret)

    return &tool;
}

int main( void )
{

}
```

Most of the code is just there to build a valid OMPT interface. When the program runs, ompt_start_tool gets called and tries to initialize rocm-smi via rsmi_init. However, because we're still inside ompt_start_tool, the initialization fails.

```
$ amdclang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908 -lrocm_smi64 reproducer.c
$ ./a.out
[ERROR] RSMI call returned 8 at line 38
[ERROR MESSAGE] RSMI_STATUS_INIT_ERROR: An error occurred during initialization, during monitor discovery or when when initializing internal data structures
```

The question is: Is this intended? I also observed that other HIP-related functions such as hipGetDeviceCount fail with a segmentation fault, which led me to believe that nothing in the ROCm stack is initialized and ready to use during the ompt_start_tool call.
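A possible workaround, sketched under assumptions rather than tested against ROCm: move the initialization out of ompt_start_tool and defer it to the first point where the tool actually needs device data. In the sketch below, backend_init is a hypothetical stand-in for rsmi_init(0) and on_first_device_event is a hypothetical tool callback; neither is part of rocm_smi_lib or OMPT. Whether the ROCm stack is ready at that later point is itself an assumption.

```c
#include <stdio.h>

/* Hypothetical stand-in for rsmi_init(0); returns 0 on success.
 * The counter only exists to show how often the init really runs. */
static int backend_init_calls = 0;

static int backend_init(void)
{
    backend_init_calls++;
    return 0;
}

/* Lazy-init state: not thread-safe as written. A real tool would
 * likely guard this with pthread_once or a mutex. */
static int init_attempted = 0;
static int init_status    = -1;

/* Runs backend_init() on the first call only and caches its result. */
static int ensure_backend_initialized(void)
{
    if (!init_attempted) {
        init_attempted = 1;
        init_status    = backend_init();
    }
    return init_status;
}

/* Hypothetical callback that needs the backend. It is invoked after
 * program start-up, not from inside ompt_start_tool. */
static int on_first_device_event(void)
{
    if (ensure_backend_initialized() != 0) {
        fprintf(stderr, "backend unavailable, skipping device metrics\n");
        return -1;
    }
    return 0;
}
```

With this shape, ompt_start_tool only returns the tool struct, and a failed initialization degrades to missing metrics instead of aborting start-up.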

Thyre commented 2 months ago

I can confirm that the issue is fixed with LLVM 19git, so the fix will eventually land in ROCm as well. As the limitation seems to come from trying to call library functions during _dl_start_user, it should probably be documented somewhere, if that has not been done already.

CUDA for example includes this paragraph in their documentation:

> The CUDA interfaces use global state that is initialized during host program initiation and destroyed during host program termination. The CUDA runtime and driver cannot detect if this state is invalid, so using any of these interfaces (implicitly or explicitly) during program initiation (or termination after main) will result in undefined behavior.
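To illustrate what "during program initiation" means in practice, here is a small sketch using the GCC/Clang constructor attribute. It is not how ROCm or CUDA is implemented. It only shows that code can run before main and that the order among such hooks decides what is already set up. The names record, library_setup and tool_early_hook are hypothetical.

```c
#include <stdio.h>

/* Records the order in which the pre-main hooks ran. */
static int order[2];
static int order_count = 0;

static void record(int id)
{
    if (order_count < 2) {
        order[order_count++] = id;
    }
}

/* GCC/Clang extension: constructors run before main(), while the
 * program image is being initialized. Lower priority values run
 * first; values up to 100 are reserved for the implementation. */
__attribute__((constructor(101)))
static void library_setup(void)
{
    record(101); /* stands in for a library preparing its global state */
}

__attribute__((constructor(102)))
static void tool_early_hook(void)
{
    record(102); /* stands in for a tool hook that relies on that state */
}
```

Here the priorities force the "library" to be ready first. Without such a guarantee across shared objects, an early hook can observe uninitialized state, which matches the behaviour described in this issue.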

I'm closing the issue.