ROCm / omnitrace

Omnitrace: Application Profiling, Tracing, and Analysis
https://rocm.docs.amd.com/projects/omnitrace/en/latest/
MIT License

OpenMP offloading #280

Open ooreilly opened 1 year ago

ooreilly commented 1 year ago

I'm trying omnitrace with OpenMP offloading on a small Fortran test code. Depending on which system I tested on, I encountered different issues. The test code is compiled with the HPE Cray compiler, CCE 15.0.1.

I either saw:

WARNING: Unrecognized OMPT entry_point request ompt_get_record_type
WARNING: Unrecognized OMPT entry_point request ompt_get_record_ompt
WARNING: Unrecognized OMPT entry_point request ompt_get_device_num_procs
WARNING: Unrecognized OMPT entry_point request ompt_callback_mutex
WARNING: Unrecognized OMPT entry_point request ompt_callback_nest_lock
WARNING: Unrecognized OMPT entry_point request ompt_callback_flush
WARNING: Unrecognized OMPT entry_point request ompt_callback_cancel
WARNING: Unrecognized OMPT entry_point request ompt_callback_dispatch
WARNING: Unrecognized OMPT entry_point request ompt_callback_buffer_request
WARNING: Unrecognized OMPT entry_point request ompt_callback_buffer_complete
WARNING: Unrecognized OMPT entry_point request ompt_callback_dependences
WARNING: Unrecognized OMPT entry_point request ompt_callback_task_dependence
[omnitrace][21794][2045] No signals to block...
[omnitrace][21794][2044] No signals to block...
[omnitrace][21794][OnLoad] Loading ROCm tooling...
[omnitrace][21794][0][OnLoad] Setting rocm_smi state to active...
[omnitrace][21794][0][OnLoad] Requesting roctracer to setup...
[omnitrace][21794][PID=21794][rank=0] Thread 1 [0x000000000000552b] (#5) (parent: 0 [0x0000000000005522] (#0)) created
[omnitrace][21794][PID=21794][rank=0] Thread 1 [0x000000000000552b] (#5) (parent: 0 [0x0000000000005522] (#0)) exited
 n =  1100000000
 Data size (read and write): 17.600000000000001 GB
terminate called after throwing an instance of 'std::runtime_error'
  what():  Error! nullptr to ompt_data_t! key = ompt_target_enter_data_dev_0

or:

OMNITRACE: HSA_TOOLS_LIB=/pfs/lustrep2/projappl/project_462000125/omnitrace/lib/libomnitrace-dl.so.1.10.0
OMNITRACE: HSA_TOOLS_REPORT_LOAD_FAILURE=1
OMNITRACE: LD_PRELOAD=/pfs/lustrep2/projappl/project_462000125/omnitrace/lib/libomnitrace-dl.so.1.10.0
OMNITRACE: OMP_TOOL_LIBRARIES=/pfs/lustrep2/projappl/project_462000125/omnitrace/lib/libomnitrace-dl.so.1.10.0
OMNITRACE: ROCP_HSA_INTERCEPT=1
OMNITRACE: ROCP_TOOL_LIB=/pfs/lustrep2/projappl/project_462000125/omnitrace/lib/libomnitrace.so.1.10.0
srun: error: nid007263: task 0: Exited with exit code 255
srun: launch/slurm: _step_signal: Terminating StepId=3480167.3

Any idea what is happening here? Thanks!

ppanchad-amd commented 1 month ago

Hi @ooreilly. Internal ticket has been created to investigate your issue. Thanks!

darren-amd commented 1 month ago

Hi @ooreilly,

I tried running a simple Fortran example with OpenMP offloading and was unable to reproduce the error on omnitrace-instrument v1.11.2, ROCm 6.2.2, and the GNU Fortran compiler. Could you please provide more information so that I may further investigate:

  1. The Fortran example you are running
  2. The OS, GPU, and ROCm versions of the two systems
  3. The omnitrace version (omnitrace-instrument --version)
  4. The commands you are using to compile the test code and run omnitrace

Also, could you confirm whether the compiled executable runs as expected without omnitrace? Having this information should allow me to help further, thanks!
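As a point of reference, a command sequence along the following lines would cover items 3 and 4. Everything here is an assumption for a Cray/Slurm environment like the one in the log above: the -homp flag enables OpenMP in Cray Fortran, and the omnitrace launcher names and invocation differ between releases, so treat this as a sketch rather than the reporter's actual commands:

```shell
# Report the installed omnitrace version (item 3)
omnitrace-instrument --version

# Compile the Fortran test with the Cray compiler wrapper;
# -homp enables OpenMP (including target offload) in Cray Fortran
ftn -homp -o bandwidth bandwidth.f90

# Sanity check: run the executable without omnitrace first
srun -n 1 ./bandwidth

# Run under omnitrace's sampling launcher via Slurm (item 4)
srun -n 1 omnitrace-sample -- ./bandwidth
```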

ooreilly commented 1 month ago

Hi @darren-amd,

Thanks for investigating. Please point me to the internal ticket (ping Ossian O'Reilly on Teams).

  1. The Fortran test code:

program bandwidth

    use iso_c_binding
    use omp_lib
    implicit none
    !$omp requires unified_shared_memory
    ! Set input array size to be a multiple of the CU count on a single MI250x
    integer, parameter :: n = 110 * 10000000, nthreads = 1024
    integer :: i, j, num_devices, nteams
    double precision :: GB
    double precision, allocatable, dimension(:) :: a, b
    double precision :: t0, t1, elapsed

    allocate(a(n))
    allocate(b(n))

    GB = 1000**3

    call omp_set_default_device(0)
    num_devices = omp_get_num_devices()

    ! Pick a number of teams that is multiple of the CU count
    nteams = 110 * 1000

    a = 1.0

    print *, "n = ", n
    print *, "Data size (read and write):", (c_sizeof(a) + c_sizeof(b)) / GB, "GB"

    t0 = omp_get_wtime()
    !$omp target enter data map(to:a, b)
    t1 = omp_get_wtime()

    elapsed = t1 - t0
    print *, "Initial Map elapsed:", elapsed, " s", " Bandwidth:", ( (c_sizeof(a) + c_sizeof(b)) / GB ) / elapsed, " GB/s"

    do i=1,100

        t0 = omp_get_wtime()
        !$omp target teams distribute parallel do simd num_teams(nteams) thread_limit(nthreads)
        do j=1,n
            b(j) = a(j)
        end do
        t1 = omp_get_wtime()

        elapsed = t1 - t0

        print *, "Elapsed:", elapsed, " s", " Bandwidth:", ( (c_sizeof(a) + c_sizeof(b)) / GB ) / elapsed, " GB/s"

    end do

    !$omp target update from(a,b)

    if (a(n) /= b(n)) then
        print *, "Error: a != b!", a(n), b(n)
    endif

end program
  2. openSUSE 15.4, MI250X, ROCm 5.3. I am not sure what you mean by "two systems".
  3. omnitrace-instrument v1.10.0 (rev: 9de3a6b0b4243bf8ec10164babdd99f64dbc65f2, tag: v1.10.0, compiler: GNU v7.5.0, rocm: v5.3.x)
  4. I don't recall.

Yes, the compiled executable runs as expected without omnitrace.