HIP and OpenMP target composability issue when exchanging memory

ye-luo commented 3 years ago

Memory allocated on the host and then registered via hsa_memory_register or hipHostRegister cannot be access in OpenMP target region with "is_device_ptr(array_ptr)"

I think the fundamental reason is that the OpenMP amdgpu plugin and HIP use different contexts underneath even compiled and linked into one application. This fundamental issue disallows implementing optimization to address #160.

This is not an issue in LLVM OpenMP offload to NVIDIA GPUs because the cuda plugin uses the primary context.

JonChesterfield commented 3 years ago

OpenMP and HIP are both implemented on HSA, so there's nothing fundamentally stopping interop. Can you share a reproducible?

ye-luo commented 3 years ago

stand-alone reproducer attached. reproducer_aomp.zip

JonChesterfield commented 3 years ago

Ah, I see. #pragma omp target .. for is_device_ptr(something-from-hipHostMalloc) concludes that is_device_ptr is false.

That seems right - the pointer is not from openmp's allocator, so isn't an 'openmp device pointer'.

It also seems unhelpful. Instead of 'is_device_ptr' only being true for openmp allocated memory, it would be more useful for it to query kfd to find out if the given address is mapped onto the gpu.

I'll have to look into what is_device_ptr compiles to - I don't think there's a corresponding host plugin api, so there's a risk that openmp is building up it's own table of device pointers, independent to however hip tracks them.

ye-luo commented 3 years ago

My need is beyond is_device_ptr(). I would like map(something-from-hipHostMalloc) to have the maximal transfer efficiency and fully being async. This requires the pointer to be recognized by OpenMP and its underlying driver level runtime as a GPU DMA accessible pointer.

gregrodgers commented 3 years ago

This is an enhancement for HIP interoperability.

gregrodgers commented 3 years ago

Ye, I am looking into this. There are lots of options to consider. Hopefully we can get this working in AOMP 11.11-2.

ye-luo commented 3 years ago

@gregrodgers Thank you for looking into this.

Although the example uses is_device_ptr, my real intention is to have fully asynchronous transfer. Using is_device_ptr is only to check if OpenMP side can handle external memory or not. CUDA/HIP asynchronous memory transfer requires pinned host memory to avoid staging data. So I would like to have the pinned memory allocated from HIP to transfer asynchronously at an OpenMP map clause.

There is also a reverse direction check. Using memory allocated omp_target_alloc in HIP. Hopefully the final solution supports bi-direction exchange memory pointers.

gregrodgers commented 3 years ago

I have had a lot of fun with your reproducer. In the attached reproducer_patch.zip

patch, I remove the is_device_ptr clause and then I add statements to your allocator class following hipHostRegister. These are the statements.

  void* dev_ptr ;
    unsigned flagsPtr;
    hipHostGetFlags(&flagsPtr,pt);
    hipHostGetDevicePointer(&dev_ptr,(void*) pt,flagsPtr);
    omp_target_associate_ptr(pt, dev_ptr, n*sizeof(T), 0, 0);

To get OpenMP to use the array_ptr in the target region, we need to associate the pointer with omp_target_associate_ptr. But we must have a device pointer to do this. So I use hipHostGetDevicePointer to get this value. In the case of your ROCR allocator, I don't believe a device pointer exists yet after the hsa_memory_register. So I use hip to register it just as was done in the HIP allocator. With these changes, the target region can use the host pointers.

Now here is the important question. Should the use of the is_device_ptr clause have the same effect as the extra code I added to your allocator class. In that case we have a bug. Otherwise this issue is just educational for all of us.

ye-luo commented 3 years ago

Ignore the ROCR, I was not even sure I did all the steps right. In the HIP case,

The CUDA version of the code doesn't require those HIP dances probably due to unified virtual address (UVA) since CUDA 4.0. Doesn't AMD have a similar feature? It seems so, there is zero-copy GPU access . Why OpenMP kernel cannot dereference the pointer?
Why AOMP can access a device pointer from a HIP context? Are they sharing a context? If yes, could you point me the source code handling the shared context. If no, it is just working by accident? From security perspective, it should not be allowed.

My AMD software stack knowledge is still limited. I'd like to understand why it doesn't work.

   // where zero-copy cost is OK
  #pragma omp target teams distribute parallel for is_device_ptr(array_ptr)
  for (int i = 0; i < array_size; i++)
  {
    array_ptr[i] += i;
  }

  // where I'd like to manage an explicit copy, this copy must be eventually asynchronous
  // and no staging involved. That is why it uses hipHostRegister to register the host memory.
  #pragma omp target teams distribute parallel for map(always, to: array_ptr[:array_size])
  for (int i = 0; i < array_size; i++)
  {
    array_ptr[i] += i;
  }

I need both cases work simultaneously. Your workaround seems solving only part of the problem.

JonChesterfield commented 3 years ago

Reduced this somewhat. 'is_device_ptr' looks like it works to me, but the kernel still faults when writing.

//////////////////////////////////////////////////////////////////////////////////////
// This file is distributed under the University of Illinois/NCSA Open Source
// License. See LICENSE file in top directory for details.
//
// Copyright (c) 2019 QMCPACK developers.
//
// File developed by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory
//
// File created by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory
//
// Modified by Jon Chesterfield (AMD)
//
//////////////////////////////////////////////////////////////////////////////////////

// $ROCM_DIR/aomp/bin/clang++ -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa \
// -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 -D__HIP_PLATFORM_HCC__ \
// -I$ROCM_DIR/aomp/include/hip -I$ROCM_DIR/aomp/include/hsa -o test_omp_memory_interop_aomp.x \
// test_omp_memory_interop_aomp.cpp  -L$ROCM_DIR/lib -lamdhip64 -lhsa-runtime64

#include "hipError.h"
#include <hip_runtime_api.h>

#include "hsa.h"
#include "hsa_ext_amd.h"

void dump_pointer_info(void *ptr) {
  hsa_amd_pointer_info_t info;
  info.size = sizeof(hsa_amd_pointer_info_t);
  uint32_t accessible;
  hsa_agent_t *agents;
  hsa_status_t rc =
      hsa_amd_pointer_info(ptr, &info, malloc, &accessible, &agents);

  if (rc != HSA_STATUS_SUCCESS) {
    fprintf(stderr, "amd_pointer_info failed, got %x\n", rc);
    exit(1);
  }
  fprintf(stderr, "pointer %p accessible by %u agents\n", ptr, accessible);
  for (uint32_t i = 0; i < accessible; i++) {
    fprintf(stderr, "agent[%u]: %lu\n", i, agents[i].handle);
  }
}

const int array_size = 100;

void test_memory_device_access() {

  void *pt = malloc(array_size * sizeof(int));
  if (!pt) {
    exit(1);
  }

  dump_pointer_info(pt); // accessible by zero

  hipErrorCheck(
      hipHostRegister(pt, array_size * sizeof(int), hipHostRegisterDefault),
      "hipHostRegister failed");

  dump_pointer_info(pt); // accessible by all

  int *array_ptr = (int *)pt;
  array_ptr[0] = 100; // fine

  fprintf(stderr, "host ptr %p\n", array_ptr);

#pragma omp target
  printf("target ptr %p\n", array_ptr); // (nil)

#pragma omp target is_device_ptr(array_ptr)
  {
    printf("in target, ptr = %p\n", array_ptr); // matches host address
    *array_ptr = 42;                            // fault
  }

  fprintf(stderr, "After target %u\n", array_ptr[0]);
}

int main() {
#pragma omp target
  {
    // just load the binary
    asm volatile("//");
  }

  fprintf(stderr, "testing HIP\n");
  test_memory_device_access();
}

Salient output from a run:

pointer 0x8ec110 accessible by 2 agents
host ptr 0x8ec110
target ptr (nil)
in target, ptr = 0x8ec110
[GPU Memory Error] Addr: 0x8ec000 Reason: No Idea!
Memory access fault by GPU node-4 (Agent handle: 0x8770c0) on address 0x8ec000. Reason: Unknown.

Checked manually that the right values are being passed to the kernel. Fault is on the *array_ptr = 42 which lowers to a flat_store_dword. Potentially interesting that the reported address has been rounded down from the dereferenced one, but that might be a property of how hsa reports the error.

Same code works when the memory is allocated from hsa_memory_allocate, instead of malloc + hipHostRegister.

JonChesterfield commented 3 years ago

Not an interop problem, same failure seen from pure hip code.

// $HOME/rocm/aomp/bin/clang++ -x hip --offload-arch=gfx906 -I$HOME/rocm/aomp/include/hip test_hip_memory.hip -o a.out && ./a.out
#include "hipError.h"
#include <hip_runtime_api.h>
#include <stdlib.h>

__global__ void write_kernel(int *d) { *d = 42; }

int main() {
  size_t size = sizeof(int);

  bool use_hip_malloc = true;   // works
  bool use_hip_register = true; // memory access fault by gpu

  if (use_hip_malloc) {
    void *pt;
    hipHostMalloc(&pt, size, 0);
    int *ipt = (int *)pt;
    *ipt = 0;
    write_kernel<<<1, 1>>>(ipt);
    hipDeviceSynchronize();
    fprintf(stderr, "hipHostMalloc: %d\n", *ipt);
  }

  if (use_hip_register) {
    void *pt = aligned_alloc(4096, size);

    hipErrorCheck(hipHostRegister(pt, size, hipHostRegisterDefault),
                  "hipHostRegister failed");

    int *ipt = (int *)pt;
    *ipt = 0;
    write_kernel<<<1, 1>>>(ipt);
    hipDeviceSynchronize();
    fprintf(stderr, "hipHostRegister: %d\n", *ipt);
  }
}

I expect, but have not confirmed, that a rocm hip release will fail the same way.

JonChesterfield commented 3 years ago

See https://github.com/ROCm-Developer-Tools/HIP/issues/2189

It appears this behaviour is intentional. hipHostGetDevicePointer exists and is apparently not a no-op. One could work around by allocating using hsa, instead of hip.

gregrodgers commented 3 years ago

@ye-luo and @JonChesterfield What is the status of this ticket? Lets close this if Ye is able to interoperate with hip using omp_target_associate_ptr.

ye-luo commented 3 years ago

@gregrodgers I think with hipHostGetDevicePointer and omp_target_associate_ptr, I have a workable solution.

ROCm / aomp

HIP and OpenMP target composability issue when exchanging memory #163