Quuxplusone opened 3 years ago
The error log:
Warning: LBR backtrace method is not supported on this platform. DWARF backtrace method will be used.
WARNING: The command line includes a target application therefore the CPU context-switch scope has been set to process-tree.
Collecting data...
Libomptarget --> Init target library!
Libomptarget --> Call to omp_get_num_devices returning 0
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library 'libomptarget.rtl.ppc64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ppc64.so': libomptarget.rtl.ppc64.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.x86_64.so'...
Libomptarget --> Successfully loaded library 'libomptarget.rtl.x86_64.so'!
Libomptarget --> Registering RTL libomptarget.rtl.x86_64.so supporting 4 devices!
Libomptarget --> Loading library 'libomptarget.rtl.cuda.so'...
Target CUDA RTL --> Start initializing CUDA
Libomptarget --> Successfully loaded library 'libomptarget.rtl.cuda.so'!
Libomptarget --> Registering RTL libomptarget.rtl.cuda.so supporting 4 devices!
Libomptarget --> Loading library 'libomptarget.rtl.aarch64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so': libomptarget.rtl.aarch64.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.ve.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ve.so': libomptarget.rtl.ve.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.amdgpu.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.amdgpu.so': libomptarget.rtl.amdgpu.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.rpc.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.rpc.so': libomptarget.rtl.rpc.so: cannot open shared object file: No such file or directory!
Libomptarget --> RTLs loaded!
Libomptarget --> Image 0x0000000000402100 is NOT compatible with RTL libomptarget.rtl.x86_64.so!
Libomptarget --> Image 0x0000000000402100 is compatible with RTL libomptarget.rtl.cuda.so!
Libomptarget --> RTL 0x00000000009d3ac0 has index 0!
Libomptarget --> Registering image 0x0000000000402100 with RTL libomptarget.rtl.cuda.so!
Libomptarget --> Done registering entries!
Libomptarget --> Entering target region with entry point 0x0000000000402028 and device Id -1
Libomptarget --> Call to omp_get_num_devices returning 4
Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget --> Use default device id 0
Libomptarget --> Call to omp_get_num_devices returning 4
Libomptarget --> Call to omp_get_num_devices returning 4
Libomptarget --> Call to omp_get_initial_device returning 4
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target CUDA RTL --> Init requires flags to 1
Target CUDA RTL --> Getting device 0
Target CUDA RTL --> The primary context is inactive, set its flags to CU_CTX_SCHED_BLOCKING_SYNC
Target CUDA RTL --> Max CUDA blocks per grid 2147483647 exceeds the hard team limit 65536, capping at the hard limit
Target CUDA RTL --> Using 1024 CUDA threads per block
Target CUDA RTL --> Using warp size 32
Target CUDA RTL --> Device supports up to 65536 CUDA blocks and 1024 threads with a warp size of 32
Target CUDA RTL --> Default number of teams set according to library's default 128
Target CUDA RTL --> Default number of threads set according to library's default 128
Libomptarget --> Device 0 is ready to use.
Target CUDA RTL --> Load data from image 0x0000000000402100
Target CUDA RTL --> CUDA module successfully loaded!
Target CUDA RTL --> Entry point 0x0000000000000000 maps to __omp_offloading_3a_8057b0f2__Z4funcPi_l11 (0x00000000034508b0)
Target CUDA RTL --> Entry point 0x0000000000000001 maps to __omp_offloading_3a_8057b0f2_main_l26 (0x00000000032f79b0)
Target CUDA RTL --> Entry point 0x0000000000000002 maps to __omp_offloading_3a_8057b0f2_main_l40 (0x00000000032fb5a0)
Target CUDA RTL --> Finding global device environment 'omptarget_device_environment' - symbol missing.
Target CUDA RTL --> Continue, considering this is a device RTL which does not accept environment setting.
Libomptarget --> Entry 0: Base=0x00007ffd9ddd39e0, Begin=0x00007ffd9ddd39e0, Size=4096, Type=0x223, Name=unknown
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffd9ddd39e0, Size=4096)...
Target CUDA RTL --> MemoryManagerTy::allocate: size 4096 with host pointer 0x0000000000000000.
Target CUDA RTL --> findBucket: Size 4096 is floored to 4096.
Target CUDA RTL --> Cannot find a node in the FreeLists. Allocate on device.
Target CUDA RTL --> Node address 0x0000000003426eb0, target pointer 0x000014c47da00000, size 4096
Libomptarget --> Creating new map entry: HstBase=0x00007ffd9ddd39e0, HstBegin=0x00007ffd9ddd39e0, HstEnd=0x00007ffd9ddd49e0, TgtBegin=0x000014c47da00000
Libomptarget --> There are 4096 bytes allocated at target address 0x000014c47da00000 - is new
Libomptarget --> Moving 4096 bytes (hst:0x00007ffd9ddd39e0) -> (tgt:0x000014c47da00000)
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffd9ddd39e0, Size=4096)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffd9ddd39e0, TgtPtrBegin=0x000014c47da00000, Size=4096, RefCount=1
Libomptarget --> Obtained target argument 0x000014c47da00000 from host pointer 0x00007ffd9ddd39e0
Libomptarget --> Launching target execution __omp_offloading_3a_8057b0f2_main_l26 with pointer 0x0000000003425d30 (index=1).
Target CUDA RTL --> Setting CUDA threads per block to requested 1
Target CUDA RTL --> Adding master warp: +32 threads
Target CUDA RTL --> Using requested number of teams 1
Target CUDA RTL --> Launching kernel __omp_offloading_3a_8057b0f2_main_l26 with 1 blocks and 33 threads in Generic mode
Target CUDA RTL --> Launch of entry point at 0x0000000003425d30 successful!
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffd9ddd39e0, Size=4096)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffd9ddd39e0, TgtPtrBegin=0x000014c47da00000, Size=4096, updated RefCount=1
Libomptarget --> There are 4096 bytes allocated at target address 0x000014c47da00000 - is last
Libomptarget --> Moving 4096 bytes (tgt:0x000014c47da00000) -> (hst:0x00007ffd9ddd39e0)
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffd9ddd39e0, Size=4096)...
Libomptarget --> Deleting tgt data 0x000014c47da00000 of size 4096
Target CUDA RTL --> MemoryManagerTy::free: target memory 0x000014c47da00000.
Target CUDA RTL --> findBucket: Size 4096 is floored to 4096.
Target CUDA RTL --> Found its node 0x0000000003426eb0. Insert it to bucket 11.
Libomptarget --> Removing mapping with HstPtrBegin=0x00007ffd9ddd39e0, TgtPtrBegin=0x000014c47da00000, Size=4096
Libomptarget --> Entering target region with entry point 0x0000000000402008 and device Id -1
Libomptarget --> Use default device id 0
Libomptarget --> Call to omp_get_num_devices returning 4
Libomptarget --> Call to omp_get_num_devices returning 4
Libomptarget --> Call to omp_get_initial_device returning 4
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 1
Libomptarget --> Device 0 is ready to use.
Libomptarget --> Entry 0: Base=0x00007ffd9ddd39e0, Begin=0x00007ffd9ddd39e0, Size=0, Type=0x220, Name=unknown
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffd9ddd39e0, Size=0)...
Libomptarget --> There are 0 bytes allocated at target address 0x0000000000000000 - is not new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffd9ddd39e0, Size=0)...
Libomptarget --> Obtained target argument 0x0000000000000000 from host pointer 0x00007ffd9ddd39e0
Libomptarget --> Launching target execution __omp_offloading_3a_8057b0f2__Z4funcPi_l11 with pointer 0x00000000032f5ef0 (index=0).
Target CUDA RTL --> Setting CUDA threads per block to requested 1
Target CUDA RTL --> Adding master warp: +32 threads
Target CUDA RTL --> Using requested number of teams 1
Target CUDA RTL --> Launching kernel __omp_offloading_3a_8057b0f2__Z4funcPi_l11 with 1 blocks and 33 threads in Generic mode
Target CUDA RTL --> Launch of entry point at 0x00000000032f5ef0 successful!
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffd9ddd39e0, Size=0)...
Libomptarget --> There are 0 bytes allocated at target address 0x0000000000000000 - is not last
Target CUDA RTL --> Error when synchronizing stream. stream = 0x0000000000a9b9b0, async info ptr = 0x00007ffd9ddd38a0
Target CUDA RTL --> CUDA error is: an illegal memory access was encountered
Libomptarget --> Call to targetDataEnd failed, abort target.
Libomptarget --> Failed to process data after launching the kernel.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
The target application terminated with signal 6 (SIGABRT)
Processing events...
Saving temporary "/tmp/nsys-report-7c7b-d5a8-db0d-9486.qdstrm" file to disk...
Creating final output files...
Processing [==============================================================100%]
Saved report file to "/tmp/nsys-report-7c7b-d5a8-db0d-9486.qdrep"
Report file moved to "/p/project/cjzam11/kitayama1/work/report11.qdrep"
JURECA-DC compute nodes run a CentOS 8 kernel, and in its config I see:
CONFIG_ARCH_HAS_HMM=y
CONFIG_MIGRATE_VMA_HELPER=y
CONFIG_DEV_PAGEMAP_OPS=y
CONFIG_HMM=y
CONFIG_HMM_MIRROR=y
However, one of the admins says that, after contacting NVIDIA support, the maturity of the HMM feature is unknown or minimal.
$ uname -a
Linux jrlogin07.jureca 4.18.0-193.14.2.el8_2.x86_64 #1 SMP Sun Jul 26 03:54:29 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
I am wondering whether HMM has architecture-dependent code; if so, I believe Luo's observation that implicit USM works may only hold on POWER.
Attached: toshi4.cpp (785 bytes, text/x-csrc), a test program in C++.
The attached test code fails at runtime on an A100 target. The Linux kernel has HMM enabled, which is necessary for USM to work.