Quuxplusone opened 3 years ago
The error log:
Warning: LBR backtrace method is not supported on this platform. DWARF backtrace method will be used.
WARNING: The command line includes a target application therefore the CPU context-switch scope has been set to process-tree.
Collecting data...
Libomptarget --> Init target library!
Libomptarget --> Call to omp_get_num_devices returning 0
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library 'libomptarget.rtl.ppc64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ppc64.so': libomptarget.rtl.ppc64.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.x86_64.so'...
Libomptarget --> Successfully loaded library 'libomptarget.rtl.x86_64.so'!
Libomptarget --> Registering RTL libomptarget.rtl.x86_64.so supporting 4 devices!
Libomptarget --> Loading library 'libomptarget.rtl.cuda.so'...
Target CUDA RTL --> Start initializing CUDA
Libomptarget --> Successfully loaded library 'libomptarget.rtl.cuda.so'!
Libomptarget --> Registering RTL libomptarget.rtl.cuda.so supporting 4 devices!
Libomptarget --> Loading library 'libomptarget.rtl.aarch64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so': libomptarget.rtl.aarch64.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.ve.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ve.so': libomptarget.rtl.ve.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.amdgpu.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.amdgpu.so': libomptarget.rtl.amdgpu.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.rpc.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.rpc.so': libomptarget.rtl.rpc.so: cannot open shared object file: No such file or directory!
Libomptarget --> RTLs loaded!
Libomptarget --> Image 0x0000000000402100 is NOT compatible with RTL libomptarget.rtl.x86_64.so!
Libomptarget --> Image 0x0000000000402100 is compatible with RTL libomptarget.rtl.cuda.so!
Libomptarget --> RTL 0x00000000009d3ac0 has index 0!
Libomptarget --> Registering image 0x0000000000402100 with RTL libomptarget.rtl.cuda.so!
Libomptarget --> Done registering entries!
Libomptarget --> Entering target region with entry point 0x0000000000402028 and device Id -1
Libomptarget --> Call to omp_get_num_devices returning 4
Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget --> Use default device id 0
Libomptarget --> Call to omp_get_num_devices returning 4
Libomptarget --> Call to omp_get_num_devices returning 4
Libomptarget --> Call to omp_get_initial_device returning 4
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target CUDA RTL --> Init requires flags to 1
Target CUDA RTL --> Getting device 0
Target CUDA RTL --> The primary context is inactive, set its flags to CU_CTX_SCHED_BLOCKING_SYNC
Target CUDA RTL --> Max CUDA blocks per grid 2147483647 exceeds the hard team limit 65536, capping at the hard limit
Target CUDA RTL --> Using 1024 CUDA threads per block
Target CUDA RTL --> Using warp size 32
Target CUDA RTL --> Device supports up to 65536 CUDA blocks and 1024 threads with a warp size of 32
Target CUDA RTL --> Default number of teams set according to library's default 128
Target CUDA RTL --> Default number of threads set according to library's default 128
Libomptarget --> Device 0 is ready to use.
Target CUDA RTL --> Load data from image 0x0000000000402100
Target CUDA RTL --> CUDA module successfully loaded!
Target CUDA RTL --> Entry point 0x0000000000000000 maps to __omp_offloading_3a_8057b0f2__Z4funcPi_l11 (0x00000000034508b0)
Target CUDA RTL --> Entry point 0x0000000000000001 maps to __omp_offloading_3a_8057b0f2_main_l26 (0x00000000032f79b0)
Target CUDA RTL --> Entry point 0x0000000000000002 maps to __omp_offloading_3a_8057b0f2_main_l40 (0x00000000032fb5a0)
Target CUDA RTL --> Finding global device environment 'omptarget_device_environment' - symbol missing.
Target CUDA RTL --> Continue, considering this is a device RTL which does not accept environment setting.
Libomptarget --> Entry 0: Base=0x00007ffd9ddd39e0, Begin=0x00007ffd9ddd39e0, Size=4096, Type=0x223, Name=unknown
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffd9ddd39e0, Size=4096)...
Target CUDA RTL --> MemoryManagerTy::allocate: size 4096 with host pointer 0x0000000000000000.
Target CUDA RTL --> findBucket: Size 4096 is floored to 4096.
Target CUDA RTL --> Cannot find a node in the FreeLists. Allocate on device.
Target CUDA RTL --> Node address 0x0000000003426eb0, target pointer 0x000014c47da00000, size 4096
Libomptarget --> Creating new map entry: HstBase=0x00007ffd9ddd39e0, HstBegin=0x00007ffd9ddd39e0, HstEnd=0x00007ffd9ddd49e0, TgtBegin=0x000014c47da00000
Libomptarget --> There are 4096 bytes allocated at target address 0x000014c47da00000 - is new
Libomptarget --> Moving 4096 bytes (hst:0x00007ffd9ddd39e0) -> (tgt:0x000014c47da00000)
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffd9ddd39e0, Size=4096)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffd9ddd39e0, TgtPtrBegin=0x000014c47da00000, Size=4096, RefCount=1
Libomptarget --> Obtained target argument 0x000014c47da00000 from host pointer 0x00007ffd9ddd39e0
Libomptarget --> Launching target execution __omp_offloading_3a_8057b0f2_main_l26 with pointer 0x0000000003425d30 (index=1).
Target CUDA RTL --> Setting CUDA threads per block to requested 1
Target CUDA RTL --> Adding master warp: +32 threads
Target CUDA RTL --> Using requested number of teams 1
Target CUDA RTL --> Launching kernel __omp_offloading_3a_8057b0f2_main_l26 with 1 blocks and 33 threads in Generic mode
Target CUDA RTL --> Launch of entry point at 0x0000000003425d30 successful!
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffd9ddd39e0, Size=4096)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffd9ddd39e0, TgtPtrBegin=0x000014c47da00000, Size=4096, updated RefCount=1
Libomptarget --> There are 4096 bytes allocated at target address 0x000014c47da00000 - is last
Libomptarget --> Moving 4096 bytes (tgt:0x000014c47da00000) -> (hst:0x00007ffd9ddd39e0)
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffd9ddd39e0, Size=4096)...
Libomptarget --> Deleting tgt data 0x000014c47da00000 of size 4096
Target CUDA RTL --> MemoryManagerTy::free: target memory 0x000014c47da00000.
Target CUDA RTL --> findBucket: Size 4096 is floored to 4096.
Target CUDA RTL --> Found its node 0x0000000003426eb0. Insert it to bucket 11.
Libomptarget --> Removing mapping with HstPtrBegin=0x00007ffd9ddd39e0, TgtPtrBegin=0x000014c47da00000, Size=4096
Libomptarget --> Entering target region with entry point 0x0000000000402008 and device Id -1
Libomptarget --> Use default device id 0
Libomptarget --> Call to omp_get_num_devices returning 4
Libomptarget --> Call to omp_get_num_devices returning 4
Libomptarget --> Call to omp_get_initial_device returning 4
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 1
Libomptarget --> Device 0 is ready to use.
Libomptarget --> Entry 0: Base=0x00007ffd9ddd39e0, Begin=0x00007ffd9ddd39e0, Size=0, Type=0x220, Name=unknown
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffd9ddd39e0, Size=0)...
Libomptarget --> There are 0 bytes allocated at target address 0x0000000000000000 - is not new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffd9ddd39e0, Size=0)...
Libomptarget --> Obtained target argument 0x0000000000000000 from host pointer 0x00007ffd9ddd39e0
Libomptarget --> Launching target execution __omp_offloading_3a_8057b0f2__Z4funcPi_l11 with pointer 0x00000000032f5ef0 (index=0).
Target CUDA RTL --> Setting CUDA threads per block to requested 1
Target CUDA RTL --> Adding master warp: +32 threads
Target CUDA RTL --> Using requested number of teams 1
Target CUDA RTL --> Launching kernel __omp_offloading_3a_8057b0f2__Z4funcPi_l11 with 1 blocks and 33 threads in Generic mode
Target CUDA RTL --> Launch of entry point at 0x00000000032f5ef0 successful!
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffd9ddd39e0, Size=0)...
Libomptarget --> There are 0 bytes allocated at target address 0x0000000000000000 - is not last
Target CUDA RTL --> Error when synchronizing stream. stream = 0x0000000000a9b9b0, async info ptr = 0x00007ffd9ddd38a0
Target CUDA RTL --> CUDA error is: an illegal memory access was encountered
Libomptarget --> Call to targetDataEnd failed, abort target.
Libomptarget --> Failed to process data after launching the kernel.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
The target application terminated with signal 6 (SIGABRT)
Processing events...
Saving temporary "/tmp/nsys-report-7c7b-d5a8-db0d-9486.qdstrm" file to disk...
Creating final output files...
Processing [==============================================================100%]
Saved report file to "/tmp/nsys-report-7c7b-d5a8-db0d-9486.qdrep"
Report file moved to "/p/project/cjzam11/kitayama1/work/report11.qdrep"
JURECA-DC compute nodes run a CentOS 8 kernel, and in its config I see:
CONFIG_ARCH_HAS_HMM=y
CONFIG_MIGRATE_VMA_HELPER=y
CONFIG_DEV_PAGEMAP_OPS=y
CONFIG_HMM=y
CONFIG_HMM_MIRROR=y
However, one of the admins says that, after contacting NVIDIA support, the maturity of the HMM feature is unknown or minimal.
$ uname -a
Linux jrlogin07.jureca 4.18.0-193.14.2.el8_2.x86_64 #1 SMP Sun Jul 26 03:54:29 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
I am wondering whether HMM has architecture-dependent code; if so, I believe Luo's observation that implicit USM works may only hold on POWER.
Attached: toshi4.cpp (785 bytes, text/x-csrc), a test program in C++.
The attached test code fails at runtime on an A100 target. The Linux kernel has HMM enabled, which is necessary for USM to work.