ROCm / aomp

AOMP is an open source Clang/LLVM based compiler with added support for the OpenMP® API on Radeon™ GPUs. Use this repository for releases, issues, documentation, packaging, and examples.
https://github.com/ROCm/aomp
Apache License 2.0

Unable to run example AOMP program on V520 #193

Open drajarshi opened 3 years ago

drajarshi commented 3 years ago

I am trying to run an OpenMP program on an instance with an AMD EPYC 7R32 CPU and a V520 GPU. This is on an AWS shared instance.

I installed AOMP 11.12.0 and the ROCm dependencies.

However, when I try to compile and run the veccopy example under the AOMP install folder, I get the following error:

[ec2-user@ip-172-31-42-182 veccopy]$ sudo make run
Makefile:28: AOMP not found at /root/rocm/aomp
/usr/lib/aomp/bin/clang -O3 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx900 veccopy.c -o veccopy
./veccopy
[/root/git/aomp11/amd-llvm-project/openmp/libomptarget/plugins/amdgpu/impl/system.cpp:515] Initializing the hsa runtime failed: HSA_STATUS_ERROR_OUT_OF_RESOURCES
make: *** [run] Error 1

I am unable to figure out the meaning of the above error and how to fix it.

Then I modified the Makefile to specify the GPU as gfx1011, the device type for the V520 (the last line of the output below):

[ec2-user@ip-172-31-42-182 veccopy]$ grep AOMP_GPU Makefile
.......................
INSTALLED_GPU = $(shell $(AOMP)/bin/mygpu -d gfx900)# Default AOMP_GPU is gfx900 which is vega
AOMP_GPU ?= $(INSTALLED_GPU)
AOMP_GPU = gfx1011 # for the V520 device

......................

......................

[ec2-user@ip-172-31-42-182 veccopy]$ sudo make run
Makefile:28: AOMP not found at /root/rocm/aomp
/usr/lib/aomp/bin/clang -O3 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx1011 veccopy.c -o veccopy
clang-11: error: no such file or directory: 'libomptarget-amdgcn-gfx1011.bc'
clang-11: error: no such file or directory: 'libaompextras-amdgcn-gfx1011.bc'
make: *** [veccopy] Error 1

The bitcode file for gfx1011 is not available in the rocm install folder.

[ec2-user@ip-172-31-42-182 veccopy]$ find / -name libomptarget-amdgcn* 2>/dev/null
/usr/lib/aomp_11.12-0/lib/libdevice/libomptarget-amdgcn-gfx700.bc
/usr/lib/aomp_11.12-0/lib/libdevice/libomptarget-amdgcn-gfx701.bc
/usr/lib/aomp_11.12-0/lib/libdevice/libomptarget-amdgcn-gfx801.bc
/usr/lib/aomp_11.12-0/lib/libdevice/libomptarget-amdgcn-gfx803.bc
/usr/lib/aomp_11.12-0/lib/libdevice/libomptarget-amdgcn-gfx900.bc
/usr/lib/aomp_11.12-0/lib/libdevice/libomptarget-amdgcn-gfx902.bc
/usr/lib/aomp_11.12-0/lib/libdevice/libomptarget-amdgcn-gfx906.bc
/usr/lib/aomp_11.12-0/lib/libdevice/libomptarget-amdgcn-gfx908.bc

The same list above shows under /opt/rocm-4.0.0/llvm/lib/ as well.
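The missing-file errors can be anticipated before compiling by checking which device runtime bitcode files are installed. The following is a minimal sketch, assuming only the `libomptarget-amdgcn-<arch>.bc` naming visible in the `find` output; `list_devicertl_archs` and `has_devicertl` are hypothetical helper names, not part of AOMP or ROCm.

```shell
# Hypothetical helpers (not part of AOMP): inspect a libdevice directory for
# device runtime bitcode, assuming the libomptarget-amdgcn-<arch>.bc naming.

# Print one architecture name per line, e.g. gfx900.
list_devicertl_archs() {
    libdevice_dir=$1
    for f in "$libdevice_dir"/libomptarget-amdgcn-*.bc; do
        [ -e "$f" ] || continue              # the glob matched nothing
        name=${f##*/}                        # strip the directory part
        name=${name#libomptarget-amdgcn-}    # strip the common prefix
        printf '%s\n' "${name%.bc}"          # strip the .bc suffix
    done
}

# Succeed (exit status 0) if bitcode for the given architecture is present.
has_devicertl() {
    [ -f "$1/libomptarget-amdgcn-$2.bc" ]
}
```

For example, `has_devicertl /usr/lib/aomp_11.12-0/lib/libdevice gfx1011 || echo "no device runtime for gfx1011"` would report the gap before clang does.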

Here's my rocm install list:

[ec2-user@ip-172-31-42-182 veccopy]$ rpm -qa | grep rocm
rocm-dbgapi-0.42.0.40000-23.el7.x86_64
rocm-opencl-devel-3.6Beta_17_g875c1f8_rocm_rel_4.0_23-1.x86_64
rocm-device-libs-1.0.0.637_rocm_rel_4.0_23_db8c0c3-1.x86_64
rocm-gdb-10.1_rocm_rel_4.0_23-1.x86_64
hsa-rocr-dev-1.2.40000.0_rocm_rel_4.0_23_a5173c90-1.x86_64
rocminfo-1.40000.0-1.x86_64
rocm-opencl-3.6Beta_17_g875c1f8_rocm_rel_4.0_23-1.x86_64
rocm-clang-ocl-0.5.0.64_rocm_rel_4.0_23_50fb51a-1.x86_64
rocm-smi-lib64-2.9.0.9_rocm_rel_4.0_23_4b49d2d-1.x86_64
rocm-cmake-0.3.0.153_rocm_rel_4.0_23_1d1caa5-1.x86_64
rocm-dkms-4.0.0.40000-23.el7.x86_64
comgr-1.9.0.194_rocm_rel_4.0_23_0fa438b-1.x86_64
rocm-utils-4.0.0.40000-23.el7.x86_64
rocm-smi-3.8.0-1.el7.noarch
rocm-dev-4.0.0.40000-23.el7.x86_64

Please suggest how to get the OpenMP examples to run successfully on the V520 GPU.

Thanks in advance.

Regards,

Rajarshi Das

drajarshi commented 3 years ago

I subsequently thought it might be due to both AOMP 11.12.0 (based on ROCm 3.10) and ROCm 4.0.0 being installed. Hence, I did a fresh install of ROCm 4.0.0 on a separate identical AWS instance.

In the /opt/rocm/llvm/examples/veccopy/ folder, I modified the Makefile with the following variable settings:

AOMP_GPU=gfx900
OFFLOAD_DEBUG=1

Subsequently, I see the following output:

$ sudo make run
Makefile:28: AOMP not found at /root/rocm/aomp
DEBUG Mode ON
LIBOMPTARGET_DEBUG=1 ./veccopy
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library '/opt/rocm/llvm/lib-debug/libomptarget.rtl.x86_64.so'...
Libomptarget --> Successfully loaded library '/opt/rocm/llvm/lib-debug/libomptarget.rtl.x86_64.so'!
Libomptarget --> Registering RTL libomptarget.rtl.x86_64.so supporting 4 devices!
Libomptarget --> Loading library '/opt/rocm/llvm/lib-debug/libomptarget.rtl.hsa.so'...
Target HSA RTL --> Start initializing HSA-ATMI
Target HSA RTL --> There are 1 devices supporting HSA.
Target HSA RTL --> Device 0: Initial groupsPerDevice 128 & threadsPerGroup 256
Libomptarget --> Successfully loaded library '/opt/rocm/llvm/lib-debug/libomptarget.rtl.hsa.so'!
Libomptarget --> Registering RTL libomptarget.rtl.hsa.so supporting 1 devices!
Libomptarget --> RTLs loaded!
Libomptarget --> Image 0x0000000000400ec0 is NOT compatible with RTL libomptarget.rtl.x86_64.so!
Libomptarget --> Image 0x0000000000400ec0 is compatible with RTL libomptarget.rtl.hsa.so!
Libomptarget --> RTL 0x00000000015809b0 has index 0!
Libomptarget --> Registering image 0x0000000000400ec0 with RTL libomptarget.rtl.hsa.so!
Libomptarget --> Done registering entries!
Libomptarget --> Call to omp_get_numdevices returning 1
Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget --> Entering target region with entry point 0x0000000000400e50 and device Id -1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target HSA RTL --> Init requires flags to 1
Target HSA RTL --> Initialize the device id: 0
Target HSA RTL --> Using 36 compute unis per grid
Target HSA RTL --> Using 1024 ROCm blocks per grid
Target HSA RTL --> Capped thread limit: 1024
Target HSA RTL --> Queried wavefront size: 32
Target HSA RTL --> Default number of teams set according to library's default 128
Target HSA RTL --> Default number of threads set according to library's default 256
Target HSA RTL --> Device 0: default limit for groupsPerDevice 1024 & threadsPerGroup 1024
Target HSA RTL --> Device 0: wavefront size 32, total threads 1024 x 1024 = 1048576
Libomptarget --> Device 0 is ready to use.
Target HSA RTL --> "Module registering" failed
Possible gpu arch mismatch: gfx1011, please check compiler: -march= flag
Libomptarget --> Unable to generate entries table for device id 0.
Libomptarget --> Failed to init globals on device 0
Libomptarget --> Failed to get device 0 ready
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
make: *** [run] Aborted

What does the message 'Target HSA RTL --> "Module registering" failed' refer to? The next line indicates a possible GPU arch mismatch, since the GPU ID of the V520 is gfx1011 while I built the code for gfx900. The mygpu program (/opt/rocm/bin/mygpu) returns unknown, because the gputable.txt in the bin/ folder does not have an entry for gfx1011. If I instead set the variable AOMP_GPU=gfx1011 in the Makefile, the build step fails:

clang-12: error: no such file or directory: 'libomptarget-amdgcn-gfx1011.bc'
clang-12: error: no such file or directory: 'libaompextras-amdgcn-gfx1011.bc'
clang-12: error: no such file or directory: 'libm-amdgcn-gfx1011.bc'
make: *** [veccopy] Error 1

Is it possible to generate a gfx1011.bc from an existing .bc, such as a gfx900.bc, in order to get the veccopy example to build and run?
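Since mygpu returns unknown here, the architecture the runtime itself reports can be read from the "Possible gpu arch mismatch" line of the debug output. The sketch below is a hypothetical helper, `reported_gpu_arch`, that relies only on the wording of that one log line as it appears above; a different runtime version may phrase it differently.

```shell
# Hypothetical helper: read debug output on stdin and print the architecture
# named in the first "Possible gpu arch mismatch: <arch>, ..." line, if any.
# Assumes the message wording shown in the log above.
reported_gpu_arch() {
    sed -n 's/.*Possible gpu arch mismatch: \([A-Za-z0-9]*\),.*/\1/p' | head -n 1
}
```

It could be used as `LIBOMPTARGET_DEBUG=1 ./veccopy 2>&1 | reported_gpu_arch`, which for the run above would print gfx1011.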

Thanks.

JonChesterfield commented 3 years ago

OpenMP does not yet support gfx10. You could create the corresponding gfx1011.bc file by adding the number to the devicertl cmake file, but the end result will not work correctly. I'll ping the team with this, see if we can raise the priority of gfx10 implementation.

drajarshi commented 3 years ago

Thanks @JonChesterfield for your comments. I didn't quite follow your suggestion about modifying the devicertl cmake file, so I tried the approach below. I copied libomptarget-amdgcn-gfx900.bc, got the .ll from it, and replaced the gfx900 string with gfx1011 in the attributes. I saw features like +gfx9-insts but didn't add +gfx10-insts since I wasn't sure about it. I then set the Module ID to gfx1011 as well and assembled the file again with:

$ llvm-as <.ll>

I then placed the .bc in the respective folders. This time, 'sudo make run' for the veccopy example completed, and I saw the following output:

[ec2-user@ip-172-31-42-182 veccopy]$ sudo make
Makefile:28: AOMP not found at /root/rocm/aomp
DEBUG Mode ON
env LIBRARY_PATH=/usr/lib/aomp/lib-debug /usr/lib/aomp/bin/clang -O3 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx1011 veccopy.c -o veccopy
[ec2-user@ip-172-31-42-182 veccopy]$ sudo make run
Makefile:28: AOMP not found at /root/rocm/aomp
DEBUG Mode ON
LIBOMPTARGET_DEBUG=1 ./veccopy
Libomptarget --> Init target library!
ompt_pre_init(): tool_setting = 1
ompt_pre_init(): ompt_enabled = 0
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library '/usr/lib/aomp/lib-debug/libomptarget.rtl.x86_64.so'...
Libomptarget --> Successfully loaded library '/usr/lib/aomp/lib-debug/libomptarget.rtl.x86_64.so'!
Libomptarget --> Registering RTL libomptarget.rtl.x86_64.so supporting 4 devices!
Libomptarget --> Loading library '/usr/lib/aomp/lib-debug/libomptarget.rtl.amdgpu.so'...
Target AMDGPU RTL --> Start initializing HSA-ATMI
Target AMDGPU RTL --> There are 1 devices supporting HSA.
Target AMDGPU RTL --> Device 0: Initial groupsPerDevice 128 & threadsPerGroup 256
Libomptarget --> Successfully loaded library '/usr/lib/aomp/lib-debug/libomptarget.rtl.amdgpu.so'!
Libomptarget --> Registering RTL libomptarget.rtl.amdgpu.so supporting 1 devices!
Libomptarget --> RTLs loaded!
Libomptarget --> Image 0x0000000000400ee0 is NOT compatible with RTL libomptarget.rtl.x86_64.so!
Libomptarget --> Image 0x0000000000400ee0 is compatible with RTL libomptarget.rtl.amdgpu.so!
Libomptarget --> RTL 0x00000000016f2840 has index 0!
Libomptarget --> Registering image 0x0000000000400ee0 with RTL libomptarget.rtl.amdgpu.so!
Libomptarget --> Done registering entries!
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget --> Entering target region with entry point 0x0000000000400e70 and device Id -1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target AMDGPU RTL --> Init requires flags to 1
Target AMDGPU RTL --> Initialize the device id: 0
Target AMDGPU RTL --> Using 36 compute unis per grid
Target AMDGPU RTL --> Using 1024 ROCm blocks per grid
Target AMDGPU RTL --> Capped thread limit: 1024
Target AMDGPU RTL --> Queried wavefront size: 32
Target AMDGPU RTL --> Default number of teams = 1 * number of compute units 36
Target AMDGPU RTL --> Default number of threads set according to library's default 256
Target AMDGPU RTL --> Device 0: default limit for groupsPerDevice 1024 & threadsPerGroup 1024
Target AMDGPU RTL --> Device 0: wavefront size 32, total threads 1024 x 1024 = 1048576
Libomptarget --> Device 0 is ready to use.
Target AMDGPU RTL --> Setting global device environment 12 bytes
Target AMDGPU RTL --> "Module registering" succeeded
Target AMDGPU RTL --> ATMI module successfully loaded!
Target AMDGPU RTL --> to find the kernel name: __omp_offloading_10302_140b27f_main_l18 size: 39
Target AMDGPU RTL --> KernDescVal size 8 does not match advertized size 7 for '__omp_offloading_10302_140b27f_main_l18_kern_desc'
Target AMDGPU RTL --> After loading global for __omp_offloading_10302_140b27f_main_l18_kern_desc KernDesc
Target AMDGPU RTL --> KernDesc: Version: 2
Target AMDGPU RTL --> KernDesc: TSize: 7
Target AMDGPU RTL --> KernDesc: WG_Size: 0
Target AMDGPU RTL --> KernDesc: Mode: 0
Target AMDGPU RTL --> ExecModeVal 0
Target AMDGPU RTL --> Setting KernDescVal.WG_Size to default 256
Target AMDGPU RTL --> WGSizeVal 256
Target AMDGPU RTL --> "Loading KernDesc computation property" succeeded
Target AMDGPU RTL --> Construct kernelinfo: ExecMode 0
Target AMDGPU RTL --> Entry point 0 maps to __omp_offloading_10302_140b27f_main_l18
Libomptarget --> Entry 0: Base=0x00000000000186a0, Begin=0x00000000000186a0, Size=4, Type=0x320
Libomptarget --> Entry 1: Base=0x00000000000186a0, Begin=0x00000000000186a0, Size=8, Type=0x320
Libomptarget --> Entry 2: Base=0x00007ffe39ce7900, Begin=0x00007ffe39ce7900, Size=400000, Type=0x22
Libomptarget --> Entry 3: Base=0x00000000000186a0, Begin=0x00000000000186a0, Size=8, Type=0x320
Libomptarget --> Entry 4: Base=0x00007ffe39c85e80, Begin=0x00007ffe39c85e80, Size=400000, Type=0x21
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffe39ce7900, Size=400000)...
Target AMDGPU RTL --> Tgt alloc data 400000 bytes, (tgt:00007f41fd406000).
Libomptarget --> Creating new map entry: HstBase=0x00007ffe39ce7900, HstBegin=0x00007ffe39ce7900, HstEnd=0x00007ffe39d49380, TgtBegin=0x00007f41fd406000
Libomptarget --> There are 400000 bytes allocated at target address 0x00007f41fd406000 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffe39c85e80, Size=400000)...
Target AMDGPU RTL --> Tgt alloc data 400000 bytes, (tgt:00007f41fd468000).
Libomptarget --> Creating new map entry: HstBase=0x00007ffe39c85e80, HstBegin=0x00007ffe39c85e80, HstEnd=0x00007ffe39ce7900, TgtBegin=0x00007f41fd468000
Libomptarget --> There are 400000 bytes allocated at target address 0x00007f41fd468000 - is new
Libomptarget --> Moving 400000 bytes (hst:0x00007ffe39c85e80) -> (tgt:0x00007f41fd468000)
Target AMDGPU RTL --> Submit data 400000 bytes, (hst:00007ffe39c85e80) -> (tgt:00007f41fd468000).
Libomptarget --> Forwarding first-private value 0x00000000000186a0 to the target construct
Libomptarget --> Forwarding first-private value 0x00000000000186a0 to the target construct
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffe39ce7900, Size=400000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffe39ce7900, TgtPtrBegin=0x00007f41fd406000, Size=400000, RefCount=1
Libomptarget --> Obtained target argument 0x00007f41fd406000 from host pointer 0x00007ffe39ce7900
Libomptarget --> Forwarding first-private value 0x00000000000186a0 to the target construct
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffe39c85e80, Size=400000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffe39c85e80, TgtPtrBegin=0x00007f41fd468000, Size=400000, RefCount=1
Libomptarget --> Obtained target argument 0x00007f41fd468000 from host pointer 0x00007ffe39c85e80
Libomptarget --> Launching target execution __omp_offloading_10302_140b27f_main_l18 with pointer 0x0000000001730c80 (index=0).
Target AMDGPU RTL --> Run target team region thread_limit 0
Target AMDGPU RTL --> Arg_num: 5
Target AMDGPU RTL --> Offseted base: arg[0]:0x00000000000186a0
Target AMDGPU RTL --> Offseted base: arg[1]:0x00000000000186a0
Target AMDGPU RTL --> Offseted base: arg[2]:0x00007f41fd406000
Target AMDGPU RTL --> Offseted base: arg[3]:0x00000000000186a0
Target AMDGPU RTL --> Offseted base: arg[4]:0x00007f41fd468000
Target AMDGPU RTL --> Preparing 256 threads
Target AMDGPU RTL --> Set default num of groups 36
Target AMDGPU RTL --> Final 1 numgroups and 256 threadsPerGroup
Target AMDGPU RTL --> Kernel completed
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffe39c85e80, Size=400000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffe39c85e80, TgtPtrBegin=0x00007f41fd468000, Size=400000, updated RefCount=1
Libomptarget --> There are 400000 bytes allocated at target address 0x00007f41fd468000 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffe39c85e80, Size=400000)...
Libomptarget --> Deleting tgt data 0x00007f41fd468000 of size 400000
Target AMDGPU RTL --> Tgt free data (tgt:00007f41fd468000).
Libomptarget --> Removing mapping with HstPtrBegin=0x00007ffe39c85e80, TgtPtrBegin=0x00007f41fd468000, Size=400000
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffe39ce7900, Size=400000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffe39ce7900, TgtPtrBegin=0x00007f41fd406000, Size=400000, updated RefCount=1
Libomptarget --> There are 400000 bytes allocated at target address 0x00007f41fd406000 - is last
Libomptarget --> Moving 400000 bytes (tgt:0x00007f41fd406000) -> (hst:0x00007ffe39ce7900)
Target AMDGPU RTL --> Retrieve data 400000 bytes, (tgt:00007f41fd406000) -> (hst:00007ffe39ce7900).
Target AMDGPU RTL --> DONE Retrieve data 400000 bytes, (tgt:00007f41fd406000) -> (hst:00007ffe39ce7900).
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffe39ce7900, Size=400000)...
Libomptarget --> Deleting tgt data 0x00007f41fd406000 of size 400000
Target AMDGPU RTL --> Tgt free data (tgt:00007f41fd406000).
Libomptarget --> Removing mapping with HstPtrBegin=0x00007ffe39ce7900, TgtPtrBegin=0x00007f41fd406000, Size=400000
Success
Target AMDGPU RTL --> Finalizing the HSA-ATMI DeviceInfo.
Libomptarget --> Unloading target library!
Libomptarget --> Image 0x0000000000400ee0 is compatible with RTL 0x00000000016f2840!
Libomptarget --> Unregistered image 0x0000000000400ee0 from RTL 0x00000000016f2840!
Libomptarget --> Done unregistering images!
Libomptarget --> Removing translation table for descriptor 0x0000000000404740
Libomptarget --> Done unregistering library!
Libomptarget --> Deinit target library!

I am assuming that the above debug output shows that the veccopy example runs on a gfx1011 device like a gfx9* device. Is this ok to use as a workaround in the absence of the actual gfx1011.bc (with the gfx10 features enabled)? Please let me know your thoughts.

Also, please let me know how to register for a notification once the gfx1011.bc is built and available to use. Thanks.
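For reference, the rename step of the workaround described in this comment can be sketched as a small filter over textual LLVM IR. This is an illustration under assumptions, not the exact commands that were run: `retarget_ll` is a hypothetical helper name, it blindly replaces every occurrence of the old architecture string (attributes and module ID alike), and it leaves feature strings such as +gfx9-insts untouched, as described above.

```shell
# Hypothetical helper: read textual LLVM IR on stdin and replace every
# occurrence of one architecture name with another. Feature strings such as
# +gfx9-insts are not matched and therefore stay unchanged.
retarget_ll() {
    from_arch=$1
    to_arch=$2
    sed "s/$from_arch/$to_arch/g"
}

# Assumed surrounding workflow (needs llvm-dis and llvm-as on PATH):
#   llvm-dis libomptarget-amdgcn-gfx900.bc -o - \
#     | retarget_ll gfx900 gfx1011 > libomptarget-amdgcn-gfx1011.ll
#   llvm-as libomptarget-amdgcn-gfx1011.ll -o libomptarget-amdgcn-gfx1011.bc
```

As noted in the replies, a library produced this way is still gfx9 code under a gfx10 name, so it is only useful for experiments.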

JonChesterfield commented 3 years ago

GFX10 is not expected to work on aomp. It's near the top of my todo list.

That trace shows it worked better than expected (except that the runtime should probably have said 'gfx10 is unsupported, sorry' and aborted). LLVM's backend is expected to work for gfx10, but the various places in openmp that assume a wavefront size of 64 will be incorrect for gfx10 (as it has a wavefront size of 32). That might work out for some simple cases as it sort of looks like a 64 wide machine with the top half inactive.
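To make the wavefront-size point concrete, here is a small arithmetic sketch. It does not reproduce any actual OpenMP device runtime code; `wavefronts_per_group`, `wave_id`, and `lane_id` are hypothetical names that only show how an index computed with an assumed width of 64 differs from one computed with the hardware width of 32.

```shell
# Illustration only: index arithmetic for a given wavefront width.

# Number of wavefronts needed to cover a work-group of <threads> threads.
wavefronts_per_group() {
    threads=$1
    wave_size=$2
    echo $(( (threads + wave_size - 1) / wave_size ))
}

# Wavefront index of thread <tid> within its work-group.
wave_id() {
    tid=$1
    wave_size=$2
    echo $(( tid / wave_size ))
}

# Lane index of thread <tid> within its wavefront.
lane_id() {
    tid=$1
    wave_size=$2
    echo $(( tid % wave_size ))
}
```

With the 256 threads per group from the trace, `wavefronts_per_group 256 64` gives 4 while `wavefronts_per_group 256 32` gives 8. Thread 40 is lane 40 of wavefront 0 under the 64-wide assumption, but lane 8 of wavefront 1 on 32-wide hardware, so code that communicates across lanes using the 64-wide numbering can address the wrong thread.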

The cmake I meant is the one at https://github.com/ROCm-Developer-Tools/llvm-project/blob/aomp13.0-2/openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt where a variable LIBOMPTARGET_AMDGCN_GFXLIST controls which architectures are built.

I don't know of a notification system I could use. I'll probably remember to ping this thread once it's passing our tests, but unfortunately github is routed to my spam folder so there's some lag.

gregrodgers commented 3 years ago

AOMP support for gfx10 is TBD. See issue 187.

drajarshi commented 3 years ago

Thank you Greg, Jon for prioritizing AOMP support for gfx10. Appreciate it.


JonChesterfield commented 3 years ago

Some support for gfx10 is in trunk now. It isn't heavily tested yet and has not yet reached aomp. Patch enabling it was https://reviews.llvm.org/D108708

ppanchad-amd commented 2 months ago

@drajarshi Do you still need assistance with this ticket? If not, please close the ticket. Thanks!