List of NVidia Tesla accelerators: https://en.wikipedia.org/wiki/Nvidia_Tesla
Kaggle notebook: Tesla P100 - 16GB (Pascal), compute capability 6.0.
Colab notebook: Tesla K80 (Kepler), compute capability 3.7; Tesla T4 (Turing), compute capability 7.5.
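The compute capability is what the sm_XX / -mcompute_XX values used later in this issue have to match. A sketch for querying it with the CUDA runtime API (compile with nvcc; cudaGetDeviceProperties is the standard call):

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d)
    {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        /* prop.major/prop.minor is the compute capability, e.g. 7.5 for a T4 */
        printf("device %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
    }
    return 0;
}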
nvprof: https://developer.nvidia.com/blog/even-easier-introduction-cuda/
https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
GCC OpenMP status: https://gcc.gnu.org/wiki/openmp
Implementation status (4.5, 5.0, 5.1): https://gcc.gnu.org/onlinedocs/libgomp/OpenMP-Implementation-Status.html
OpenMP 4.5 is supported for C/C++ since GCC 6 and since GCC 7 for Fortran (with omissions, largest missing item is structure element mapping). Since GCC 9, there is initial OpenMP 5 support (essentially C/C++, only). GCC 10 added some more features, mainly for C/C++ but also for Fortran.
GCC 11 extended the Fortran compiler to fully support OpenMP 4.5 and support more OpenMP 5.0; additionally, nonrectangular loops are now supported.
GCC 12 (ongoing development) has a growing support of OpenMP 5.0 and first support of OpenMP 5.1 features.
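A quick way to check which OpenMP level a given compiler claims is the _OPENMP version macro; a minimal sketch:

#include <stdio.h>

int main()
{
#ifdef _OPENMP
    /* 201307 = OpenMP 4.0, 201511 = 4.5, 201811 = 5.0, 202011 = 5.1 */
    printf("_OPENMP = %d\n", _OPENMP);
#else
    printf("compiled without -fopenmp\n");
#endif
    return 0;
}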
GCC OpenMP Offloading status: https://gcc.gnu.org/wiki/Offloading
(libgomp plugins, like GOMP_OFFLOAD_get_name, are listed there.)
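Whether the right plugin is actually loaded is easy to check at runtime; a minimal sketch using only standard OpenMP API calls (0 devices means target regions will silently fall back to the host):

#include <omp.h>
#include <stdio.h>

int main()
{
    /* libgomp reports one device per usable offload plugin/accelerator */
    printf("num_devices = %d, default_device = %d\n",
           omp_get_num_devices(), omp_get_default_device());
    return 0;
}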
GCC: NVPTX (Parallel Thread eXecution): https://gcc.gnu.org/wiki/nvptx
NVidia PTX spec pages, including ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/
gcc-8 and NVidia T4 accelerator (Google Colab notebook)
printf "#include <omp.h>\n#include <stdio.h>\n#include <stdlib.h>\n\nint main()\n{\n printf(\"%%d\\\n\", omp_get_num_devices());\n size_t const n = 400000000;\n double *A = (double *)malloc(n * sizeof (double));\n double *B = (double *)malloc(n * sizeof (double));\n double *C = (double *)malloc(n * sizeof (double));\n double const scalar = 2;\n\n#pragma omp target enter data map(to: B[0:n], C[0:n]) map(alloc: A[0:n])\n#pragma omp target teams distribute parallel for\n for (size_t i = 0; i < n; ++i)\n {\n A[i] = B[i] + scalar * C[i];\n }\n\n return 0;\n}\n" > main.c && cat main.c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    printf("%d\n", omp_get_num_devices());
    size_t const n = 400000000;
    double *A = (double *)malloc(n * sizeof (double));
    double *B = (double *)malloc(n * sizeof (double));
    double *C = (double *)malloc(n * sizeof (double));
    double const scalar = 2;

#pragma omp target enter data map(to: B[0:n], C[0:n]) map(alloc: A[0:n])
#pragma omp target teams distribute parallel for
    for (size_t i = 0; i < n; ++i)
    {
        A[i] = B[i] + scalar * C[i];
    }

    return 0;
}
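Note that the benchmark never maps A back, never frees the buffers, and leaves B and C uninitialized; that is fine for measuring transfers and kernel time, but the result is never observed. A hedged sketch of a verifiable variant (the initialization, the target exit data directive, and the host-side check are additions, not part of the benchmark above):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    printf("%d\n", omp_get_num_devices());
    size_t const n = 400000000;
    double *A = (double *)malloc(n * sizeof (double));
    double *B = (double *)malloc(n * sizeof (double));
    double *C = (double *)malloc(n * sizeof (double));
    double const scalar = 2;

    /* initialize the inputs so the output can be checked */
    for (size_t i = 0; i < n; ++i)
    {
        B[i] = 1.0;
        C[i] = 3.0;
    }

#pragma omp target enter data map(to: B[0:n], C[0:n]) map(alloc: A[0:n])
#pragma omp target teams distribute parallel for
    for (size_t i = 0; i < n; ++i)
    {
        A[i] = B[i] + scalar * C[i];
    }
#pragma omp target exit data map(from: A[0:n]) map(delete: B[0:n], C[0:n])

    /* spot-check a few elements on the host: 1 + 2*3 == 7 */
    for (size_t i = 0; i < n; i += n / 8)
    {
        if (A[i] != 7.0)
        {
            fprintf(stderr, "mismatch at %zu\n", i);
            return 1;
        }
    }

    free(A);
    free(B);
    free(C);
    return 0;
}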
nvidia-smi
Thu Jan 20 14:16:34 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 44C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Install gcc-8 and its nvptx support package:
sudo apt-get install gcc-8 gcc-8-offload-nvptx -y
Build and run with offloading disabled, as a host-only baseline:
gcc-8 -fopenmp -foffload=disable -fno-stack-protector main.c -o main && time ./main
1
tcmalloc: large alloc 3200000000 bytes == 0x55ab3de8c000 @ 0x7fa7530db1e7 0x55ab3c70ea20 0x7fa752aa9bf7 0x55ab3c70e92a
tcmalloc: large alloc 3200000000 bytes == 0x55abfca4e000 @ 0x7fa7530db1e7 0x55ab3c70ea34 0x7fa752aa9bf7 0x55ab3c70e92a
tcmalloc: large alloc 3200000000 bytes == 0x55acbb610000 @ 0x7fa7530db1e7 0x55ab3c70ea48 0x7fa752aa9bf7 0x55ab3c70e92a
real 0m4.129s
user 0m2.780s
sys 0m2.915s
nvptx offloading
If one attempts to build the offloading example code with an invocation like this:
gcc-8 -fopenmp -foffload=nvptx-none -fno-stack-protector main.c -o main
Then, at least on Google's Colab notebook, it fails with this error:
ptxas fatal : Value 'sm_30' is not defined for option 'gpu-name'
nvptx-as: ptxas returned 255 exit status
mkoffload: fatal error: x86_64-linux-gnu-accel-nvptx-none-gcc-8 returned 1 exit status
compilation terminated.
lto-wrapper: fatal error: /usr/lib/gcc/x86_64-linux-gnu/8//accel/nvptx-none/mkoffload returned 1 exit status
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status
We have to specify GPU targets explicitly, like this:
gcc-8 -fopenmp -foffload=nvptx-none="-Wa,-mcompute_37 -Wa,-mcompute_75" -fno-stack-protector main.c -o main
Then, when we run:
gcc-8 -fopenmp -foffload=nvptx-none="-Wa,-mcompute_37 -Wa,-mcompute_75" -fno-stack-protector main.c -o main && time ./main && nvprof ./main
we'll get
1
tcmalloc: large alloc 3200000000 bytes == 0x564f2ca44000 @ 0x7fd6a83b41e7 0x564f29eebe10 0x7fd6a7d82bf7 0x564f29eebd1a
tcmalloc: large alloc 3200000000 bytes == 0x564feb606000 @ 0x7fd6a83b41e7 0x564f29eebe24 0x7fd6a7d82bf7 0x564f29eebd1a
tcmalloc: large alloc 3200000000 bytes == 0x5650aa1c8000 @ 0x7fd6a83b41e7 0x564f29eebe38 0x7fd6a7d82bf7 0x564f29eebd1a
real 0m2.782s
user 0m1.417s
sys 0m1.342s
==961== NVPROF is profiling process 961, command: ./main
1
tcmalloc: large alloc 3200000000 bytes == 0x5611ab522000 @ 0x7f42ed95b1e7 0x5611a6426e10 0x7f42ed329bf7 0x5611a6426d1a
tcmalloc: large alloc 3200000000 bytes == 0x56126a0e4000 @ 0x7f42ed95b1e7 0x5611a6426e24 0x7f42ed329bf7 0x5611a6426d1a
tcmalloc: large alloc 3200000000 bytes == 0x561328ca6000 @ 0x7f42ed95b1e7 0x5611a6426e38 0x7f42ed329bf7 0x5611a6426d1a
==961== Profiling application: ./main
==961== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 85.42% 1.99808s 3 666.03ms 1.6640us 1.00249s [CUDA memcpy HtoD]
14.58% 341.17ms 1 341.17ms 341.17ms 341.17ms main$_omp_fn$0
API calls: 73.61% 1.99876s 3 666.25ms 25.721us 1.00285s cuMemcpyHtoD
12.57% 341.18ms 1 341.18ms 341.18ms 341.18ms cuCtxSynchronize
10.34% 280.74ms 1 280.74ms 280.74ms 280.74ms cuCtxCreate
2.65% 71.827ms 1 71.827ms 71.827ms 71.827ms cuCtxDestroy
0.39% 10.609ms 5 2.1218ms 211.92us 4.4222ms cuMemAlloc
0.16% 4.4237ms 1 4.4237ms 4.4237ms 4.4237ms cuModuleLoadData
0.11% 3.0056ms 24 125.23us 22.038us 1.3387ms cuLinkAddData
0.05% 1.4452ms 1 1.4452ms 1.4452ms 1.4452ms cuMemFreeHost
0.05% 1.3060ms 2 653.02us 227.90us 1.0781ms cuMemFree
0.03% 922.53us 1 922.53us 922.53us 922.53us cuMemAllocHost
0.03% 759.75us 1 759.75us 759.75us 759.75us cuLinkComplete
0.01% 169.11us 1 169.11us 169.11us 169.11us cuLaunchKernel
0.00% 81.382us 14 5.8130us 145ns 72.963us cuDeviceGetAttribute
0.00% 44.781us 1 44.781us 44.781us 44.781us cuLinkCreate
0.00% 7.5620us 11 687ns 311ns 1.0340us cuCtxGetDevice
0.00% 6.8260us 1 6.8260us 6.8260us 6.8260us cuDeviceGetPCIBusId
0.00% 3.8700us 4 967ns 698ns 1.2300us cuMemGetAddressRange
0.00% 2.8870us 1 2.8870us 2.8870us 2.8870us cuInit
0.00% 2.4120us 4 603ns 220ns 1.4480us cuDeviceGetCount
0.00% 1.7410us 1 1.7410us 1.7410us 1.7410us cuLinkDestroy
0.00% 1.4560us 2 728ns 465ns 991ns cuDeviceGet
0.00% 1.0110us 1 1.0110us 1.0110us 1.0110us cuModuleGetFunction
0.00% 900ns 1 900ns 900ns 900ns cuMemHostGetDevicePointer
0.00% 700ns 2 350ns 268ns 432ns cuFuncGetAttribute
0.00% 646ns 1 646ns 646ns 646ns cuModuleGetGlobal
0.00% 527ns 1 527ns 527ns 527ns cuCtxGetCurrent
It's an improvement, but the code spends 2 seconds on transferring memory from host to device alone. Adding #pragma omp requires unified_shared_memory does not improve the result. The above output also shows that the kernel execution took 341ms.
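Since the transfers dominate, the usual remedy is to keep the arrays resident on the device and reuse them across kernel launches. A hedged sketch, as a fragment reusing A, B, C, n and scalar from main.c (omp_get_wtime is the standard OpenMP timer):

/* fragment: map once, launch many times, so the HtoD cost is amortized */
double t0 = omp_get_wtime();
#pragma omp target data map(to: B[0:n], C[0:n]) map(alloc: A[0:n])
{
    double t1 = omp_get_wtime();  /* mapping cost has been paid by now */
    for (int rep = 0; rep < 10; ++rep)
    {
#pragma omp target teams distribute parallel for
        for (size_t i = 0; i < n; ++i)
        {
            A[i] = B[i] + scalar * C[i];
        }
    }
    printf("map: %.3fs, 10 kernels: %.3fs\n", t1 - t0, omp_get_wtime() - t1);
}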
If we list the contents of the .rodata section, we will see PTX mnemonics of the accelerator code:
readelf -x 16 main | head -n 16
Hex dump of section '.rodata':
0x000012c0 01000200 00000000 25640a00 00000000 ........%d......
0x000012d0 00000000 00000040 00000000 00000000 .......@........
0x000012e0 58102700 00000000 60102700 00000000 X.'.....`.'.....
0x000012f0 60102700 00000000 60102700 00000000 `.'.....`.'.....
0x00001300 2f2f2042 4547494e 20505245 414d424c // BEGIN PREAMBL
0x00001310 450a2e76 65727369 6f6e2033 2e310a2e E..version 3.1..
0x00001320 74617267 65742073 6d5f3330 0a2e6164 target sm_30..ad
0x00001330 64726573 735f7369 7a652036 340a2f2f dress_size 64.//
0x00001340 20424547 494e2046 554e4354 494f4e20 BEGIN FUNCTION
0x00001350 4445434c 3a206d61 696e245f 6f6d705f DECL: main$_omp_
0x00001360 666e2430 24696d70 6c0a2e66 756e6320 fn$0$impl..func
0x00001370 6d61696e 245f6f6d 705f666e 24302469 main$_omp_fn$0$i
0x00001380 6d706c20 282e7061 72616d20 2e753634 mpl (.param .u64
0x00001390 2025696e 5f617230 293b0a2f 2f204245 %in_ar0);.// BE
Note, however, that the .target directive argument is sm_30, so it might be that the -mcompute_37 and -mcompute_75 arguments were silently ignored. readelf also shows the presence of an explicit offloading section, .gnu.offload_func.
clang-8 and clang-9 build attempt on Ubuntu 18.04 (Google Colab notebook)
These clang versions fail to build at all:
clang++-9 -v -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_37 main.c -o main
This invocation is not much different from the one summarized in Table 1 here: https://developer.ibm.com/articles/gpu-programming-with-openmp/ It might be worth trying with clang version 11 or later.
clang version 9.0.0-2~ubuntu18.04.2 (tags/RELEASE_900/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/7
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/7.5.0
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/8
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7.5.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/8
Selected GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/8
Candidate multilib: .;@m64
Selected multilib: .;@m64
Found CUDA installation: /usr/local/cuda-11.1, version 7.0
clang: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]
clang: error: cannot find libdevice for sm_37. Provide path to different CUDA installation via --cuda-path, or pass -nocudalib to build without linking with libdevice.
clang-8 fails for the same reason. As the error message suggests, a possible workaround is to point --cuda-path at an older CUDA installation that still ships libdevice for sm_37, or to pass -nocudalib.
nvcc-11.1 and NVidia T4 accelerator (Google Colab notebook)
nvcc -Xcompiler -fopenmp main.c -o main && time ./main && nvprof ./main
1
tcmalloc: large alloc 3200000000 bytes == 0x558b70f62000 @ 0x7f44811721e7 0x558b6edbb360 0x7f4480515bf7 0x558b6edbb21a
tcmalloc: large alloc 3200000000 bytes == 0x558c2fb24000 @ 0x7f44811721e7 0x558b6edbb377 0x7f4480515bf7 0x558b6edbb21a
tcmalloc: large alloc 3200000000 bytes == 0x558cee6e6000 @ 0x7f44811721e7 0x558b6edbb38e 0x7f4480515bf7 0x558b6edbb21a
real 0m4.066s
user 0m2.712s
sys 0m2.877s
==2578== NVPROF is profiling process 2578, command: ./main
1
tcmalloc: large alloc 3200000000 bytes == 0x55a94be9a000 @ 0x7fe0441041e7 0x55a946fb2360 0x7fe0434a7bf7 0x55a946fb221a
tcmalloc: large alloc 3200000000 bytes == 0x55aa0aa5c000 @ 0x7fe0441041e7 0x55a946fb2377 0x7fe0434a7bf7 0x55a946fb221a
tcmalloc: large alloc 3200000000 bytes == 0x55aac961e000 @ 0x7fe0441041e7 0x55a946fb238e 0x7fe0434a7bf7 0x55a946fb221a
==2578== Profiling application: ./main
==2578== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 1.98750s 2 993.75ms 991.50ms 996.00ms [CUDA memcpy HtoD]
API calls: 84.83% 1.98807s 2 994.04ms 991.79ms 996.28ms cuMemcpyHtoD
11.71% 274.50ms 1 274.50ms 274.50ms 274.50ms cuCtxCreate
3.02% 70.881ms 1 70.881ms 70.881ms 70.881ms cuCtxDestroy
0.37% 8.6671ms 3 2.8890ms 2.7305ms 3.0686ms cuMemAlloc
0.04% 907.57us 1 907.57us 907.57us 907.57us cuMemAllocHost
0.02% 506.97us 1 506.97us 506.97us 506.97us cuMemFreeHost
0.00% 86.269us 14 6.1620us 144ns 81.703us cuDeviceGetAttribute
0.00% 15.024us 7 2.1460us 384ns 6.8720us cuCtxGetDevice
0.00% 6.7160us 1 6.7160us 6.7160us 6.7160us cuDeviceGetPCIBusId
0.00% 2.5650us 1 2.5650us 2.5650us 2.5650us cuInit
0.00% 2.2340us 4 558ns 210ns 992ns cuDeviceGetCount
0.00% 2.1350us 2 1.0670us 1.0200us 1.1150us cuMemGetAddressRange
0.00% 1.5940us 2 797ns 649ns 945ns cuDeviceGet
0.00% 1.1060us 1 1.1060us 1.1060us 1.1060us cuMemHostGetDevicePointer
0.00% 819ns 1 819ns 819ns 819ns cuCtxGetCurrent
The profile shows only host-to-device copies and no kernel launch, so it doesn't seem to run a GPU kernel at all? TODO
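The two HtoD copies above suggest that the enter data mapping did run, so one suspicion is that the target region itself silently fell back to the host. A hedged diagnostic fragment (standard OpenMP 4.5 API only) that could be dropped into main.c after the pragmas:

/* did the target region run on the device, or silently fall back to the host? */
int on_host = 1;
#pragma omp target map(tofrom: on_host)
on_host = omp_is_initial_device();
printf("target region ran on: %s\n", on_host ? "host (fallback!)" : "device");

/* and is B actually mapped on the default device? */
printf("B present on device: %d\n", omp_target_is_present(B, omp_get_default_device()));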
gcc-9 and NVidia P100 accelerator (Kaggle notebook)
Install the nvptx plugin:
sudo apt-get install gcc-9 gcc-9-offload-nvptx -y
Build and run the offloading example:
gcc-9 -fopenmp -foffload=nvptx-none="-Wa,-mcompute_60 -Wa,-mcompute_75" -fcf-protection=none -fno-stack-protector main.c -o main && time ./main && nvprof ./main
gives:
1
real 0m3.339s
user 0m1.902s
sys 0m1.411s
==366== NVPROF is profiling process 366, command: ./main
1
==366== Profiling application: ./main
==366== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 83.40% 2.58467s 3 861.56ms 1.5360us 1.31000s [CUDA memcpy HtoD]
16.60% 514.53ms 1 514.53ms 514.53ms 514.53ms main$_omp_fn$0
API calls: 73.28% 2.58550s 3 861.83ms 16.094us 1.31042s cuMemcpyHtoD
14.58% 514.54ms 1 514.54ms 514.54ms 514.54ms cuCtxSynchronize
8.40% 296.40ms 1 296.40ms 296.40ms 296.40ms cuCtxCreate
2.82% 99.506ms 1 99.506ms 99.506ms 99.506ms cuCtxDestroy
0.37% 12.934ms 5 2.5867ms 242.49us 4.2285ms cuMemAlloc
0.28% 10.002ms 1 10.002ms 10.002ms 10.002ms cuModuleLoadData
0.10% 3.5596ms 2 1.7798ms 362.68us 3.1969ms cuMemFree
0.09% 3.1014ms 24 129.22us 33.860us 1.2526ms cuLinkAddData
0.05% 1.7514ms 1 1.7514ms 1.7514ms 1.7514ms cuLaunchKernel
0.02% 648.79us 1 648.79us 648.79us 648.79us cuLinkComplete
0.00% 90.953us 16 5.6840us 195ns 84.889us cuDeviceGetAttribute
0.00% 39.150us 1 39.150us 39.150us 39.150us cuLinkCreate
0.00% 15.031us 1 15.031us 15.031us 15.031us cuDeviceGetName
0.00% 5.6290us 11 511ns 263ns 761ns cuCtxGetDevice
0.00% 5.3840us 1 5.3840us 5.3840us 5.3840us cuDeviceGetPCIBusId
0.00% 4.1830us 4 1.0450us 787ns 1.3960us cuMemGetAddressRange
0.00% 3.0490us 1 3.0490us 3.0490us 3.0490us cuInit
0.00% 2.2980us 4 574ns 267ns 1.0770us cuDeviceGetCount
0.00% 2.0280us 1 2.0280us 2.0280us 2.0280us cuLinkDestroy
0.00% 1.4030us 2 701ns 423ns 980ns cuDeviceGet
0.00% 1.3420us 2 671ns 574ns 768ns cuFuncGetAttribute
0.00% 1.1350us 1 1.1350us 1.1350us 1.1350us cuModuleGetFunction
0.00% 622ns 1 622ns 622ns 622ns cuModuleGetGlobal
0.00% 431ns 1 431ns 431ns 431ns cuCtxGetCurrent
0.00% 346ns 1 346ns 346ns 346ns cuDriverGetVersion
If one tries to add #pragma omp requires unified_shared_memory, then gcc-9 at least shows an error that it's not implemented:
main.c: In function ‘main’:
main.c:14:22: sorry, unimplemented: ‘unified_shared_memory’ clause on ‘requires’ directive not supported yet
14 | #pragma omp requires unified_shared_memory
|
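For reference, on a toolchain that does implement it, the requires directive would let the example drop its map clauses entirely; a sketch of what that would look like (it will not compile with the gcc versions tested here):

#include <omp.h>
#include <stdlib.h>

/* OpenMP 5.0; rejected by gcc-9 as shown above */
#pragma omp requires unified_shared_memory

int main()
{
    size_t const n = 400000000;
    double *A = (double *)malloc(n * sizeof (double));
    double *B = (double *)malloc(n * sizeof (double));
    double *C = (double *)malloc(n * sizeof (double));
    double const scalar = 2;

    /* with unified shared memory the host pointers are directly usable
       on the device, no map clauses required */
#pragma omp target teams distribute parallel for
    for (size_t i = 0; i < n; ++i)
    {
        A[i] = B[i] + scalar * C[i];
    }

    free(A);
    free(B);
    free(C);
    return 0;
}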
If we list the contents of the .rodata section, we will see PTX mnemonics of the accelerator code:
readelf -x 16 main | head -n 16
Hex dump of section '.rodata':
0x00002000 01000200 00000000 25640a00 00000000 ........%d......
0x00002010 00000000 00000040 00000000 00000000 .......@........
0x00002020 58a00700 00000000 60a00700 00000000 X.......`.......
0x00002030 60a00700 00000000 60a00700 00000000 `.......`.......
0x00002040 2f2f2042 4547494e 20505245 414d424c // BEGIN PREAMBL
0x00002050 450a2e76 65727369 6f6e2033 2e310a2e E..version 3.1..
0x00002060 74617267 65742073 6d5f3330 0a2e6164 target sm_30..ad
0x00002070 64726573 735f7369 7a652036 340a2f2f dress_size 64.//
0x00002080 20424547 494e2046 554e4354 494f4e20 BEGIN FUNCTION
0x00002090 4445434c 3a206d61 696e245f 6f6d705f DECL: main$_omp_
0x000020a0 666e2430 24696d70 6c0a2e66 756e6320 fn$0$impl..func
0x000020b0 6d61696e 245f6f6d 705f666e 24302469 main$_omp_fn$0$i
0x000020c0 6d706c20 282e7061 72616d20 2e753634 mpl (.param .u64
0x000020d0 2025696e 5f617230 293b0a2f 2f204245 %in_ar0);.// BE
Interestingly, just like with gcc-8 on Ubuntu 18.04, only .target sm_30 directives are found in the built executable.
On the same Kaggle notebook, this nvcc invocation also builds our example:
!nvcc -v -Xcompiler -fcf-protection=none -Xcompiler -fno-stack-protector -Xcompiler -fopenmp -Xcompiler -foffload=-misa=sm_35 -gencode arch=compute_60,code=sm_60 -Xptxas -mcompute_60 main.c -o main && time ./main && nvprof ./main
#$ _NVVM_BRANCH_=nvvm
#$ _SPACE_=
#$ _CUDART_=cudart
#$ _HERE_=/usr/local/cuda/bin
#$ _THERE_=/usr/local/cuda/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_DIR_=targets/x86_64-linux
#$ TOP=/usr/local/cuda/bin/..
#$ NVVMIR_LIBRARY_DIR=/usr/local/cuda/bin/../nvvm/libdevice
#$ LD_LIBRARY_PATH=/usr/local/cuda/bin/../lib:/usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
#$ PATH=/usr/local/cuda/bin/../nvvm/bin:/usr/local/cuda/bin:/opt/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
#$ INCLUDES="-I/usr/local/cuda/bin/../targets/x86_64-linux/include"
#$ LIBRARIES= "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib"
#$ CUDAFE_FLAGS=
#$ PTXAS_FLAGS=
#$ gcc -c -x c -D__NVCC__ -fcf-protection=none -fno-stack-protector -fopenmp -foffload=-misa=sm_35 "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=0 -D__CUDACC_VER_BUILD__=221 -m64 "main.c" -o "/tmp/tmpxft_0000090c_00000000-5_main.o"
#$ nvlink --arch=sm_60 --register-link-binaries="/tmp/tmpxft_0000090c_00000000-3_main_dlink.reg.c" -m64 "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib" -cpu-arch=X86_64 "/tmp/tmpxft_0000090c_00000000-5_main.o" -lcudadevrt -o "/tmp/tmpxft_0000090c_00000000-6_main_dlink.cubin"
#$ fatbinary -64 --cmdline="-mcompute_60 " -link "--image3=kind=elf,sm=60,file=/tmp/tmpxft_0000090c_00000000-6_main_dlink.cubin" --embedded-fatbin="/tmp/tmpxft_0000090c_00000000-4_main_dlink.fatbin.c"
#$ rm /tmp/tmpxft_0000090c_00000000-4_main_dlink.fatbin
#$ gcc -c -x c++ -DFATBINFILE="\"/tmp/tmpxft_0000090c_00000000-4_main_dlink.fatbin.c\"" -DREGISTERLINKBINARYFILE="\"/tmp/tmpxft_0000090c_00000000-3_main_dlink.reg.c\"" -I. -D__NV_EXTRA_INITIALIZATION= -D__NV_EXTRA_FINALIZATION= -D__CUDA_INCLUDE_COMPILER_INTERNAL_HEADERS__ -fcf-protection=none -fno-stack-protector -fopenmp -foffload=-misa=sm_35 "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=0 -D__CUDACC_VER_BUILD__=221 -m64 "/usr/local/cuda/bin/crt/link.stub" -o "/tmp/tmpxft_0000090c_00000000-7_main_dlink.o"
#$ g++ -fcf-protection=none -fno-stack-protector -fopenmp -foffload=-misa=sm_35 -m64 -Wl,--start-group "/tmp/tmpxft_0000090c_00000000-7_main_dlink.o" "/tmp/tmpxft_0000090c_00000000-5_main.o" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib" -lcudadevrt -lcudart_static -lrt -lpthread -ldl -Wl,--end-group -o "main"
1
real 0m3.271s
user 0m1.906s
sys 0m1.365s
==2351== NVPROF is profiling process 2351, command: ./main
1
==2351== Profiling application: ./main
==2351== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 83.16% 2.66507s 3 888.36ms 1.6320us 1.40679s [CUDA memcpy HtoD]
16.84% 539.57ms 1 539.57ms 539.57ms 539.57ms main$_omp_fn$0
API calls: 73.57% 2.66584s 3 888.61ms 23.415us 1.40716s cuMemcpyHtoD
14.89% 539.58ms 1 539.58ms 539.58ms 539.58ms cuCtxSynchronize
7.92% 286.84ms 1 286.84ms 286.84ms 286.84ms cuCtxCreate
2.75% 99.770ms 1 99.770ms 99.770ms 99.770ms cuCtxDestroy
0.33% 11.952ms 5 2.3905ms 306.80us 3.7904ms cuMemAlloc
0.27% 9.7247ms 1 9.7247ms 9.7247ms 9.7247ms cuModuleLoadData
0.10% 3.6351ms 2 1.8176ms 371.87us 3.2633ms cuMemFree
0.08% 2.9327ms 24 122.19us 30.928us 1.1967ms cuLinkAddData
0.07% 2.4453ms 1 2.4453ms 2.4453ms 2.4453ms cuLaunchKernel
0.02% 584.54us 1 584.54us 584.54us 584.54us cuLinkComplete
0.00% 82.989us 16 5.1860us 203ns 75.802us cuDeviceGetAttribute
0.00% 38.688us 1 38.688us 38.688us 38.688us cuLinkCreate
0.00% 14.421us 1 14.421us 14.421us 14.421us cuDeviceGetName
0.00% 9.5120us 11 864ns 251ns 4.1980us cuCtxGetDevice
0.00% 4.2390us 4 1.0590us 833ns 1.3180us cuMemGetAddressRange
0.00% 3.9010us 1 3.9010us 3.9010us 3.9010us cuInit
0.00% 2.9070us 1 2.9070us 2.9070us 2.9070us cuDeviceGetPCIBusId
0.00% 2.4500us 4 612ns 271ns 1.2100us cuDeviceGetCount
0.00% 2.0020us 1 2.0020us 2.0020us 2.0020us cuLinkDestroy
0.00% 1.2240us 2 612ns 479ns 745ns cuDeviceGet
0.00% 1.2170us 2 608ns 516ns 701ns cuFuncGetAttribute
0.00% 1.0010us 1 1.0010us 1.0010us 1.0010us cuModuleGetFunction
0.00% 765ns 1 765ns 765ns 765ns cuModuleGetGlobal
0.00% 642ns 1 642ns 642ns 642ns cuCtxGetCurrent
0.00% 500ns 1 500ns 500ns 500ns cuDriverGetVersion
This time the nvprof profiler shows that the kernel was executed. But the verbose output we enabled shows that nvcc underneath calls gcc/g++ anyway, like, wtf? In other words, the OpenMP offloading here appears to be done by the host gcc and its nvptx plugin (note the -fopenmp -foffload=-misa=sm_35 passed straight through), with nvcc merely driving the build.
If we look inside the executable's .rodata section, we will see not only the .target sm_30 directive, the one we got with gcc-8, but also a .target sm_35 one.
It appears that gcc paired with the nvptx offload plugin is the easiest to use and works across different environments.
References:
[1] Offloading: https://bisqwit.iki.fi/story/howto/openmp/#OffloadingSupport
[2] Offloading: https://www.lumi-supercomputer.eu/offloading-code-with-compiler-directives/
[3] Very nice offloading introduction: https://www.youtube.com/watch?v=uVcvecgdW7g (slides)