WojciechMigda / Tsetlini

Efficient parallelized implementation of Multilabel Classifier and Regressor Tsetlin Machines
https://wojciechmigda.github.io/Tsetlini/
MIT License

Evaluate OpenMP-based GPU usage #113

Open WojciechMigda opened 2 years ago

WojciechMigda commented 2 years ago

References:

[1] Offloading: https://bisqwit.iki.fi/story/howto/openmp/#OffloadingSupport
[2] Offloading: https://www.lumi-supercomputer.eu/offloading-code-with-compiler-directives/
[3] Very nice Offloading introduction: https://www.youtube.com/watch?v=uVcvecgdW7g (slides)

WojciechMigda commented 2 years ago

List of Tesla NVidia accelerators: https://en.wikipedia.org/wiki/Nvidia_Tesla

Kaggle notebook: Tesla P100 - 16GB (Pascal), C.A. 6.0
Colab notebook: Tesla K80 (Kepler), C.A. 3.7; Tesla T4 (Turing), C.A. 7.5

WojciechMigda commented 2 years ago

https://developer.nvidia.com/blog/even-easier-introduction-cuda/

WojciechMigda commented 2 years ago

https://developer.nvidia.com/blog/unified-memory-in-cuda-6/

WojciechMigda commented 2 years ago

https://developer.nvidia.com/blog/unified-memory-cuda-beginners/

WojciechMigda commented 2 years ago

GCC OpenMP status: https://gcc.gnu.org/wiki/openmp

Implementation status (4.5, 5.0, 5.1): https://gcc.gnu.org/onlinedocs/libgomp/OpenMP-Implementation-Status.html

OpenMP 4.5 is supported for C/C++ since GCC 6 and since GCC 7 for Fortran (with omissions, largest missing item is structure element mapping). Since GCC 9, there is initial OpenMP 5 support (essentially C/C++, only). GCC 10 added some more features, mainly for C/C++ but also for Fortran.

GCC 11 extended the Fortran compiler to fully support OpenMP 4.5 and support more OpenMP 5.0; additionally, nonrectangular loops are now supported.

GCC 12 (ongoing development) has a growing support of OpenMP 5.0 and first support of OpenMP 5.1 features.

WojciechMigda commented 2 years ago

GCC OpenMP Offloading status: https://gcc.gnu.org/wiki/Offloading (libgomp plugins, like GOMP_OFFLOAD_get_name, are listed here.)

GCC: NVPTX (Parallel Thread eXecution) https://gcc.gnu.org/wiki/nvptx NVidia PTX spec pages, including ISA: https://docs.nvidia.com/cuda/parallel-thread-execution/

WojciechMigda commented 2 years ago

Building and running offloaded triad code on Ubuntu 18.04 with gcc-8 and an NVidia T4 accelerator (Google Colab notebook)

Create example source file

printf "#include <omp.h>\n#include <stdio.h>\n#include <stdlib.h>\n\nint main()\n{\n    printf(\"%%d\\\n\", omp_get_num_devices());\n    size_t const n = 400000000;\n    double *A = (double *)malloc(n * sizeof (double));\n    double *B = (double *)malloc(n * sizeof (double));\n    double *C = (double *)malloc(n * sizeof (double));\n    double const scalar = 2;\n\n#pragma omp target enter data map(to: B[0:n], C[0:n]) map(alloc: A[0:n])\n#pragma omp target teams distribute parallel for\n    for (size_t i = 0; i < n; ++i)\n    {\n        A[i] = B[i] + scalar * C[i];\n    }\n\n    return 0;\n}\n" > main.c && cat main.c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    printf("%d\n", omp_get_num_devices());
    size_t const n = 400000000;
    double *A = (double *)malloc(n * sizeof (double));
    double *B = (double *)malloc(n * sizeof (double));
    double *C = (double *)malloc(n * sizeof (double));
    double const scalar = 2;

#pragma omp target enter data map(to: B[0:n], C[0:n]) map(alloc: A[0:n])
#pragma omp target teams distribute parallel for
    for (size_t i = 0; i < n; ++i)
    {
        A[i] = B[i] + scalar * C[i];
    }

    return 0;
}

Query present accelerator

nvidia-smi
Thu Jan 20 14:16:34 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Install gcc-8 and its nvptx support package

sudo apt-get install gcc-8 gcc-8-offload-nvptx -y

Build and execute the example code with offloading DISABLED (Xeon CPU)

gcc-8 -fopenmp -foffload=disable -fno-stack-protector main.c -o main  && time ./main
1
tcmalloc: large alloc 3200000000 bytes == 0x55ab3de8c000 @  0x7fa7530db1e7 0x55ab3c70ea20 0x7fa752aa9bf7 0x55ab3c70e92a
tcmalloc: large alloc 3200000000 bytes == 0x55abfca4e000 @  0x7fa7530db1e7 0x55ab3c70ea34 0x7fa752aa9bf7 0x55ab3c70e92a
tcmalloc: large alloc 3200000000 bytes == 0x55acbb610000 @  0x7fa7530db1e7 0x55ab3c70ea48 0x7fa752aa9bf7 0x55ab3c70e92a

real    0m4.129s
user    0m2.780s
sys 0m2.915s

Build and execute the example code with nvptx offloading ENABLED

If one attempts to build the offloading example code with an invocation like this:

gcc-8 -fopenmp -foffload=nvptx-none -fno-stack-protector main.c -o main

Then, at least on Google's Colab notebook, it fails with an error:

ptxas fatal   : Value 'sm_30' is not defined for option 'gpu-name'
nvptx-as: ptxas returned 255 exit status
mkoffload: fatal error: x86_64-linux-gnu-accel-nvptx-none-gcc-8 returned 1 exit status
compilation terminated.
lto-wrapper: fatal error: /usr/lib/gcc/x86_64-linux-gnu/8//accel/nvptx-none/mkoffload returned 1 exit status
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status

We have to specify the GPU targets explicitly, like this:

gcc-8 -fopenmp -foffload=nvptx-none="-Wa,-mcompute_37 -Wa,-mcompute_75" -fno-stack-protector main.c -o main

Then, when we run:

gcc-8 -fopenmp -foffload=nvptx-none="-Wa,-mcompute_37 -Wa,-mcompute_75" -fno-stack-protector main.c -o main  && time ./main && nvprof ./main

we'll get

1
tcmalloc: large alloc 3200000000 bytes == 0x564f2ca44000 @  0x7fd6a83b41e7 0x564f29eebe10 0x7fd6a7d82bf7 0x564f29eebd1a
tcmalloc: large alloc 3200000000 bytes == 0x564feb606000 @  0x7fd6a83b41e7 0x564f29eebe24 0x7fd6a7d82bf7 0x564f29eebd1a
tcmalloc: large alloc 3200000000 bytes == 0x5650aa1c8000 @  0x7fd6a83b41e7 0x564f29eebe38 0x7fd6a7d82bf7 0x564f29eebd1a

real    0m2.782s
user    0m1.417s
sys 0m1.342s
==961== NVPROF is profiling process 961, command: ./main
1
tcmalloc: large alloc 3200000000 bytes == 0x5611ab522000 @  0x7f42ed95b1e7 0x5611a6426e10 0x7f42ed329bf7 0x5611a6426d1a
tcmalloc: large alloc 3200000000 bytes == 0x56126a0e4000 @  0x7f42ed95b1e7 0x5611a6426e24 0x7f42ed329bf7 0x5611a6426d1a
tcmalloc: large alloc 3200000000 bytes == 0x561328ca6000 @  0x7f42ed95b1e7 0x5611a6426e38 0x7f42ed329bf7 0x5611a6426d1a
==961== Profiling application: ./main
==961== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   85.42%  1.99808s         3  666.03ms  1.6640us  1.00249s  [CUDA memcpy HtoD]
                   14.58%  341.17ms         1  341.17ms  341.17ms  341.17ms  main$_omp_fn$0
      API calls:   73.61%  1.99876s         3  666.25ms  25.721us  1.00285s  cuMemcpyHtoD
                   12.57%  341.18ms         1  341.18ms  341.18ms  341.18ms  cuCtxSynchronize
                   10.34%  280.74ms         1  280.74ms  280.74ms  280.74ms  cuCtxCreate
                    2.65%  71.827ms         1  71.827ms  71.827ms  71.827ms  cuCtxDestroy
                    0.39%  10.609ms         5  2.1218ms  211.92us  4.4222ms  cuMemAlloc
                    0.16%  4.4237ms         1  4.4237ms  4.4237ms  4.4237ms  cuModuleLoadData
                    0.11%  3.0056ms        24  125.23us  22.038us  1.3387ms  cuLinkAddData
                    0.05%  1.4452ms         1  1.4452ms  1.4452ms  1.4452ms  cuMemFreeHost
                    0.05%  1.3060ms         2  653.02us  227.90us  1.0781ms  cuMemFree
                    0.03%  922.53us         1  922.53us  922.53us  922.53us  cuMemAllocHost
                    0.03%  759.75us         1  759.75us  759.75us  759.75us  cuLinkComplete
                    0.01%  169.11us         1  169.11us  169.11us  169.11us  cuLaunchKernel
                    0.00%  81.382us        14  5.8130us     145ns  72.963us  cuDeviceGetAttribute
                    0.00%  44.781us         1  44.781us  44.781us  44.781us  cuLinkCreate
                    0.00%  7.5620us        11     687ns     311ns  1.0340us  cuCtxGetDevice
                    0.00%  6.8260us         1  6.8260us  6.8260us  6.8260us  cuDeviceGetPCIBusId
                    0.00%  3.8700us         4     967ns     698ns  1.2300us  cuMemGetAddressRange
                    0.00%  2.8870us         1  2.8870us  2.8870us  2.8870us  cuInit
                    0.00%  2.4120us         4     603ns     220ns  1.4480us  cuDeviceGetCount
                    0.00%  1.7410us         1  1.7410us  1.7410us  1.7410us  cuLinkDestroy
                    0.00%  1.4560us         2     728ns     465ns     991ns  cuDeviceGet
                    0.00%  1.0110us         1  1.0110us  1.0110us  1.0110us  cuModuleGetFunction
                    0.00%     900ns         1     900ns     900ns     900ns  cuMemHostGetDevicePointer
                    0.00%     700ns         2     350ns     268ns     432ns  cuFuncGetAttribute
                    0.00%     646ns         1     646ns     646ns     646ns  cuModuleGetGlobal
                    0.00%     527ns         1     527ns     527ns     527ns  cuCtxGetCurrent

It's an improvement, but the code spends 2 seconds on transferring memory from host to device alone.

Adding #pragma omp requires unified_shared_memory does not improve the result.

The output above also shows that the kernel execution took 341 ms.

If we list the contents of the .rodata section, we will see the PTX mnemonics of the accelerator code:

readelf -x 16 main | head -n 16
Hex dump of section '.rodata':
  0x000012c0 01000200 00000000 25640a00 00000000 ........%d......
  0x000012d0 00000000 00000040 00000000 00000000 .......@........
  0x000012e0 58102700 00000000 60102700 00000000 X.'.....`.'.....
  0x000012f0 60102700 00000000 60102700 00000000 `.'.....`.'.....
  0x00001300 2f2f2042 4547494e 20505245 414d424c // BEGIN PREAMBL
  0x00001310 450a2e76 65727369 6f6e2033 2e310a2e E..version 3.1..
  0x00001320 74617267 65742073 6d5f3330 0a2e6164 target sm_30..ad
  0x00001330 64726573 735f7369 7a652036 340a2f2f dress_size 64.//
  0x00001340 20424547 494e2046 554e4354 494f4e20  BEGIN FUNCTION 
  0x00001350 4445434c 3a206d61 696e245f 6f6d705f DECL: main$_omp_
  0x00001360 666e2430 24696d70 6c0a2e66 756e6320 fn$0$impl..func 
  0x00001370 6d61696e 245f6f6d 705f666e 24302469 main$_omp_fn$0$i
  0x00001380 6d706c20 282e7061 72616d20 2e753634 mpl (.param .u64
  0x00001390 2025696e 5f617230 293b0a2f 2f204245  %in_ar0);.// BE

Note, however, that the .target directive argument is sm_30, so the -mcompute_37 and -mcompute_75 arguments might have been silently ignored.

readelf also shows the presence of the offloading-specific section .gnu.offload_func.

WojciechMigda commented 2 years ago

clang-8 and clang-9 build attempt on Ubuntu 18.04 (Google Colab notebook)

These clang versions fail to build at all:

clang++-9 -v -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_37 main.c -o main

This invocation is not much different from the one summarized here (Table 1): https://developer.ibm.com/articles/gpu-programming-with-openmp/ It might be worth retrying with clang version 11 or later.

clang version 9.0.0-2~ubuntu18.04.2 (tags/RELEASE_900/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/7
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/7.5.0
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/8
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7.5.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/8
Selected GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/8
Candidate multilib: .;@m64
Selected multilib: .;@m64
Found CUDA installation: /usr/local/cuda-11.1, version 7.0
clang: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]
clang: error: cannot find libdevice for sm_37. Provide path to different CUDA installation via --cuda-path, or pass -nocudalib to build without linking with libdevice.

clang-8 fails for the same reason.

WojciechMigda commented 2 years ago

Building and running offloaded triad code on Ubuntu 18.04 with nvcc-11.1 and an NVidia T4 accelerator (Google Colab notebook)

nvcc -Xcompiler -fopenmp main.c -o main  && time ./main && nvprof ./main
1
tcmalloc: large alloc 3200000000 bytes == 0x558b70f62000 @  0x7f44811721e7 0x558b6edbb360 0x7f4480515bf7 0x558b6edbb21a
tcmalloc: large alloc 3200000000 bytes == 0x558c2fb24000 @  0x7f44811721e7 0x558b6edbb377 0x7f4480515bf7 0x558b6edbb21a
tcmalloc: large alloc 3200000000 bytes == 0x558cee6e6000 @  0x7f44811721e7 0x558b6edbb38e 0x7f4480515bf7 0x558b6edbb21a

real    0m4.066s
user    0m2.712s
sys 0m2.877s
==2578== NVPROF is profiling process 2578, command: ./main
1
tcmalloc: large alloc 3200000000 bytes == 0x55a94be9a000 @  0x7fe0441041e7 0x55a946fb2360 0x7fe0434a7bf7 0x55a946fb221a
tcmalloc: large alloc 3200000000 bytes == 0x55aa0aa5c000 @  0x7fe0441041e7 0x55a946fb2377 0x7fe0434a7bf7 0x55a946fb221a
tcmalloc: large alloc 3200000000 bytes == 0x55aac961e000 @  0x7fe0441041e7 0x55a946fb238e 0x7fe0434a7bf7 0x55a946fb221a
==2578== Profiling application: ./main
==2578== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  1.98750s         2  993.75ms  991.50ms  996.00ms  [CUDA memcpy HtoD]
      API calls:   84.83%  1.98807s         2  994.04ms  991.79ms  996.28ms  cuMemcpyHtoD
                   11.71%  274.50ms         1  274.50ms  274.50ms  274.50ms  cuCtxCreate
                    3.02%  70.881ms         1  70.881ms  70.881ms  70.881ms  cuCtxDestroy
                    0.37%  8.6671ms         3  2.8890ms  2.7305ms  3.0686ms  cuMemAlloc
                    0.04%  907.57us         1  907.57us  907.57us  907.57us  cuMemAllocHost
                    0.02%  506.97us         1  506.97us  506.97us  506.97us  cuMemFreeHost
                    0.00%  86.269us        14  6.1620us     144ns  81.703us  cuDeviceGetAttribute
                    0.00%  15.024us         7  2.1460us     384ns  6.8720us  cuCtxGetDevice
                    0.00%  6.7160us         1  6.7160us  6.7160us  6.7160us  cuDeviceGetPCIBusId
                    0.00%  2.5650us         1  2.5650us  2.5650us  2.5650us  cuInit
                    0.00%  2.2340us         4     558ns     210ns     992ns  cuDeviceGetCount
                    0.00%  2.1350us         2  1.0670us  1.0200us  1.1150us  cuMemGetAddressRange
                    0.00%  1.5940us         2     797ns     649ns     945ns  cuDeviceGet
                    0.00%  1.1060us         1  1.1060us  1.1060us  1.1060us  cuMemHostGetDevicePointer
                    0.00%     819ns         1     819ns     819ns     819ns  cuCtxGetCurrent

nvprof lists only host-to-device copies and no kernel entry; it doesn't seem to run a GPU kernel at all. TODO: investigate.

WojciechMigda commented 2 years ago

Building and running offloaded triad code on Ubuntu 20.04 with gcc-9 and an NVidia P100 accelerator (Kaggle notebook)

Install nvptx plugin

sudo apt-get install gcc-9 gcc-9-offload-nvptx -y

Build and run the offloading example:

gcc-9 -fopenmp -foffload=nvptx-none="-Wa,-mcompute_60 -Wa,-mcompute_75" -fcf-protection=none -fno-stack-protector main.c -o main  && time ./main && nvprof ./main

gives:

1

real    0m3.339s
user    0m1.902s
sys 0m1.411s
==366== NVPROF is profiling process 366, command: ./main
1
==366== Profiling application: ./main
==366== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   83.40%  2.58467s         3  861.56ms  1.5360us  1.31000s  [CUDA memcpy HtoD]
                   16.60%  514.53ms         1  514.53ms  514.53ms  514.53ms  main$_omp_fn$0
      API calls:   73.28%  2.58550s         3  861.83ms  16.094us  1.31042s  cuMemcpyHtoD
                   14.58%  514.54ms         1  514.54ms  514.54ms  514.54ms  cuCtxSynchronize
                    8.40%  296.40ms         1  296.40ms  296.40ms  296.40ms  cuCtxCreate
                    2.82%  99.506ms         1  99.506ms  99.506ms  99.506ms  cuCtxDestroy
                    0.37%  12.934ms         5  2.5867ms  242.49us  4.2285ms  cuMemAlloc
                    0.28%  10.002ms         1  10.002ms  10.002ms  10.002ms  cuModuleLoadData
                    0.10%  3.5596ms         2  1.7798ms  362.68us  3.1969ms  cuMemFree
                    0.09%  3.1014ms        24  129.22us  33.860us  1.2526ms  cuLinkAddData
                    0.05%  1.7514ms         1  1.7514ms  1.7514ms  1.7514ms  cuLaunchKernel
                    0.02%  648.79us         1  648.79us  648.79us  648.79us  cuLinkComplete
                    0.00%  90.953us        16  5.6840us     195ns  84.889us  cuDeviceGetAttribute
                    0.00%  39.150us         1  39.150us  39.150us  39.150us  cuLinkCreate
                    0.00%  15.031us         1  15.031us  15.031us  15.031us  cuDeviceGetName
                    0.00%  5.6290us        11     511ns     263ns     761ns  cuCtxGetDevice
                    0.00%  5.3840us         1  5.3840us  5.3840us  5.3840us  cuDeviceGetPCIBusId
                    0.00%  4.1830us         4  1.0450us     787ns  1.3960us  cuMemGetAddressRange
                    0.00%  3.0490us         1  3.0490us  3.0490us  3.0490us  cuInit
                    0.00%  2.2980us         4     574ns     267ns  1.0770us  cuDeviceGetCount
                    0.00%  2.0280us         1  2.0280us  2.0280us  2.0280us  cuLinkDestroy
                    0.00%  1.4030us         2     701ns     423ns     980ns  cuDeviceGet
                    0.00%  1.3420us         2     671ns     574ns     768ns  cuFuncGetAttribute
                    0.00%  1.1350us         1  1.1350us  1.1350us  1.1350us  cuModuleGetFunction
                    0.00%     622ns         1     622ns     622ns     622ns  cuModuleGetGlobal
                    0.00%     431ns         1     431ns     431ns     431ns  cuCtxGetCurrent
                    0.00%     346ns         1     346ns     346ns     346ns  cuDriverGetVersion

If one tries to add #pragma omp requires unified_shared_memory, then gcc-9 at least reports an error that it's not implemented:

main.c: In function ‘main’:
main.c:14:22: sorry, unimplemented: ‘unified_shared_memory’ clause on ‘requires’ directive not supported yet
   14 | #pragma omp requires unified_shared_memory
      |       

If we list the contents of the .rodata section, we will see the PTX mnemonics of the accelerator code:

readelf -x 16 main | head -n 16
Hex dump of section '.rodata':
  0x00002000 01000200 00000000 25640a00 00000000 ........%d......
  0x00002010 00000000 00000040 00000000 00000000 .......@........
  0x00002020 58a00700 00000000 60a00700 00000000 X.......`.......
  0x00002030 60a00700 00000000 60a00700 00000000 `.......`.......
  0x00002040 2f2f2042 4547494e 20505245 414d424c // BEGIN PREAMBL
  0x00002050 450a2e76 65727369 6f6e2033 2e310a2e E..version 3.1..
  0x00002060 74617267 65742073 6d5f3330 0a2e6164 target sm_30..ad
  0x00002070 64726573 735f7369 7a652036 340a2f2f dress_size 64.//
  0x00002080 20424547 494e2046 554e4354 494f4e20  BEGIN FUNCTION 
  0x00002090 4445434c 3a206d61 696e245f 6f6d705f DECL: main$_omp_
  0x000020a0 666e2430 24696d70 6c0a2e66 756e6320 fn$0$impl..func 
  0x000020b0 6d61696e 245f6f6d 705f666e 24302469 main$_omp_fn$0$i
  0x000020c0 6d706c20 282e7061 72616d20 2e753634 mpl (.param .u64
  0x000020d0 2025696e 5f617230 293b0a2f 2f204245  %in_ar0);.// BE

Interestingly, just like with gcc-8 on Ubuntu 18.04, only .target sm_30 directives are found in the built executable.

WojciechMigda commented 2 years ago

Building and running offloaded triad code on Ubuntu 20.04 with nvcc-11.0 and an NVidia P100 accelerator (Kaggle notebook)

This command builds our example:

!nvcc -v -Xcompiler -fcf-protection=none -Xcompiler -fno-stack-protector -Xcompiler -fopenmp -Xcompiler -foffload=-misa=sm_35 -gencode arch=compute_60,code=sm_60 -Xptxas -mcompute_60 main.c -o main  && time ./main && nvprof ./main
#$ _NVVM_BRANCH_=nvvm
#$ _SPACE_= 
#$ _CUDART_=cudart
#$ _HERE_=/usr/local/cuda/bin
#$ _THERE_=/usr/local/cuda/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_DIR_=targets/x86_64-linux
#$ TOP=/usr/local/cuda/bin/..
#$ NVVMIR_LIBRARY_DIR=/usr/local/cuda/bin/../nvvm/libdevice
#$ LD_LIBRARY_PATH=/usr/local/cuda/bin/../lib:/usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
#$ PATH=/usr/local/cuda/bin/../nvvm/bin:/usr/local/cuda/bin:/opt/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
#$ INCLUDES="-I/usr/local/cuda/bin/../targets/x86_64-linux/include"  
#$ LIBRARIES=  "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib"
#$ CUDAFE_FLAGS=
#$ PTXAS_FLAGS=
#$ gcc -c -x c -D__NVCC__  -fcf-protection=none -fno-stack-protector -fopenmp -foffload=-misa=sm_35 "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=0 -D__CUDACC_VER_BUILD__=221 -m64 "main.c" -o "/tmp/tmpxft_0000090c_00000000-5_main.o" 
#$ nvlink --arch=sm_60 --register-link-binaries="/tmp/tmpxft_0000090c_00000000-3_main_dlink.reg.c"  -m64   "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib" -cpu-arch=X86_64 "/tmp/tmpxft_0000090c_00000000-5_main.o"  -lcudadevrt  -o "/tmp/tmpxft_0000090c_00000000-6_main_dlink.cubin"
#$ fatbinary -64 --cmdline="-mcompute_60  " -link "--image3=kind=elf,sm=60,file=/tmp/tmpxft_0000090c_00000000-6_main_dlink.cubin" --embedded-fatbin="/tmp/tmpxft_0000090c_00000000-4_main_dlink.fatbin.c" 
#$ rm /tmp/tmpxft_0000090c_00000000-4_main_dlink.fatbin
#$ gcc -c -x c++ -DFATBINFILE="\"/tmp/tmpxft_0000090c_00000000-4_main_dlink.fatbin.c\"" -DREGISTERLINKBINARYFILE="\"/tmp/tmpxft_0000090c_00000000-3_main_dlink.reg.c\"" -I. -D__NV_EXTRA_INITIALIZATION= -D__NV_EXTRA_FINALIZATION= -D__CUDA_INCLUDE_COMPILER_INTERNAL_HEADERS__  -fcf-protection=none -fno-stack-protector -fopenmp -foffload=-misa=sm_35 "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=0 -D__CUDACC_VER_BUILD__=221 -m64 "/usr/local/cuda/bin/crt/link.stub" -o "/tmp/tmpxft_0000090c_00000000-7_main_dlink.o" 
#$ g++ -fcf-protection=none -fno-stack-protector -fopenmp -foffload=-misa=sm_35 -m64 -Wl,--start-group "/tmp/tmpxft_0000090c_00000000-7_main_dlink.o" "/tmp/tmpxft_0000090c_00000000-5_main.o"   "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib"  -lcudadevrt  -lcudart_static  -lrt -lpthread  -ldl  -Wl,--end-group -o "main" 
1

real    0m3.271s
user    0m1.906s
sys 0m1.365s
==2351== NVPROF is profiling process 2351, command: ./main
1
==2351== Profiling application: ./main
==2351== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   83.16%  2.66507s         3  888.36ms  1.6320us  1.40679s  [CUDA memcpy HtoD]
                   16.84%  539.57ms         1  539.57ms  539.57ms  539.57ms  main$_omp_fn$0
      API calls:   73.57%  2.66584s         3  888.61ms  23.415us  1.40716s  cuMemcpyHtoD
                   14.89%  539.58ms         1  539.58ms  539.58ms  539.58ms  cuCtxSynchronize
                    7.92%  286.84ms         1  286.84ms  286.84ms  286.84ms  cuCtxCreate
                    2.75%  99.770ms         1  99.770ms  99.770ms  99.770ms  cuCtxDestroy
                    0.33%  11.952ms         5  2.3905ms  306.80us  3.7904ms  cuMemAlloc
                    0.27%  9.7247ms         1  9.7247ms  9.7247ms  9.7247ms  cuModuleLoadData
                    0.10%  3.6351ms         2  1.8176ms  371.87us  3.2633ms  cuMemFree
                    0.08%  2.9327ms        24  122.19us  30.928us  1.1967ms  cuLinkAddData
                    0.07%  2.4453ms         1  2.4453ms  2.4453ms  2.4453ms  cuLaunchKernel
                    0.02%  584.54us         1  584.54us  584.54us  584.54us  cuLinkComplete
                    0.00%  82.989us        16  5.1860us     203ns  75.802us  cuDeviceGetAttribute
                    0.00%  38.688us         1  38.688us  38.688us  38.688us  cuLinkCreate
                    0.00%  14.421us         1  14.421us  14.421us  14.421us  cuDeviceGetName
                    0.00%  9.5120us        11     864ns     251ns  4.1980us  cuCtxGetDevice
                    0.00%  4.2390us         4  1.0590us     833ns  1.3180us  cuMemGetAddressRange
                    0.00%  3.9010us         1  3.9010us  3.9010us  3.9010us  cuInit
                    0.00%  2.9070us         1  2.9070us  2.9070us  2.9070us  cuDeviceGetPCIBusId
                    0.00%  2.4500us         4     612ns     271ns  1.2100us  cuDeviceGetCount
                    0.00%  2.0020us         1  2.0020us  2.0020us  2.0020us  cuLinkDestroy
                    0.00%  1.2240us         2     612ns     479ns     745ns  cuDeviceGet
                    0.00%  1.2170us         2     608ns     516ns     701ns  cuFuncGetAttribute
                    0.00%  1.0010us         1  1.0010us  1.0010us  1.0010us  cuModuleGetFunction
                    0.00%     765ns         1     765ns     765ns     765ns  cuModuleGetGlobal
                    0.00%     642ns         1     642ns     642ns     642ns  cuCtxGetCurrent
                    0.00%     500ns         1     500ns     500ns     500ns  cuDriverGetVersion

This time the nvprof profiler shows that the kernel was executed. The verbose output we enabled, however, shows that nvcc calls g++ underneath anyway, which is surprising.

If we look inside the executable's .rodata section, we will see not only the .target sm_30 directive that we got with gcc-8, but also a .target sm_35 one.

WojciechMigda commented 2 years ago

Summary

It appears that gcc paired with the nvptx offload plugin is the easiest to use and works across different environments.