root@jetson-tx2-devkit:~# python3 ./demo_cdpSimplePrint.py
starting Simple Print (CUDA Dynamic Parallelism)
***************************************************************************
The CPU launches 2 blocks of 2 threads each. On the device each thread will
launch 2 blocks of 2 threads each. The GPU we will do that recursively
until it reaches max_depth=2
In total 2
+8
=10 blocks are launched!!! (8 from the GPU)
***************************************************************************
Launching cdp_kernel() with CUDA Dynamic Parallelism:
BLOCK 0 launched by the host
BLOCK 1 launched by the host
| BLOCK 3 launched by thread 0 of block 1
| BLOCK 2 launched by thread 0 of block 0
| BLOCK 5 launched by thread 0 of block 1
| BLOCK 4 launched by thread 0 of block 0
| BLOCK 6 launched by thread 1 of block 1
| BLOCK 7 launched by thread 1 of block 0
| BLOCK 9 launched by thread 1 of block 0
| BLOCK 8 launched by thread 1 of block 1
root@jetson-tx2-devkit:~#
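The block count in the banner above (2 + 8 = 10) follows from the launch pattern it describes: the host launches 2 blocks, and every thread of every block launches 2 more blocks until max_depth is reached. The following is a minimal sketch of that arithmetic in plain Python. It is not part of demo_cdpSimplePrint.py; the function name blocks_per_level and its parameters are hypothetical and chosen only for illustration.

```python
def blocks_per_level(max_depth, host_blocks=2, threads_per_block=2, child_blocks=2):
    """Return the number of blocks launched at each nesting level.

    Hypothetical helper: level 0 is the host launch, and every thread of
    every block at one level launches `child_blocks` blocks at the next.
    """
    levels = [host_blocks]
    for _ in range(1, max_depth):
        # Each block has threads_per_block threads, each launching child_blocks blocks.
        levels.append(levels[-1] * threads_per_block * child_blocks)
    return levels


levels = blocks_per_level(max_depth=2)
total = sum(levels)
from_gpu = total - levels[0]  # everything except the host launch
print("+".join(str(n) for n in levels) + f"={total} blocks ({from_gpu} from the GPU)")
# prints: 2+8=10 blocks (8 from the GPU)
```

With max_depth=2 this reproduces the 10 blocks (BLOCK 0 to BLOCK 9) listed in the output above. Under the same assumptions, raising max_depth to 3 would give 2+8+32=42 blocks.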
Tests were done on jetson-tx2-devkit with image demo-image-full.
Add the following lines into your conf/local.conf:
Running pycuda sample applications
1- Running demo.py sample application:
2- Running demo_cdpSimplePrint.py sample application: