dzhoshkun / cuda-learning

Mandelbrot exercise questions #9

Closed dzhoshkun closed 6 years ago

dzhoshkun commented 6 years ago
[0] % nvprof ./mandelbrot
==6343== NVPROF is profiling process 6343, command: ./mandelbrot
Mandelbrot set computed in 0.208 s, at 1287.527 Mpix/s
Copying took 0.467 s
==6343== Profiling application: ./mandelbrot
==6343== Profiling result:
            Type  Time(%)      Time  Calls (host)  Calls (device)       Avg       Min       Max  Name
 GPU activities:   69.08%  466.23ms             1               -  466.23ms  466.23ms  466.23ms  [CUDA memcpy DtoH]
                   30.63%  206.69ms             0               4  51.673ms  31.036ms  72.328ms  mandelbrot_pixel_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int)
                    0.29%  1.9665ms             1               1  983.25us  295.84us  1.6707ms  mandelbrot_block_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int, int)
      API calls:   48.75%  466.99ms             1               -  466.99ms  466.99ms  466.99ms  cudaMemcpy
                   29.29%  280.57ms             1               -  280.57ms  280.57ms  280.57ms  cudaMalloc
                   21.71%  207.95ms             1               -  207.95ms  207.95ms  207.95ms  cudaThreadSynchronize
                    0.08%  809.10us             1               -  809.10us  809.10us  809.10us  cudaFree
                    0.06%  615.37us            94               -  6.5460us     768ns  244.02us  cuDeviceGetAttribute
                    0.05%  517.24us             1               -  517.24us  517.24us  517.24us  cudaLaunch
                    0.04%  377.21us             1               -  377.21us  377.21us  377.21us  cuDeviceTotalMem
                    0.01%  67.536us             1               -  67.536us  67.536us  67.536us  cuDeviceGetName
                    0.00%  21.162us             3               -  7.0540us  5.7270us  8.8700us  cudaEventRecord
                    0.00%  8.2410us             9               -     915ns     628ns  2.8630us  cudaSetupArgument
                    0.00%  7.3350us             3               -  2.4450us  1.0480us  5.0990us  cudaEventCreate
                    0.00%  4.6100us             3               -  1.5360us     978ns  2.5840us  cudaEventDestroy
                    0.00%  4.3310us             2               -  2.1650us  1.4670us  2.8640us  cudaEventElapsedTime
                    0.00%  4.1910us             3               -  1.3970us     839ns  2.3040us  cuDeviceGetCount
                    0.00%  3.6320us             1               -  3.6320us  3.6320us  3.6320us  cudaEventSynchronize
                    0.00%  2.4440us             2               -  1.2220us     768ns  1.6760us  cuDeviceGet
                    0.00%  1.3270us             1               -  1.3270us  1.3270us  1.3270us  cudaConfigureCall
dzhoshkun commented 6 years ago
[0] % nvprof ./mandelbrot
==6526== NVPROF is profiling process 6526, command: ./mandelbrot
Mandelbrot set computed in 0.101 s, at 2646.876 Mpix/s
Copying took 0.492 s
==6526== Profiling application: ./mandelbrot
==6526== Profiling result:
            Type  Time(%)      Time  Calls (host)  Calls (device)       Avg       Min       Max  Name
 GPU activities:   67.27%  491.25ms             1               -  491.25ms  491.25ms  491.25ms  [CUDA memcpy DtoH]
                   15.06%  109.94ms             0             170  646.71us  67.329us  2.6077ms  mandelbrot_pixel_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int)
                    9.24%  67.506ms             0              38  1.7765ms  37.408us  37.502ms  iter_fill_k(int*, int, int, int, int, int)
                    8.43%  61.523ms             1              69  878.90us  63.360us  4.5692ms  mandelbrot_block_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int, int)
      API calls:   53.82%  492.01ms             1               -  492.01ms  492.01ms  492.01ms  cudaMemcpy
                   34.85%  318.62ms             1               -  318.62ms  318.62ms  318.62ms  cudaMalloc
                   10.99%  100.49ms             1               -  100.49ms  100.49ms  100.49ms  cudaThreadSynchronize
                    0.10%  932.93us            94               -  9.9240us     768ns  408.50us  cuDeviceGetAttribute
                    0.10%  890.12us             1               -  890.12us  890.12us  890.12us  cudaLaunch
                    0.08%  757.98us             1               -  757.98us  757.98us  757.98us  cudaFree
                    0.04%  387.06us             1               -  387.06us  387.06us  387.06us  cuDeviceTotalMem
                    0.01%  93.307us             1               -  93.307us  93.307us  93.307us  cuDeviceGetName
                    0.00%  26.819us             3               -  8.9390us  7.8220us  10.267us  cudaEventRecord
                    0.00%  11.175us             3               -  3.7250us  1.7460us  7.4730us  cudaEventCreate
                    0.00%  11.174us             9               -  1.2410us     699ns  4.3300us  cudaSetupArgument
                    0.00%  4.5400us             3               -  1.5130us     838ns  2.5840us  cuDeviceGetCount
                    0.00%  4.2610us             3               -  1.4200us     978ns  2.3050us  cudaEventDestroy
                    0.00%  4.1910us             2               -  2.0950us  1.3970us  2.7940us  cudaEventElapsedTime
                    0.00%  3.4920us             1               -  3.4920us  3.4920us  3.4920us  cudaEventSynchronize
                    0.00%  3.2130us             2               -  1.6060us     908ns  2.3050us  cuDeviceGet
                    0.00%  1.9560us             1               -  1.9560us  1.9560us  1.9560us  cudaConfigureCall
dzhoshkun commented 6 years ago
[0] % nvprof ./mandelbrot
==6691== NVPROF is profiling process 6691, command: ./mandelbrot
Mandelbrot set computed in 0.071 s, at 3789.035 Mpix/s
Copying took 0.501 s
==6691== Profiling application: ./mandelbrot
==6691== Profiling result:
            Type  Time(%)      Time  Calls (host)  Calls (device)       Avg       Min       Max  Name
 GPU activities:   37.52%  793.39ms             1            7732  102.60us  41.888us  2.1085ms  mandelbrot_block_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int, int)
                   27.90%  589.93ms             0           15512  38.030us  5.8880us  281.25us  mandelbrot_pixel_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int)
                   23.64%  499.86ms             1               -  499.86ms  499.86ms  499.86ms  [CUDA memcpy DtoH]
                   10.93%  231.15ms             0            7685  30.077us  4.8960us  64.651ms  iter_fill_k(int*, int, int, int, int, int)
      API calls:   56.10%  500.63ms             1               -  500.63ms  500.63ms  500.63ms  cudaMemcpy
                   35.70%  318.62ms             1               -  318.62ms  318.62ms  318.62ms  cudaMalloc
                    7.84%  69.928ms             1               -  69.928ms  69.928ms  69.928ms  cudaThreadSynchronize
                    0.11%  1.0191ms            94               -  10.841us     768ns  459.48us  cuDeviceGetAttribute
                    0.10%  886.70us             1               -  886.70us  886.70us  886.70us  cudaLaunch
                    0.09%  764.97us             1               -  764.97us  764.97us  764.97us  cudaFree
                    0.05%  411.15us             1               -  411.15us  411.15us  411.15us  cuDeviceTotalMem
                    0.01%  79.549us             1               -  79.549us  79.549us  79.549us  cuDeviceGetName
                    0.00%  25.353us             3               -  8.4510us  7.1940us  9.7780us  cudaEventRecord
                    0.00%  12.151us             3               -  4.0500us  1.6760us  8.4500us  cudaEventCreate
                    0.00%  10.754us             9               -  1.1940us     698ns  3.9110us  cudaSetupArgument
                    0.00%  4.9590us             3               -  1.6530us     978ns  2.8640us  cudaEventDestroy
                    0.00%  4.4700us             2               -  2.2350us  1.4670us  3.0030us  cudaEventElapsedTime
                    0.00%  4.0500us             3               -  1.3500us     768ns  2.2340us  cuDeviceGetCount
                    0.00%  3.4920us             1               -  3.4920us  3.4920us  3.4920us  cudaEventSynchronize
                    0.00%  3.2810us             2               -  1.6400us     907ns  2.3740us  cuDeviceGet
                    0.00%  1.8850us             1               -  1.8850us  1.8850us  1.8850us  cudaConfigureCall
dzhoshkun commented 6 years ago
[0] % nvprof ./mandelbrot
==6846== NVPROF is profiling process 6846, command: ./mandelbrot
Mandelbrot set computed in 0.166 s, at 1618.145 Mpix/s
Copying took 0.468 s
==6846== Profiling application: ./mandelbrot
==6846== Profiling result:
            Type  Time(%)      Time  Calls (host)  Calls (device)       Avg       Min       Max  Name
 GPU activities:   72.36%  467.39ms             1               -  467.39ms  467.39ms  467.39ms  [CUDA memcpy DtoH]
                   25.39%  163.97ms             0              14  11.712ms  997.28us  25.495ms  mandelbrot_pixel_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int)
                    2.10%  13.586ms             1               4  2.7171ms  1.5727ms  3.3637ms  mandelbrot_block_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int, int)
                    0.15%  947.71us             0               2  473.86us  461.12us  486.59us  iter_fill_k(int*, int, int, int, int, int)
      API calls:   48.12%  468.15ms             1               -  468.15ms  468.15ms  468.15ms  cudaMemcpy
                   34.47%  335.28ms             1               -  335.28ms  335.28ms  335.28ms  cudaMalloc
                   16.99%  165.31ms             1               -  165.31ms  165.31ms  165.31ms  cudaThreadSynchronize
                    0.20%  1.9888ms             1               -  1.9888ms  1.9888ms  1.9888ms  cudaFree
                    0.10%  953.33us            94               -  10.141us     769ns  407.38us  cuDeviceGetAttribute
                    0.06%  559.77us             1               -  559.77us  559.77us  559.77us  cudaLaunch
                    0.04%  412.13us             1               -  412.13us  412.13us  412.13us  cuDeviceTotalMem
                    0.01%  79.688us             1               -  79.688us  79.688us  79.688us  cuDeviceGetName
                    0.00%  18.439us             3               -  6.1460us  4.6790us  7.7530us  cudaEventRecord
                    0.00%  8.5190us             9               -     946ns     628ns  3.0030us  cudaSetupArgument
                    0.00%  7.8920us             3               -  2.6300us  1.1180us  5.5170us  cudaEventCreate
                    0.00%  4.8200us             3               -  1.6060us     769ns  3.0730us  cuDeviceGetCount
                    0.00%  4.6100us             2               -  2.3050us  1.4670us  3.1430us  cudaEventElapsedTime
                    0.00%  4.4690us             3               -  1.4890us     977ns  2.4440us  cudaEventDestroy
                    0.00%  3.4230us             1               -  3.4230us  3.4230us  3.4230us  cudaEventSynchronize
                    0.00%  2.6540us             2               -  1.3270us     908ns  1.7460us  cuDeviceGet
                    0.00%  1.4670us             1               -  1.4670us  1.4670us  1.4670us  cudaConfigureCall
dzhoshkun commented 6 years ago
[0] % nvprof ./mandelbrot
==7068== NVPROF is profiling process 7068, command: ./mandelbrot
Mandelbrot set computed in 0.130 s, at 2061.541 Mpix/s
Copying took 0.466 s
==7068== Profiling application: ./mandelbrot
==7068== Profiling result:
            Type  Time(%)      Time  Calls (host)  Calls (device)       Avg       Min       Max  Name
 GPU activities:   62.57%  464.79ms             1               -  464.79ms  464.79ms  464.79ms  [CUDA memcpy DtoH]
                   17.94%  133.26ms             0              50  2.6652ms  252.42us  8.5921ms  mandelbrot_pixel_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int)
                   16.77%  124.60ms             0               8  15.575ms  123.01us  123.31ms  iter_fill_k(int*, int, int, int, int, int)
                    2.71%  20.147ms             1              14  1.3431ms  102.21us  3.5775ms  mandelbrot_block_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int, int)
      API calls:   51.61%  465.55ms             1               -  465.55ms  465.55ms  465.55ms  cudaMemcpy
                   33.71%  304.07ms             1               -  304.07ms  304.07ms  304.07ms  cudaMalloc
                   14.37%  129.67ms             1               -  129.67ms  129.67ms  129.67ms  cudaThreadSynchronize
                    0.11%  969.59us            94               -  10.314us     768ns  411.29us  cuDeviceGetAttribute
                    0.08%  745.55us             1               -  745.55us  745.55us  745.55us  cudaFree
                    0.06%  514.59us             1               -  514.59us  514.59us  514.59us  cudaLaunch
                    0.05%  431.41us             1               -  431.41us  431.41us  431.41us  cuDeviceTotalMem
                    0.01%  78.850us             1               -  78.850us  78.850us  78.850us  cuDeviceGetName
                    0.00%  20.044us             3               -  6.6810us  5.5170us  8.5200us  cudaEventRecord
                    0.00%  8.4510us             9               -     939ns     628ns  3.0730us  cudaSetupArgument
                    0.00%  7.4040us             3               -  2.4680us  1.0480us  5.1680us  cudaEventCreate
                    0.00%  4.5390us             2               -  2.2690us  1.4660us  3.0730us  cudaEventElapsedTime
                    0.00%  4.2600us             3               -  1.4200us     977ns  2.3050us  cudaEventDestroy
                    0.00%  4.1200us             3               -  1.3730us     838ns  2.0950us  cuDeviceGetCount
                    0.00%  3.6310us             1               -  3.6310us  3.6310us  3.6310us  cudaEventSynchronize
                    0.00%  2.5140us             2               -  1.2570us     908ns  1.6060us  cuDeviceGet
                    0.00%  1.3970us             1               -  1.3970us  1.3970us  1.3970us  cudaConfigureCall
dzhoshkun commented 6 years ago
[0] % nvprof ./mandelbrot
==7252== NVPROF is profiling process 7252, command: ./mandelbrot
Mandelbrot set computed in 0.102 s, at 2620.000 Mpix/s
Copying took 0.466 s
==7252== Profiling application: ./mandelbrot
==7252== Profiling result:
            Type  Time(%)      Time  Calls (host)  Calls (device)       Avg       Min       Max  Name
 GPU activities:   65.31%  464.80ms             1               -  464.80ms  464.80ms  464.80ms  [CUDA memcpy DtoH]
                   15.88%  113.02ms             0             170  664.82us  66.816us  2.3279ms  mandelbrot_pixel_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int)
                   11.00%  78.260ms             0              38  2.0595ms  39.584us  44.788ms  iter_fill_k(int*, int, int, int, int, int)
                    7.81%  55.551ms             1              69  793.58us  70.080us  2.8471ms  mandelbrot_block_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int, int)
      API calls:   54.04%  465.56ms             1               -  465.56ms  465.56ms  465.56ms  cudaMemcpy
                   33.82%  291.32ms             1               -  291.32ms  291.32ms  291.32ms  cudaMalloc
                   11.83%  101.92ms             1               -  101.92ms  101.92ms  101.92ms  cudaThreadSynchronize
                    0.11%  925.46us            94               -  9.8450us     768ns  404.10us  cuDeviceGetAttribute
                    0.08%  729.42us             1               -  729.42us  729.42us  729.42us  cudaFree
                    0.06%  516.12us             1               -  516.12us  516.12us  516.12us  cudaLaunch
                    0.05%  408.43us             1               -  408.43us  408.43us  408.43us  cuDeviceTotalMem
                    0.01%  75.638us             1               -  75.638us  75.638us  75.638us  cuDeviceGetName
                    0.00%  19.484us             3               -  6.4940us  5.3770us  8.6600us  cudaEventRecord
                    0.00%  7.8930us             9               -     877ns     628ns  2.5840us  cudaSetupArgument
                    0.00%  7.3340us             3               -  2.4440us  1.0480us  5.1680us  cudaEventCreate
                    0.00%  4.6800us             3               -  1.5600us     978ns  2.6540us  cudaEventDestroy
                    0.00%  4.3310us             2               -  2.1650us  1.4670us  2.8640us  cudaEventElapsedTime
                    0.00%  4.0510us             3               -  1.3500us     838ns  2.2350us  cuDeviceGetCount
                    0.00%  3.6320us             1               -  3.6320us  3.6320us  3.6320us  cudaEventSynchronize
                    0.00%  3.0730us             2               -  1.5360us     908ns  2.1650us  cuDeviceGet
                    0.00%  1.3970us             1               -  1.3970us  1.3970us  1.3970us  cudaConfigureCall
dzhoshkun commented 6 years ago
[0] % nvprof ./mandelbrot
==7442== NVPROF is profiling process 7442, command: ./mandelbrot
Mandelbrot set computed in 0.076 s, at 3546.677 Mpix/s
Copying took 0.469 s
==7442== Profiling application: ./mandelbrot
==7442== Profiling result:
            Type  Time(%)      Time  Calls (host)  Calls (device)       Avg       Min       Max  Name
 GPU activities:   57.77%  468.55ms             1               -  468.55ms  468.55ms  468.55ms  [CUDA memcpy DtoH]
                   16.07%  130.32ms             0             164  794.65us  14.529us  69.170ms  iter_fill_k(int*, int, int, int, int, int)
                   14.21%  115.30ms             0             554  208.12us  23.840us  1.1083ms  mandelbrot_pixel_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int)
                   11.95%  96.952ms             1             238  405.66us  49.536us  2.9909ms  mandelbrot_block_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int, int)
      API calls:   55.31%  469.30ms             1               -  469.30ms  469.30ms  469.30ms  cudaMemcpy
                   35.50%  301.26ms             1               -  301.26ms  301.26ms  301.26ms  cudaMalloc
                    8.86%  75.144ms             1               -  75.144ms  75.144ms  75.144ms  cudaThreadSynchronize
                    0.12%  1.0049ms            94               -  10.690us     838ns  435.04us  cuDeviceGetAttribute
                    0.09%  750.09us             1               -  750.09us  750.09us  750.09us  cudaFree
                    0.06%  520.10us             1               -  520.10us  520.10us  520.10us  cudaLaunch
                    0.05%  409.06us             1               -  409.06us  409.06us  409.06us  cuDeviceTotalMem
                    0.01%  79.130us             1               -  79.130us  79.130us  79.130us  cuDeviceGetName
                    0.00%  19.345us             3               -  6.4480us  4.4690us  9.3590us  cudaEventRecord
                    0.00%  7.9610us             9               -     884ns     628ns  2.6540us  cudaSetupArgument
                    0.00%  7.3330us             3               -  2.4440us  1.0470us  5.0290us  cudaEventCreate
                    0.00%  4.8190us             3               -  1.6060us     977ns  2.7940us  cudaEventDestroy
                    0.00%  4.1210us             3               -  1.3730us     769ns  2.0950us  cuDeviceGetCount
                    0.00%  4.1210us             2               -  2.0600us  1.4670us  2.6540us  cudaEventElapsedTime
                    0.00%  3.4220us             1               -  3.4220us  3.4220us  3.4220us  cudaEventSynchronize
                    0.00%  3.2130us             2               -  1.6060us     838ns  2.3750us  cuDeviceGet
                    0.00%  1.3970us             1               -  1.3970us  1.3970us  1.3970us  cudaConfigureCall
dzhoshkun commented 6 years ago
[0] % nvprof ./mandelbrot
==7683== NVPROF is profiling process 7683, command: ./mandelbrot
Mandelbrot set computed in 0.069 s, at 3882.051 Mpix/s
Copying took 0.470 s
==7683== Profiling application: ./mandelbrot
==7683== Profiling result:
            Type  Time(%)      Time  Calls (host)  Calls (device)       Avg       Min       Max  Name
 GPU activities:   38.04%  798.44ms             1            7727  103.32us  43.328us  2.1229ms  mandelbrot_block_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int, int)
                   27.72%  581.85ms             0           15512  37.509us  5.5360us  336.67us  mandelbrot_pixel_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int)
                   22.34%  468.90ms             1               -  468.90ms  468.90ms  468.90ms  [CUDA memcpy DtoH]
                   11.91%  249.91ms             0            7685  32.519us  4.5120us  65.005ms  iter_fill_k(int*, int, int, int, int, int)
      API calls:   56.18%  469.65ms             1               -  469.65ms  469.65ms  469.65ms  cudaMemcpy
                   35.30%  295.15ms             1               -  295.15ms  295.15ms  295.15ms  cudaMalloc
                    8.21%  68.610ms             1               -  68.610ms  68.610ms  68.610ms  cudaThreadSynchronize
                    0.11%  886.21us            94               -  9.4270us     698ns  383.64us  cuDeviceGetAttribute
                    0.09%  749.88us             1               -  749.88us  749.88us  749.88us  cudaFree
                    0.06%  514.38us             1               -  514.38us  514.38us  514.38us  cudaLaunch
                    0.04%  328.88us             1               -  328.88us  328.88us  328.88us  cuDeviceTotalMem
                    0.01%  71.308us             1               -  71.308us  71.308us  71.308us  cuDeviceGetName
                    0.00%  19.207us             3               -  6.4020us  4.4700us  8.8700us  cudaEventRecord
                    0.00%  8.0990us             9               -     899ns     559ns  2.8630us  cudaSetupArgument
                    0.00%  7.8230us             3               -  2.6070us  1.0480us  5.5180us  cudaEventCreate
                    0.00%  5.0280us             2               -  2.5140us  1.4670us  3.5610us  cudaEventElapsedTime
                    0.00%  4.8900us             3               -  1.6300us  1.0480us  2.7940us  cudaEventDestroy
                    0.00%  3.6300us             3               -  1.2100us     838ns  1.8850us  cuDeviceGetCount
                    0.00%  3.5620us             1               -  3.5620us  3.5620us  3.5620us  cudaEventSynchronize
                    0.00%  2.7250us             2               -  1.3620us     839ns  1.8860us  cuDeviceGet
                    0.00%  1.6760us             1               -  1.6760us  1.6760us  1.6760us  cudaConfigureCall
dzhoshkun commented 6 years ago

Note: export CUDA_VISIBLE_DEVICES=0 was set on Adelie.

dzhoshkun commented 6 years ago
[0] % nvprof ./mandelbrot
==8481== NVPROF is profiling process 8481, command: ./mandelbrot
Mandelbrot set computed in 0.070 s, at 3837.490 Mpix/s
Copying took 0.425 s
==8481== Profiling application: ./mandelbrot
==8481== Profiling result:
            Type  Time(%)      Time  Calls (host)  Calls (device)       Avg       Min       Max  Name
 GPU activities:   39.49%  804.79ms             1            7731  104.09us  44.416us  2.1801ms  mandelbrot_block_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int, int)
                   28.60%  582.86ms             0           15512  37.574us  5.5680us  750.53us  mandelbrot_pixel_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int)
                   20.81%  424.18ms             1               -  424.18ms  424.18ms  424.18ms  [CUDA memcpy DtoH]
                   11.09%  226.02ms             0            7685  29.410us  4.8960us  64.438ms  iter_fill_k(int*, int, int, int, int, int)
      API calls:   54.43%  424.88ms             1               -  424.88ms  424.88ms  424.88ms  cudaMemcpy
                   36.32%  283.55ms             1               -  283.55ms  283.55ms  283.55ms  cudaMalloc
                    8.90%  69.457ms             1               -  69.457ms  69.457ms  69.457ms  cudaThreadSynchronize
                    0.12%  916.87us            94               -  9.7530us     768ns  402.21us  cuDeviceGetAttribute
                    0.10%  806.38us             1               -  806.38us  806.38us  806.38us  cudaFree
                    0.06%  469.26us             1               -  469.26us  469.26us  469.26us  cudaLaunch
                    0.05%  388.73us             1               -  388.73us  388.73us  388.73us  cuDeviceTotalMem
                    0.01%  73.682us             1               -  73.682us  73.682us  73.682us  cuDeviceGetName
                    0.00%  21.930us             3               -  7.3100us  5.4480us  8.9390us  cudaEventRecord
                    0.00%  7.7510us             9               -     861ns     628ns  2.3750us  cudaSetupArgument
                    0.00%  6.6350us             3               -  2.2110us     978ns  4.4700us  cudaEventCreate
                    0.00%  5.1680us             3               -  1.7220us  1.1170us  2.7240us  cudaEventDestroy
                    0.00%  4.1210us             2               -  2.0600us  1.3970us  2.7240us  cudaEventElapsedTime
                    0.00%  3.8410us             3               -  1.2800us     768ns  2.0950us  cuDeviceGetCount
                    0.00%  3.7710us             1               -  3.7710us  3.7710us  3.7710us  cudaEventSynchronize
                    0.00%  2.2350us             2               -  1.1170us     838ns  1.3970us  cuDeviceGet
                    0.00%  1.3270us             1               -  1.3270us  1.3270us  1.3270us  cudaConfigureCall
dzhoshkun commented 6 years ago
[0] % nvprof ./mandelbrot
==8690== NVPROF is profiling process 8690, command: ./mandelbrot
Mandelbrot set computed in 0.059 s, at 4514.164 Mpix/s
Copying took 0.412 s
==8690== Profiling application: ./mandelbrot
==8690== Profiling result:
            Type  Time(%)      Time  Calls (host)  Calls (device)       Avg       Min       Max  Name
 GPU activities:   52.28%  411.01ms             1               -  411.01ms  411.01ms  411.01ms  [CUDA memcpy DtoH]
                   20.26%  159.30ms             1             788  201.90us  46.240us  2.5210ms  mandelbrot_block_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int, int)
                   17.90%  140.75ms             0            1736  81.078us  11.392us  404.58us  mandelbrot_pixel_k(int*, int, int, thrust::complex<float>, thrust::complex<float>, int, int, int)
                    9.55%  75.050ms             0             644  116.54us  9.0560us  6.7421ms  iter_fill_k(int*, int, int, int, int, int)
      API calls:   56.62%  411.64ms             1               -  411.64ms  411.64ms  411.64ms  cudaMemcpy
                   34.96%  254.17ms             1               -  254.17ms  254.17ms  254.17ms  cudaMalloc
                    8.11%  58.931ms             1               -  58.931ms  58.931ms  58.931ms  cudaThreadSynchronize
                    0.10%  756.17us             1               -  756.17us  756.17us  756.17us  cudaFree
                    0.10%  744.85us            94               -  7.9230us     628ns  325.88us  cuDeviceGetAttribute
                    0.07%  509.14us             1               -  509.14us  509.14us  509.14us  cudaLaunch
                    0.03%  184.03us             1               -  184.03us  184.03us  184.03us  cuDeviceTotalMem
                    0.01%  52.870us             1               -  52.870us  52.870us  52.870us  cuDeviceGetName
                    0.00%  20.602us             3               -  6.8670us  6.1460us  7.9610us  cudaEventRecord
                    0.00%  8.4500us             9               -     938ns     628ns  2.7940us  cudaSetupArgument
                    0.00%  7.6120us             3               -  2.5370us  1.0470us  5.3780us  cudaEventCreate
                    0.00%  4.1210us             3               -  1.3730us     908ns  2.3050us  cudaEventDestroy
                    0.00%  3.8410us             2               -  1.9200us  1.3270us  2.5140us  cudaEventElapsedTime
                    0.00%  3.1430us             1               -  3.1430us  3.1430us  3.1430us  cudaEventSynchronize
                    0.00%  2.7250us             3               -     908ns     629ns  1.3270us  cuDeviceGetCount
                    0.00%  1.8860us             2               -     943ns     698ns  1.1880us  cuDeviceGet
                    0.00%  1.3970us             1               -  1.3970us  1.3970us  1.3970us  cudaConfigureCall
dzhoshkun commented 6 years ago

Increasing MAX_DEPTH increases parallelism: the deeper the recursion goes, the more threads get launched (by the leader of each block). Increasing INIT_SUBDIV also increases parallelism: the larger its value, the more thread blocks get launched, each with a smaller part of the problem to compute.
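As a rough illustration of how these two parameters bound the amount of work launched, here is a minimal host-side sketch in Python. It assumes a Mariani-Silver-style scheme in which the top-level grid has INIT_SUBDIV x INIT_SUBDIV tiles and every tile that subdivides launches a child grid of subdiv x subdiv tiles. The names `image_size`, `subdiv` and `min_size`, their values, and the exact stopping rule are assumptions made for illustration; they are not taken from this thread. The counts are worst-case upper bounds (every tile subdivides), not measurements.

```python
def tiles_per_level(image_size, init_subdiv, subdiv, max_depth, min_size):
    """Worst-case (depth, tile_size, tile_count) per recursion level.

    Assumed rule: a tile subdivides only while depth < max_depth and the
    child tiles would still be at least min_size pixels wide.
    """
    levels = []
    depth = 1
    tile_size = image_size // init_subdiv   # width of a top-level tile
    tile_count = init_subdiv ** 2           # blocks in the top-level grid
    while True:
        levels.append((depth, tile_size, tile_count))
        child_size = tile_size // subdiv
        if depth >= max_depth or child_size < min_size:
            break
        depth += 1
        tile_size = child_size
        tile_count *= subdiv ** 2           # each tile spawns subdiv x subdiv children
    return levels


# Hypothetical values: a 16384-pixel-wide image, subdiv=4, min_size=32.
for depth, size, count in tiles_per_level(16384, 4, 4, 8, 32):
    print(f"depth {depth}: up to {count} tiles of {size}x{size} pixels")
```

With these assumed values the tile-size floor is reached at depth 4, so any larger `max_depth` produces the same tiling in this model. Whether the real code behaves this way depends on its actual constants.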

However, the story seems to change when going from MAX_DEPTH=6 to MAX_DEPTH=12 with INIT_SUBDIV=4.
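To make the comparison concrete, the sketch below tabulates the device-side launch counts and the reported compute times copied from the nvprof outputs in this thread, keyed by process ID. The thread does not record which MAX_DEPTH and INIT_SUBDIV values produced which run, so the table only relates the number of launches to the compute time. The helper name `summarize` is made up for this example.

```python
# (pid, device launches of mandelbrot_block_k, mandelbrot_pixel_k, iter_fill_k,
#  "Mandelbrot set computed in" time in seconds), copied from the profiles above.
RUNS = [
    (6343, 1, 4, 0, 0.208),
    (6526, 69, 170, 38, 0.101),
    (6691, 7732, 15512, 7685, 0.071),
    (6846, 4, 14, 2, 0.166),
    (7068, 14, 50, 8, 0.130),
    (7252, 69, 170, 38, 0.102),
    (7442, 238, 554, 164, 0.076),
    (7683, 7727, 15512, 7685, 0.069),
    (8481, 7731, 15512, 7685, 0.070),
    (8690, 788, 1736, 644, 0.059),
]


def summarize(runs):
    """Return (total_device_launches, seconds, pid) rows sorted by launch count."""
    rows = [(block + pixel + fill, seconds, pid)
            for pid, block, pixel, fill, seconds in runs]
    return sorted(rows)


for launches, seconds, pid in summarize(RUNS):
    print(f"pid {pid}: {launches:6d} device launches, {seconds:.3f} s")
```

In these numbers the fastest run (PID 8690, 0.059 s) issues 3168 device-side launches, about a tenth of the roughly 30900 launches in the runs that take 0.069 to 0.071 s. That is consistent with the observation that adding more recursion does not keep paying off, although the data alone does not show why.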