Closed GiggleLiu closed 9 months ago
According to nvprof, the launch overhead (`cuLaunchKernel`) now accounts for more than 99% of the API call time:
➜ modules git:(GPUdemo) ✗ nvprof julia QCBMS.jl
==22279== NVPROF is profiling process 22279, command: julia QCBMS.jl
==22279== Profiling application: julia QCBMS.jl
==22279== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   70.36%  77.0104s    810000  95.074us  74.113us  279.72us  ptxcall_simple_kernel_2
                   28.96%  31.6927s    720000  44.017us  32.896us  113.19us  ptxcall_simple_kernel_3
                    0.68%  748.96ms     10000  74.895us  72.801us  79.361us  ptxcall_anonymous23_1
                    0.00%  1.1371ms         4  284.27us  1.7600us  1.0389ms  [CUDA memcpy HtoD]
      API calls:   99.11%  90.5692s   1540000  58.811us  6.5610us  9.6723ms  cuLaunchKernel
                    0.43%  389.37ms   1540034     252ns     145ns  649.65us  cuCtxGetCurrent
                    0.23%  210.94ms         1  210.94ms  210.94ms  210.94ms  cuCtxCreate
                    0.14%  129.13ms         1  129.13ms  129.13ms  129.13ms  cuCtxDestroy
                    0.07%  65.987ms         3  21.996ms  47.171us  65.891ms  cuModuleUnload
                    0.01%  13.700ms        27  507.41us  439.26us  724.08us  cuMemAlloc
                    0.00%  2.5056ms         3  835.19us  348.68us  1.7719ms  cuModuleLoadDataEx
                    0.00%  1.4557ms         4  363.94us  43.000us  1.1706ms  cuMemcpyHtoD
                    0.00%  36.489us         8  4.5610us  3.6320us  8.1710us  cuDeviceGetPCIBusId
                    0.00%  15.972us        30     532ns     167ns  2.4170us  cuDeviceGetAttribute
                    0.00%  9.0610us         9  1.0060us     283ns  4.6000us  cuDeviceGet
                    0.00%  3.2120us         3  1.0700us  1.0430us  1.0890us  cuModuleGetFunction
                    0.00%  2.6260us         3     875ns     707ns  1.0060us  cuCtxGetDevice
                    0.00%  2.4400us         1  2.4400us  2.4400us  2.4400us  cuDriverGetVersion
                    0.00%  2.0020us         2  1.0010us     282ns  1.7200us  cuDeviceGetCount
This profiling result does not make sense.
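As a rough sanity check on the numbers reported by nvprof, the sketch below recomputes a few totals from the summary. It is only arithmetic on figures copied from the output; the variable names are made up for illustration. One thing to keep in mind when reading it: the `Time(%)` column is normalized separately within "GPU activities" and within "API calls", so the 99.11% is a share of the time spent in driver API calls, not of the total runtime.

```python
# Figures copied from the nvprof summary (times in seconds).
gpu_kernel_time = {
    "ptxcall_simple_kernel_2": 77.0104,
    "ptxcall_simple_kernel_3": 31.6927,
    "ptxcall_anonymous23_1": 0.74896,
}
kernel_calls = 810_000 + 720_000 + 10_000  # per-kernel call counts
launch_api_time = 90.5692                  # total time inside cuLaunchKernel
launch_calls = 1_540_000                   # number of cuLaunchKernel calls

# Total time the GPU spent executing kernels.
total_kernel_time = sum(gpu_kernel_time.values())

# Average host-side launch cost vs. average kernel run time, in microseconds.
avg_launch_us = launch_api_time / launch_calls * 1e6
avg_kernel_us = total_kernel_time / kernel_calls * 1e6

print(f"kernel launches:     {kernel_calls}")
print(f"total kernel time:   {total_kernel_time:.2f} s")
print(f"avg cuLaunchKernel:  {avg_launch_us:.1f} us")
print(f"avg kernel run time: {avg_kernel_us:.1f} us")
```

The call counts match (every `cuLaunchKernel` call corresponds to one kernel execution), and the average time per launch call is of the same order as the average kernel run time. One possible reading, offered as an assumption rather than a diagnosis, is that part of the time attributed to `cuLaunchKernel` is the host waiting for the GPU to drain earlier launches, rather than pure launch overhead.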