Closed sgkim126 closed 7 years ago
Profiled with nvprof.
GPU activities:
Time(%) Time Calls Avg Min Max Name
18.00% 6.07211s 5410 1.1224ms 1.1088ms 1.1677ms maxwell_sgemm_128x64_raggedMn_tn_splitK
16.31% 5.50357s 10820 508.65us 499.76us 677.27us maxwell_sgemm_128x128_raggedMn_nt
13.22% 4.45998s 5411 824.24us 811.35us 966.81us maxwell_sgemm_128x64_raggedMn_nn_splitK
10.74% 3.62274s 5411 669.51us 652.12us 790.65us maxwell_sgemm_128x128_raggedMn_nn
8.61% 2.90349s 10820 268.34us 262.66us 329.16us kSGDUpdateWeights_kernel(float, float, unsigned long, float*, float*)
6.03% 2.03362s 10822 187.91us 1.7600us 454.70us kCalculateSigmoidActivation_kernel(float*, unsigned long)
5.96% 2.01016s 5410 371.56us 362.48us 449.49us kCalculateSparseRawSigmoidScaledMarginalCrossEntropyOutputDelta_kernel(unsigned long, float*, float*)
5.70% 1.92265s 5410 355.39us 336.49us 461.04us kCalculateSparseRawScaledMarginalCrossEntropyError_kernel(float*, unsigned long)
5.33% 1.79896s 10820 166.26us 162.41us 190.21us kCalculateRegularizationError_kernel(float*, unsigned long)
3.49% 1.17737s 10822 108.79us 1.9840us 270.73us kClearUnit_kernel(float*, float*, unsigned int, unsigned long)
3.15% 1.06282s 10820 98.226us 18.048us 205.57us kSGDUpdateBiases_kernel(float, unsigned int, unsigned int, float*, float*)
0.83% 280.67ms 5410 51.880us 50.978us 63.074us kCalculateSparsenessPenalty_kernel(unsigned int, unsigned int, float*, float*, float, float)
0.82% 276.38ms 5410 51.087us 26.721us 172.39us kCalculateSparseNonZeroSigmoidScaledMarginalCrossEntropyOutputDelta_kernel(unsigned int, unsigned int, unsigned int, float*, float*, unsigned long*, unsigned long*, unsigned int*)
0.66% 221.83ms 5410 41.003us 19.169us 159.27us kLoadSparseDenoisedInputUnit_kernel(unsigned int, unsigned int, unsigned int, float*, unsigned long*, unsigned long*, unsigned int*, float*)
0.63% 212.84ms 5410 39.341us 17.568us 169.86us kCalculateSparseNonZeroScaledMarginalCrossEntropyError_kernel(unsigned int, unsigned int, unsigned int, float*, unsigned long*, unsigned long*, unsigned int*)
0.23% 76.746ms 20 3.8373ms 832ns 37.577ms [CUDA memcpy HtoD]
0.16% 55.661ms 16242 3.4260us 864ns 7.2188ms [CUDA memcpy DtoH]
0.08% 28.578ms 32486 879ns 736ns 483.02us [CUDA memset]
0.03% 10.202ms 5410 1.8850us 1.6640us 2.3360us kCalculateSigmoidHadamardProduct_kernel(unsigned long, float*, float*, float, float)
0.02% 6.3059ms 12 525.49us 88.963us 635.38us void gen_sequenced<curandStateXORWOW, float, int, __operator_&__(float curand_uniform_noargs<curandStateXORWOW>(curandStateXORWOW*, int))>(curandStateXORWOW*, float*, unsigned long, unsigned long, int)
0.01% 3.9288ms 1 3.9288ms 3.9288ms 3.9288ms generate_seed_pseudo(unsigned __int64, unsigned __int64, unsigned __int64, curandOrdering, curandStateXORWOW*, unsigned int*)
0.00% 374.60us 4 93.650us 960ns 186.63us kScaleAndBias_kernel(float*, unsigned long, float, float)
0.00% 30.977us 1 30.977us 30.977us 30.977us kLoadSparseInputUnit_kernel(unsigned int, unsigned int, unsigned int, float*, unsigned long*, unsigned long*, unsigned int*)
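To make the kernel table easier to compare, the rows can be parsed and grouped. The following is a minimal Python sketch, not part of the original issue: the helper names `to_seconds` and `parse_row` are hypothetical, it assumes each row has already been joined onto a single line, and `SAMPLE` holds only a few rows copied from the table above.

```python
import re

# Unit suffixes that appear in the nvprof summary above, mapped to seconds.
UNITS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def to_seconds(text):
    """Convert a duration such as '508.65us' or '6.07211s' to seconds."""
    match = re.fullmatch(r"([0-9.]+)(ns|us|ms|s)", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def parse_row(line):
    """Split one summary row into (percent, total_seconds, calls, name).

    The first six columns never contain spaces; the kernel name may,
    so everything after the sixth column is kept as the name.
    """
    pct, total, calls, _avg, _min, _max, name = line.split(None, 6)
    return float(pct.rstrip("%")), to_seconds(total), int(calls), name


# A few rows copied from the GPU table in this issue (not the full table).
SAMPLE = """\
18.00% 6.07211s 5410 1.1224ms 1.1088ms 1.1677ms maxwell_sgemm_128x64_raggedMn_tn_splitK
16.31% 5.50357s 10820 508.65us 499.76us 677.27us maxwell_sgemm_128x128_raggedMn_nt
13.22% 4.45998s 5411 824.24us 811.35us 966.81us maxwell_sgemm_128x64_raggedMn_nn_splitK
10.74% 3.62274s 5411 669.51us 652.12us 790.65us maxwell_sgemm_128x128_raggedMn_nn
8.61% 2.90349s 10820 268.34us 262.66us 329.16us kSGDUpdateWeights_kernel(float, float, unsigned long, float*, float*)
0.23% 76.746ms 20 3.8373ms 832ns 37.577ms [CUDA memcpy HtoD]
"""

rows = [parse_row(line) for line in SAMPLE.splitlines()]

# Group the four matrix-multiply (sgemm) kernels and total their share.
sgemm = [row for row in rows if row[3].startswith("maxwell_sgemm")]
sgemm_pct = sum(row[0] for row in sgemm)
sgemm_seconds = sum(row[1] for row in sgemm)

print(f"sgemm kernels: {sgemm_pct:.2f}% of GPU time, {sgemm_seconds:.4f}s")
```

With these rows, the four `maxwell_sgemm_*` kernels together account for roughly 58% of the GPU time, so the matrix multiplications look like the first place to check.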
API calls:
Time(%) Time Calls Avg Min Max Name
82.61% 30.6627s 16250 1.8869ms 17.190us 37.969ms cudaMemcpy
7.46% 2.76872s 119044 23.257us 12.134us 1.5348ms cudaLaunch
2.24% 831.89ms 31 26.835ms 391ns 502.55ms cudaFree
1.96% 726.34ms 8 90.793ms 29.147us 726.12ms cudaStreamCreateWithFlags
1.54% 570.48ms 21665 26.331us 14.198us 683.56us cudaMemset
1.49% 551.50ms 1055099 522ns 251ns 716.15us cudaSetupArgument
1.15% 427.42ms 1 427.42ms 427.42ms 427.42ms cudaThreadExit
0.75% 276.97ms 10821 25.595us 14.141us 1.4100ms cudaMemsetAsync
0.28% 103.21ms 119044 866ns 308ns 657.57us cudaConfigureCall
0.24% 89.006ms 119057 747ns 252ns 763.86us cudaGetLastError
0.15% 55.636ms 10821 5.1410us 2.5220us 664.53us cudaEventQuery
0.12% 42.942ms 10821 3.9680us 1.8720us 611.41us cudaEventRecord
0.02% 7.7172ms 30 257.24us 9.0260us 623.66us cudaMalloc
0.00% 1.4167ms 5 283.35us 252.36us 351.61us cudaGetDeviceProperties
0.00% 1.2396ms 352 3.5210us 191ns 203.98us cuDeviceGetAttribute
0.00% 673.81us 12 56.151us 10.315us 335.64us cudaMemcpyToSymbol
0.00% 490.59us 4 122.65us 115.69us 138.04us cuDeviceTotalMem
0.00% 186.90us 4 46.725us 34.135us 64.767us cuDeviceGetName
0.00% 63.165us 40 1.5790us 1.2370us 4.9730us cudaEventDestroy
0.00% 56.066us 8 7.0080us 5.1200us 19.276us cudaStreamDestroy
0.00% 52.214us 6 8.7020us 6.4620us 10.908us cudaThreadSynchronize
0.00% 39.745us 40 993ns 759ns 2.3950us cudaEventCreateWithFlags
0.00% 21.535us 1 21.535us 21.535us 21.535us cudaDeviceSynchronize
0.00% 20.408us 5 4.0810us 1.6200us 8.8150us cudaGetDevice
0.00% 17.103us 32 534ns 403ns 1.8650us cudaDeviceGetAttribute
0.00% 7.6090us 1 7.6090us 7.6090us 7.6090us cudaSetDeviceFlags
0.00% 4.1800us 6 696ns 282ns 1.9430us cuDeviceGetCount
0.00% 3.4290us 1 3.4290us 3.4290us 3.4290us cudaSetValidDevices
0.00% 2.7940us 3 931ns 786ns 1.1920us cuInit
0.00% 2.6570us 6 442ns 304ns 774ns cuDeviceGet
0.00% 2.2970us 1 2.2970us 2.2970us 2.2970us cudaSetDevice
0.00% 1.4990us 3 499ns 364ns 680ns cuDriverGetVersion
0.00% 797ns 1 797ns 797ns 797ns cudaGetDeviceCount
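One possible reading of the two tables, which the issue itself does not confirm: `cudaMemcpy` dominates the API time, yet the actual copies take very little GPU time. A synchronous `cudaMemcpy` generally blocks the host until earlier work on the device has finished, so most of those 30.66 s may be time spent waiting for kernels rather than transferring data. The short Python sketch below only redoes the arithmetic with numbers copied from the tables; the variable names are illustrative.

```python
# Values copied from the two nvprof tables in this issue, in seconds.
memcpy_api_s = 30.6627                 # cudaMemcpy row, API table (host-side time)
memcpy_gpu_s = 76.746e-3 + 55.661e-3   # [CUDA memcpy HtoD] + [CUDA memcpy DtoH], GPU table

# Rough total GPU busy time, back-computed from the top kernel row
# (6.07211 s reported as 18.00%). The percentage is rounded, so this
# is only an approximation.
gpu_total_s = 6.07211 / 0.1800

# How much longer the host spends inside cudaMemcpy than the copies take.
ratio = memcpy_api_s / memcpy_gpu_s

# Host time inside cudaMemcpy relative to the estimated GPU busy time.
share = memcpy_api_s / gpu_total_s

print(f"copy time on GPU:      {memcpy_gpu_s * 1e3:.1f} ms")
print(f"time inside cudaMemcpy: {memcpy_api_s:.2f} s ({ratio:.0f}x the copy time)")
print(f"estimated GPU busy time: {gpu_total_s:.1f} s (cudaMemcpy is {share:.0%} of it)")
```

If this reading holds, the `cudaMemcpy` row is mostly a reflection of kernel execution time, and the kernel table is the better guide to where the bottleneck is.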
Attach a profiler and look for bottlenecks