cloudtrack / dsstne-starter

DSSTNE Starter makes it easy to create a sample deep learning project based on Amazon DSSTNE and AWS G2 instances.

Attach a profiler #8

sgkim126 closed this issue 7 years ago

sgkim126 commented 7 years ago

Attach a profiler and look for bottlenecks.

sgkim126 commented 7 years ago

Profiled with nvprof:

Time(%)      Time     Calls       Avg       Min       Max  Name
 18.00%  6.07211s      5410  1.1224ms  1.1088ms  1.1677ms  maxwell_sgemm_128x64_raggedMn_tn_splitK
 16.31%  5.50357s     10820  508.65us  499.76us  677.27us  maxwell_sgemm_128x128_raggedMn_nt
 13.22%  4.45998s      5411  824.24us  811.35us  966.81us  maxwell_sgemm_128x64_raggedMn_nn_splitK
 10.74%  3.62274s      5411  669.51us  652.12us  790.65us  maxwell_sgemm_128x128_raggedMn_nn
  8.61%  2.90349s     10820  268.34us  262.66us  329.16us  kSGDUpdateWeights_kernel(float, float, unsigned long, float*, float*)
  6.03%  2.03362s     10822  187.91us  1.7600us  454.70us  kCalculateSigmoidActivation_kernel(float*, unsigned long)
  5.96%  2.01016s      5410  371.56us  362.48us  449.49us  kCalculateSparseRawSigmoidScaledMarginalCrossEntropyOutputDelta_kernel(unsigned long, float*, float*)
  5.70%  1.92265s      5410  355.39us  336.49us  461.04us  kCalculateSparseRawScaledMarginalCrossEntropyError_kernel(float*, unsigned long)
  5.33%  1.79896s     10820  166.26us  162.41us  190.21us  kCalculateRegularizationError_kernel(float*, unsigned long)
  3.49%  1.17737s     10822  108.79us  1.9840us  270.73us  kClearUnit_kernel(float*, float*, unsigned int, unsigned long)
  3.15%  1.06282s     10820  98.226us  18.048us  205.57us  kSGDUpdateBiases_kernel(float, unsigned int, unsigned int, float*, float*)
  0.83%  280.67ms      5410  51.880us  50.978us  63.074us  kCalculateSparsenessPenalty_kernel(unsigned int, unsigned int, float*, float*, float, float)
  0.82%  276.38ms      5410  51.087us  26.721us  172.39us  kCalculateSparseNonZeroSigmoidScaledMarginalCrossEntropyOutputDelta_kernel(unsigned int, unsigned int, unsigned int, float*, float*, unsigned long*, unsigned long*, unsigned int*)
  0.66%  221.83ms      5410  41.003us  19.169us  159.27us  kLoadSparseDenoisedInputUnit_kernel(unsigned int, unsigned int, unsigned int, float*, unsigned long*, unsigned long*, unsigned int*, float*)
  0.63%  212.84ms      5410  39.341us  17.568us  169.86us  kCalculateSparseNonZeroScaledMarginalCrossEntropyError_kernel(unsigned int, unsigned int, unsigned int, float*, unsigned long*, unsigned long*, unsigned int*)
  0.23%  76.746ms        20  3.8373ms     832ns  37.577ms  [CUDA memcpy HtoD]
  0.16%  55.661ms     16242  3.4260us     864ns  7.2188ms  [CUDA memcpy DtoH]
  0.08%  28.578ms     32486     879ns     736ns  483.02us  [CUDA memset]
  0.03%  10.202ms      5410  1.8850us  1.6640us  2.3360us  kCalculateSigmoidHadamardProduct_kernel(unsigned long, float*, float*, float, float)
  0.02%  6.3059ms        12  525.49us  88.963us  635.38us  void gen_sequenced<curandStateXORWOW, float, int, __operator_&__(float curand_uniform_noargs<curandStateXORWOW>(curandStateXORWOW*, int))>(curandStateXORWOW*, float*, unsigned long, unsigned long, int)
  0.01%  3.9288ms         1  3.9288ms  3.9288ms  3.9288ms  generate_seed_pseudo(unsigned __int64, unsigned __int64, unsigned __int64, curandOrdering, curandStateXORWOW*, unsigned int*)
  0.00%  374.60us         4  93.650us     960ns  186.63us  kScaleAndBias_kernel(float*, unsigned long, float, float)
  0.00%  30.977us         1  30.977us  30.977us  30.977us  kLoadSparseInputUnit_kernel(unsigned int, unsigned int, unsigned int, float*, unsigned long*, unsigned long*, unsigned int*)
Time(%)      Time     Calls       Avg       Min       Max  Name
 82.61%  30.6627s     16250  1.8869ms  17.190us  37.969ms  cudaMemcpy
  7.46%  2.76872s    119044  23.257us  12.134us  1.5348ms  cudaLaunch
  2.24%  831.89ms        31  26.835ms     391ns  502.55ms  cudaFree
  1.96%  726.34ms         8  90.793ms  29.147us  726.12ms  cudaStreamCreateWithFlags
  1.54%  570.48ms     21665  26.331us  14.198us  683.56us  cudaMemset
  1.49%  551.50ms   1055099     522ns     251ns  716.15us  cudaSetupArgument
  1.15%  427.42ms         1  427.42ms  427.42ms  427.42ms  cudaThreadExit
  0.75%  276.97ms     10821  25.595us  14.141us  1.4100ms  cudaMemsetAsync
  0.28%  103.21ms    119044     866ns     308ns  657.57us  cudaConfigureCall
  0.24%  89.006ms    119057     747ns     252ns  763.86us  cudaGetLastError
  0.15%  55.636ms     10821  5.1410us  2.5220us  664.53us  cudaEventQuery
  0.12%  42.942ms     10821  3.9680us  1.8720us  611.41us  cudaEventRecord
  0.02%  7.7172ms        30  257.24us  9.0260us  623.66us  cudaMalloc
  0.00%  1.4167ms         5  283.35us  252.36us  351.61us  cudaGetDeviceProperties
  0.00%  1.2396ms       352  3.5210us     191ns  203.98us  cuDeviceGetAttribute
  0.00%  673.81us        12  56.151us  10.315us  335.64us  cudaMemcpyToSymbol
  0.00%  490.59us         4  122.65us  115.69us  138.04us  cuDeviceTotalMem
  0.00%  186.90us         4  46.725us  34.135us  64.767us  cuDeviceGetName
  0.00%  63.165us        40  1.5790us  1.2370us  4.9730us  cudaEventDestroy
  0.00%  56.066us         8  7.0080us  5.1200us  19.276us  cudaStreamDestroy
  0.00%  52.214us         6  8.7020us  6.4620us  10.908us  cudaThreadSynchronize
  0.00%  39.745us        40     993ns     759ns  2.3950us  cudaEventCreateWithFlags
  0.00%  21.535us         1  21.535us  21.535us  21.535us  cudaDeviceSynchronize
  0.00%  20.408us         5  4.0810us  1.6200us  8.8150us  cudaGetDevice
  0.00%  17.103us        32     534ns     403ns  1.8650us  cudaDeviceGetAttribute
  0.00%  7.6090us         1  7.6090us  7.6090us  7.6090us  cudaSetDeviceFlags
  0.00%  4.1800us         6     696ns     282ns  1.9430us  cuDeviceGetCount
  0.00%  3.4290us         1  3.4290us  3.4290us  3.4290us  cudaSetValidDevices
  0.00%  2.7940us         3     931ns     786ns  1.1920us  cuInit
  0.00%  2.6570us         6     442ns     304ns     774ns  cuDeviceGet
  0.00%  2.2970us         1  2.2970us  2.2970us  2.2970us  cudaSetDevice
  0.00%  1.4990us         3     499ns     364ns     680ns  cuDriverGetVersion
  0.00%     797ns         1     797ns     797ns     797ns  cudaGetDeviceCount
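
To look for the bottleneck in a summary like this, it can help to total the time by kernel family rather than reading it row by row. The following is a minimal sketch, not part of nvprof or DSSTNE: `parse_rows` and `total_seconds` are hypothetical helper names, and the regular expression assumes the seven-column layout shown above (Time(%), Time, Calls, Avg, Min, Max, Name). Lines that do not start with a percentage, such as the column headers or wrapped continuations of long kernel signatures, are skipped. The sample text is the first five rows of the kernel table above.

```python
import re

# Multipliers for the unit suffixes that appear in the summary tables.
UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}

# One summary row: Time(%)  Time  Calls  Avg  Min  Max  Name
ROW = re.compile(
    r"^\s*(?P<pct>\d+(?:\.\d+)?)%\s+"
    r"(?P<time>\d+(?:\.\d+)?)(?P<unit>ns|us|ms|s)\s+"
    r"(?P<calls>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<name>.+)$"
)


def parse_rows(text):
    """Return (name, total_seconds, calls) for every row that matches ROW."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # header line or wrapped continuation of a long name
        seconds = float(m["time"]) * UNIT_SECONDS[m["unit"]]
        rows.append((m["name"].strip(), seconds, int(m["calls"])))
    return rows


def total_seconds(rows, predicate):
    """Sum the total time of all rows whose name satisfies predicate."""
    return sum(seconds for name, seconds, _ in rows if predicate(name))


# First five rows of the kernel table, copied from the output above.
SAMPLE = """\
Time(%)      Time     Calls       Avg       Min       Max  Name
 18.00%  6.07211s      5410  1.1224ms  1.1088ms  1.1677ms  maxwell_sgemm_128x64_raggedMn_tn_splitK
 16.31%  5.50357s     10820  508.65us  499.76us  677.27us  maxwell_sgemm_128x128_raggedMn_nt
 13.22%  4.45998s      5411  824.24us  811.35us  966.81us  maxwell_sgemm_128x64_raggedMn_nn_splitK
 10.74%  3.62274s      5411  669.51us  652.12us  790.65us  maxwell_sgemm_128x128_raggedMn_nn
  8.61%  2.90349s     10820  268.34us  262.66us  329.16us  kSGDUpdateWeights_kernel(float, float, unsigned long, float*, float*)
"""

rows = parse_rows(SAMPLE)
sgemm = total_seconds(rows, lambda name: "sgemm" in name)
print(f"sgemm kernels: {sgemm:.2f} s")  # prints "sgemm kernels: 19.66 s"
```

The four `maxwell_sgemm_*` rows together account for 18.00 + 16.31 + 13.22 + 10.74 = 58.27% of the GPU time in the first table, so matrix multiplication is the largest single group of kernels in this run.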