ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

RNN wall clock timer update #3180

Closed shurale-nkn closed 3 months ago

shurale-nkn commented 4 months ago

A more informative time-tracking system for RNN.

As an example, here is a comparison of rocBLAS and hipBLASLt timings. In the forward and backward-data passes, the hipBLASLt path is about 30% slower per call on the host side, and its first forward call, which includes runtime initialization, is about 10x slower.

rocBLAS log

# MIOPEN_GEMM_ENFORCE_BACKEND=1 ./bin/MIOpenDriver rnnfp16 -n 1024 -W 256 -H 1000 -l 8 -b 1 -m lstm -p 0 -r 0 -k 32 -F 0 -c 0 --iter 8 -w 2 -V 0
MIOpenDriver rnnfp16 -n 1024 -W 256 -H 1000 -l 8 -b 1 -m lstm -p 0 -r 0 -k 32 -F 0 -c 0 --iter 8 -w 2 -V 0
length of data sequence == 1 is short than time sequence == 32, padding the rest of data sequence with 1024
length of data sequence == 1 is short than time sequence == 32, padding the rest of data sequence with 1024
PRNG seed: 12345678
Forward RNN time results:
launch# 0 , host_time= 2458.317871 , gpu_time= 2461.264160
launch# 1 , host_time= 9.096022 , gpu_time= 19.923912
launch# 2 , host_time= 9.268521 , gpu_time= 20.110058
launch# 3 , host_time= 9.097228 , gpu_time= 19.838373
launch# 4 , host_time= 9.156680 , gpu_time= 19.923559
launch# 5 , host_time= 9.147833 , gpu_time= 20.199554
launch# 6 , host_time= 9.076235 , gpu_time= 20.391045
launch# 7 , host_time= 9.246337 , gpu_time= 20.218939
GPU Kernel Time Elapsed: 20.086491 ms
Wall-clock Time Elapsed: 9.155551 ms
Backward Data RNN time results:
launch# 0 , host_time= 1985.195435 , gpu_time= 1988.019653
launch# 1 , host_time= 8.604473 , gpu_time= 19.426306
launch# 2 , host_time= 8.750003 , gpu_time= 19.391779
launch# 3 , host_time= 8.610219 , gpu_time= 19.216003
launch# 4 , host_time= 8.618001 , gpu_time= 19.462749
launch# 5 , host_time= 8.626191 , gpu_time= 19.213589
launch# 6 , host_time= 8.637055 , gpu_time= 19.537502
launch# 7 , host_time= 8.611468 , gpu_time= 19.227840
GPU Kernel Time Elapsed: 19.353683 ms
Wall-clock Time Elapsed: 8.636773 ms
Backward Weights RNN time results:
launch# 0 , host_time= 3888.988281 , gpu_time= 3906.575928
launch# 1 , host_time= 0.671091 , gpu_time= 18.956778
launch# 2 , host_time= 0.713387 , gpu_time= 18.850042
launch# 3 , host_time= 0.579331 , gpu_time= 18.995516
launch# 4 , host_time= 0.626673 , gpu_time= 18.680723
launch# 5 , host_time= 0.621764 , gpu_time= 18.955204
launch# 6 , host_time= 0.652968 , gpu_time= 18.725475
launch# 7 , host_time= 0.568614 , gpu_time= 18.955967
GPU Kernel Time Elapsed: 18.874243 ms
Wall-clock Time Elapsed: 0.633404 ms
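
The summary lines in the log above appear to be averages over the launches with launch# 0 excluded. That is inferred from the printed numbers, not from the driver source. A minimal Python sketch that parses the forward-pass lines and checks this reading (the helper names `parse_launches` and `steady_state_mean` are hypothetical, introduced only for illustration):

```python
import re

# Forward-pass lines copied verbatim from the rocBLAS log above.
LOG = """\
launch# 0 , host_time= 2458.317871 , gpu_time= 2461.264160
launch# 1 , host_time= 9.096022 , gpu_time= 19.923912
launch# 2 , host_time= 9.268521 , gpu_time= 20.110058
launch# 3 , host_time= 9.097228 , gpu_time= 19.838373
launch# 4 , host_time= 9.156680 , gpu_time= 19.923559
launch# 5 , host_time= 9.147833 , gpu_time= 20.199554
launch# 6 , host_time= 9.076235 , gpu_time= 20.391045
launch# 7 , host_time= 9.246337 , gpu_time= 20.218939
"""

LINE_RE = re.compile(
    r"launch#\s*(\d+)\s*,\s*host_time=\s*([\d.]+)\s*,\s*gpu_time=\s*([\d.]+)"
)

def parse_launches(text):
    """Return a list of (launch_index, host_ms, gpu_ms) tuples."""
    return [(int(i), float(h), float(g)) for i, h, g in LINE_RE.findall(text)]

def steady_state_mean(launches, skip=1):
    """Mean host and GPU time, ignoring the first `skip` launches.

    Assumption: skipping exactly one launch matches how the summary is computed.
    """
    kept = launches[skip:]
    host = sum(h for _, h, _ in kept) / len(kept)
    gpu = sum(g for _, _, g in kept) / len(kept)
    return host, gpu

launches = parse_launches(LOG)
host_ms, gpu_ms = steady_state_mean(launches)
print(f"host mean = {host_ms:.6f} ms, gpu mean = {gpu_ms:.6f} ms")
```

With the first launch skipped, the two means agree with the "Wall-clock Time Elapsed" and "GPU Kernel Time Elapsed" values printed for the forward pass, so the one-time initialization cost of launch# 0 does not seem to enter the reported averages.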

hipBLASLt log

# MIOPEN_GEMM_ENFORCE_BACKEND=5 ./bin/MIOpenDriver rnnfp16 -n 1024 -W 256 -H 1000 -l 8 -b 1 -m lstm -p 0 -r 0 -k 32 -F 0 -c 0 --iter 8 -w 2 -V 0
MIOpenDriver rnnfp16 -n 1024 -W 256 -H 1000 -l 8 -b 1 -m lstm -p 0 -r 0 -k 32 -F 0 -c 0 --iter 8 -w 2 -V 0
length of data sequence == 1 is short than time sequence == 32, padding the rest of data sequence with 1024
length of data sequence == 1 is short than time sequence == 32, padding the rest of data sequence with 1024
PRNG seed: 12345678
Forward RNN time results:
launch# 0 , host_time= 25977.490234 , gpu_time= 25980.847656
launch# 1 , host_time= 11.535239 , gpu_time= 22.689487
launch# 2 , host_time= 12.296590 , gpu_time= 22.820551
launch# 3 , host_time= 12.231676 , gpu_time= 23.120279
launch# 4 , host_time= 11.865993 , gpu_time= 22.409124
launch# 5 , host_time= 11.767879 , gpu_time= 22.608805
launch# 6 , host_time= 11.694280 , gpu_time= 22.938021
launch# 7 , host_time= 13.758967 , gpu_time= 22.739683
GPU Kernel Time Elapsed: 22.760851 ms
Wall-clock Time Elapsed: 12.164375 ms
Backward Data RNN time results:
launch# 0 , host_time= 2822.273193 , gpu_time= 2824.930664
launch# 1 , host_time= 11.930065 , gpu_time= 20.319250
launch# 2 , host_time= 11.725726 , gpu_time= 20.698683
launch# 3 , host_time= 11.642274 , gpu_time= 20.550753
launch# 4 , host_time= 11.614553 , gpu_time= 20.710871
launch# 5 , host_time= 11.592884 , gpu_time= 20.408436
launch# 6 , host_time= 11.654457 , gpu_time= 20.578505
launch# 7 , host_time= 11.559514 , gpu_time= 20.630484
GPU Kernel Time Elapsed: 20.556711 ms
Wall-clock Time Elapsed: 11.674211 ms
Backward Weights RNN time results:
launch# 0 , host_time= 3926.899414 , gpu_time= 3941.994385
launch# 1 , host_time= 0.674931 , gpu_time= 16.413563
launch# 2 , host_time= 0.648857 , gpu_time= 16.370653
launch# 3 , host_time= 0.757513 , gpu_time= 16.337246
launch# 4 , host_time= 0.660882 , gpu_time= 16.302555
launch# 5 , host_time= 0.670766 , gpu_time= 16.357071
launch# 6 , host_time= 0.663294 , gpu_time= 16.395950
launch# 7 , host_time= 0.651782 , gpu_time= 16.202929
GPU Kernel Time Elapsed: 16.339994 ms
Wall-clock Time Elapsed: 0.675432 ms
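
To put numbers on the comparison, the hipBLASLt-to-rocBLAS ratios can be computed directly from the two logs. The sketch below only does arithmetic on values copied from the logs above. The dictionary layout and the `ratios` helper are hypothetical, and a single run per backend is not a statistically robust benchmark.

```python
# Values (ms) copied from the two logs above, per pass:
# (host time of launch# 0, reported wall-clock average, reported GPU kernel average)
ROCBLAS = {
    "Forward":          (2458.317871, 9.155551, 20.086491),
    "Backward Data":    (1985.195435, 8.636773, 19.353683),
    "Backward Weights": (3888.988281, 0.633404, 18.874243),
}
HIPBLASLT = {
    "Forward":          (25977.490234, 12.164375, 22.760851),
    "Backward Data":    (2822.273193, 11.674211, 20.556711),
    "Backward Weights": (3926.899414, 0.675432, 16.339994),
}

def ratios(base, other):
    """Return other/base for each (first_host, host_avg, gpu_avg) triple."""
    return {
        name: tuple(o / b for o, b in zip(other[name], base[name]))
        for name in base
    }

RATIOS = ratios(ROCBLAS, HIPBLASLT)
for name, (first, host, gpu) in RATIOS.items():
    print(f"{name:17s} first call x{first:.2f}  host x{host:.2f}  gpu x{gpu:.2f}")
```

On these numbers the host-side average is roughly 1.33x (forward) and 1.35x (backward data) that of rocBLAS, and the first forward call is roughly 10.6x. The backward-weights pass is the exception: host time is within about 7%, and the GPU kernel average is lower with hipBLASLt (ratio about 0.87).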