1ytic / warp-rnnt

CUDA-Warp RNN-Transducer
MIT License

Benchmarking methodology used is not quite correct #9

Closed AshishSardana closed 2 years ago

AshishSardana commented 3 years ago

Hi @1ytic ,

Thank you for this work on optimizing the warp RNN-T operation, which is becoming increasingly useful for many speech recognition acoustic models. We have studied your implementation and have the following observations:

  1. When we run the benchmark script in your repo, we get the run times below, where "new" refers to this repo and "baseline" is what we currently have in the RNN-T reference model. We used B=32, T=U=200, V=29, which is a typical case for our dataset. From the output of the benchmark script, it does appear that the new loss function runs faster:

new: 1.76 ms baseline: 6.10 ms

  2. However, in the benchmark script that the author provided, the run time was measured as:
t = timer()
costs = loss(xs, ys, xn, yn)
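# no synchronization here: the CUDA kernels launched by loss() may still be running when the timer stops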
elapsed_time += timer() - t

This way of measuring has a problem: the CPU can run ahead of the GPU and stop the timer before the kernel has actually completed. After adding synchronization as below (a self-contained version of this timing loop is sketched after the list),

torch.cuda.synchronize() # sync before start the timer
t = timer()
costs = loss(xs, ys, xn, yn)
torch.cuda.synchronize() # sync before stop the timer
elapsed_time += timer() - t

the run time we get is:

new: 15.82 ms baseline: 6.12 ms

  3. This is similar to the run times we got from the GPU profiler nsys:

new: 14.38 ms baseline: 4.75 ms
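
For reference, here is a minimal, self-contained sketch of the synchronized timing loop from point 2, using the dimensions from point 1. The tensor names mirror the snippet above, and the rnnt_loss signature is assumed from the warp_rnnt README; treat it as an illustration rather than the exact benchmark script.

import torch
from timeit import default_timer as timer
from warp_rnnt import rnnt_loss  # assumed import path from the pytorch binding

B, T, U, V = 32, 200, 200, 29
device = torch.device("cuda")

xs = torch.randn(B, T, U + 1, V, device=device, requires_grad=True).log_softmax(dim=-1)
ys = torch.randint(1, V, (B, U), dtype=torch.int, device=device)
xn = torch.full((B,), T, dtype=torch.int, device=device)
yn = torch.full((B,), U, dtype=torch.int, device=device)

iters = 20
elapsed_time = 0.0
for _ in range(iters):
    torch.cuda.synchronize()   # make sure no earlier work is still pending
    t = timer()
    costs = rnnt_loss(xs, ys, xn, yn)
    torch.cuda.synchronize()   # wait for the loss kernels before stopping the timer
    elapsed_time += timer() - t

print(f"mean forward time: {elapsed_time / iters * 1000:.2f} ms")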

In summary, it does not look like the alternative loss function runs faster than what we have. The claimed speedup in the repo is likely caused by a flawed benchmark methodology.

Can you share your thought process on this?

Thanks, Ashish

1ytic commented 3 years ago

Hi @AshishSardana, thank you very much for this issue! I can't believe I made such a stupid mistake in my benchmark script. In my defence, I will say that when I built this package I used the NVIDIA Visual Profiler, and it showed a very clear performance boost. I will definitely look at this over the weekend. Could you also check whether it is somehow related to torch.no_grad()? Additionally, could you check the gather flag, which should slightly improve the memory usage?
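
Roughly, the check I have in mind looks like this (a sketch only, reusing the synchronized timing from above; the gather keyword is the flag on rnnt_loss in the pytorch binding, but exact defaults may differ):

import torch
from timeit import default_timer as timer
from warp_rnnt import rnnt_loss

def timed_forward(xs, ys, xn, yn, gather=False, no_grad=False, iters=20):
    # Time only the forward pass, synchronizing around each call.
    ctx = torch.no_grad() if no_grad else torch.enable_grad()
    elapsed = 0.0
    with ctx:
        for _ in range(iters):
            torch.cuda.synchronize()
            t = timer()
            costs = rnnt_loss(xs, ys, xn, yn, gather=gather)
            torch.cuda.synchronize()
            elapsed += timer() - t
    return elapsed / iters * 1000  # mean ms per call

# e.g. compare timed_forward(xs, ys, xn, yn),
#      timed_forward(xs, ys, xn, yn, no_grad=True)
#      and timed_forward(xs, ys, xn, yn, gather=True) on the same inputs.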

dophist commented 3 years ago

As of April 24th, 2021, pytorch_binding/benchmark.py with the torch.cuda.synchronize() calls described above yields:

setup(T=150, U=40, V=28):

| N | warp_rnnt (ms) | warprnnt_pytorch (ms) |
|-----|----------------|------------------------|
| 1 | 0.40 | 0.73 |
| 16 | 1.31 | 1.82 |
| 32 | 2.17 | 3.58 |
| 64 | 4.02 | 6.36 |
| 128 | 7.64 | 9.42 |

Hopefully this will save others the time of repeating these benchmarks.

1ytic commented 3 years ago

@dophist thank you! Could you provide the hardware spec?

dophist commented 3 years ago

It was tested on a 2080 Ti.

1ytic commented 2 years ago

Finally, I updated the benchmark results. Indeed, the warp implementation doesn't show superior performance if you synchronize the device on each iteration. On the other hand, this sync-free implementation can be useful in more complex use cases. Additionally, I added a gather version, which shows great performance with a large vocabulary.
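
For example, the sync-free behaviour matters in a pipeline roughly like this (names such as auxiliary_head are only illustrative): because the loss never forces a device synchronization itself, the host can keep queuing further GPU work while the loss kernels run, and only blocks when a value is actually read back.

costs = rnnt_loss(xs, ys, xn, yn, gather=True)  # launches asynchronously, no device sync inside
aux = auxiliary_head(features)                  # hypothetical extra GPU work, queued while the loss kernels run
loss = costs.mean() + aux.pow(2).mean()
loss.backward()
print(loss.item())                              # the host only waits here, when the scalar is copied back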

1ytic commented 2 years ago

I added one more explanation in 2944982 of the performance issue with the NVIDIA Profiler.