Closed · AshishSardana closed this issue 2 years ago
Hi @AshishSardana, thank you very much for this issue! I can't believe I made such a stupid mistake in my benchmark script. In my defence, when I built this package I used NVIDIA Visual Profiler, and it showed a very clean performance boost. I will definitely look at this over the weekend. Could you also check whether it is somehow related to torch.no_grad()? Additionally, could you check the gather flag, which should slightly improve the memory usage?
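Something like this is what I have in mind (a rough sketch; the exact rnnt_loss signature may differ in your version of the binding, and the inputs are random placeholders):

```python
import torch
from warp_rnnt import rnnt_loss

# Illustrative sizes matching the benchmark setup (T=150, U=40, V=28).
N, T, U, V = 16, 150, 40, 28
log_probs = torch.randn(N, T, U + 1, V, device="cuda").log_softmax(dim=-1)
labels = torch.randint(1, V, (N, U), dtype=torch.int, device="cuda")
frames_lengths = torch.full((N,), T, dtype=torch.int, device="cuda")
labels_lengths = torch.full((N,), U, dtype=torch.int, device="cuda")

# Forward pass with gradients disabled and the gather flag enabled.
# gather=True gathers only the blank and label probabilities instead of
# keeping the full (N, T, U+1, V) buffer, which should reduce memory usage.
with torch.no_grad():
    loss = rnnt_loss(log_probs, labels, frames_lengths, labels_lengths, gather=True)
```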
As of April 24th, 2021, pytorch_binding/benchmark.py with the torch.cuda.synchronize() fix above yields:
| setup (T=150, U=40, V=28) | warp_rnnt | warprnnt_pytorch |
|---|---|---|
| N=1 | 0.40 | 0.73 |
| N=16 | 1.31 | 1.82 |
| N=32 | 2.17 | 3.58 |
| N=64 | 4.02 | 6.36 |
| N=128 | 7.64 | 9.42 |
Hopefully this will save others from repeating these benchmarks.
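As a side note, CUDA events are another way to time GPU work without relying on the host clock. A rough, self-contained sketch (the inputs are random placeholders matching the benchmark setup):

```python
import torch
from warp_rnnt import rnnt_loss

# Random placeholder inputs with the benchmark sizes (T=150, U=40, V=28).
N, T, U, V = 16, 150, 40, 28
log_probs = torch.randn(N, T, U + 1, V, device="cuda").log_softmax(dim=-1)
labels = torch.randint(1, V, (N, U), dtype=torch.int, device="cuda")
frames_lengths = torch.full((N,), T, dtype=torch.int, device="cuda")
labels_lengths = torch.full((N,), U, dtype=torch.int, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
loss = rnnt_loss(log_probs, labels, frames_lengths, labels_lengths)
end.record()

torch.cuda.synchronize()              # wait for both events to be recorded
print(start.elapsed_time(end), "ms")  # GPU-side elapsed time in milliseconds
```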
@dophist thank you! Could you provide the hardware spec?
It was tested on a 2080 Ti.
Finally, I updated the benchmark results. Indeed, the warping doesn't show superior performance if you synchronize the device on each iteration. On the other hand, this sync-free implementation can be useful in complex use cases. Additionally, I added a gather version which shows great performance with a large vocabulary.
I added one more explanation of the performance issue with NVIDIA Profiler in 2944982.
Hi @1ytic,
Thank you for this work on optimizing the RNN-T loss operation, which is becoming increasingly useful for many speech recognition acoustic models. We have studied your implementation and have the following observations:
This way of measuring has a problem: the CPU can run ahead of the GPU and stop the timer even before the kernel has completed. After adding synchronization as below, the run times we measure are different.
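A minimal sketch of this kind of synchronized timing (the rnnt_loss call and the random inputs are illustrative placeholders, not our exact benchmark script):

```python
import time
import torch
from warp_rnnt import rnnt_loss

# Random placeholder inputs with the benchmark sizes (T=150, U=40, V=28).
N, T, U, V = 16, 150, 40, 28
log_probs = torch.randn(N, T, U + 1, V, device="cuda").log_softmax(dim=-1)
labels = torch.randint(1, V, (N, U), dtype=torch.int, device="cuda")
frames_lengths = torch.full((N,), T, dtype=torch.int, device="cuda")
labels_lengths = torch.full((N,), U, dtype=torch.int, device="cuda")

torch.cuda.synchronize()  # drain previously queued GPU work before starting the clock
t0 = time.time()
for _ in range(100):
    loss = rnnt_loss(log_probs, labels, frames_lengths, labels_lengths)
torch.cuda.synchronize()  # block until every launched kernel finishes before stopping
print(time.time() - t0, "seconds for 100 iterations")
```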
In summary, it does not look like the alternative loss function runs faster than what we have. The claimed speedup in the repo is likely caused by a flawed benchmark methodology.
Can you share your thought process on this?
Thanks, Ashish