csukuangfj / optimized_transducer

Memory efficient transducer loss computation

transducer grad compute formula #37

Open zh794390558 opened 2 years ago

zh794390558 commented 2 years ago

The formula for the gradient below is from warprnnt_numba and warp_transducer CPU:

    T, U, _ = log_probs.shape
    grads = np.full(log_probs.shape, -float("inf"))
    log_like = betas[0, 0]  # == alphas[T - 1, U - 1] + betas[T - 1, U - 1]

    # grad to the final blank transition, and to blank transitions at t < T - 1
    grads[T - 1, U - 1, blank] = alphas[T - 1, U - 1]
    grads[: T - 1, :, blank] = alphas[: T - 1, :] + betas[1:, :]

    # grad to label transitions
    for u, l in enumerate(labels):
        grads[:, u, l] = alphas[:, u] + betas[:, u + 1]

    grads = -np.exp(grads + log_probs - log_like)

That is not the same as torchaudio, optimized_transducer, and warp_transducer GPU, but you said that the warp_transducer CPU grad is the same as optimized_transducer and torchaudio. How is that achieved?
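For reference, here is a minimal sketch of the standard RNN-T forward-backward recursions that produce the alphas, betas, and log_like used in the snippet above. The variable names follow the snippet; the actual implementations in warprnnt_numba and warp_transducer may differ in details such as batching and length masking.

    import numpy as np

    def forward_backward(log_probs, labels, blank):
        # log_probs: (T, U, V) log-softmax output of the joint network
        # labels: the U - 1 target label ids (without blank)
        T, U, _ = log_probs.shape
        alphas = np.zeros((T, U))
        betas = np.zeros((T, U))

        # alphas[t, u]: log-prob of all partial paths that have consumed
        # t acoustic frames and emitted the first u labels.
        for t in range(1, T):
            alphas[t, 0] = alphas[t - 1, 0] + log_probs[t - 1, 0, blank]
        for u in range(1, U):
            alphas[0, u] = alphas[0, u - 1] + log_probs[0, u - 1, labels[u - 1]]
        for t in range(1, T):
            for u in range(1, U):
                no_emit = alphas[t - 1, u] + log_probs[t - 1, u, blank]
                emit = alphas[t, u - 1] + log_probs[t, u - 1, labels[u - 1]]
                alphas[t, u] = np.logaddexp(no_emit, emit)

        # betas[t, u]: log-prob of completing the alignment from node (t, u).
        betas[T - 1, U - 1] = log_probs[T - 1, U - 1, blank]
        for t in range(T - 2, -1, -1):
            betas[t, U - 1] = betas[t + 1, U - 1] + log_probs[t, U - 1, blank]
        for u in range(U - 2, -1, -1):
            betas[T - 1, u] = betas[T - 1, u + 1] + log_probs[T - 1, u, labels[u]]
        for t in range(T - 2, -1, -1):
            for u in range(U - 2, -1, -1):
                no_emit = betas[t + 1, u] + log_probs[t, u, blank]
                emit = betas[t, u + 1] + log_probs[t, u, labels[u]]
                betas[t, u] = np.logaddexp(no_emit, emit)

        log_like = betas[0, 0]  # total log-likelihood of the label sequence
        return alphas, betas, log_like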

csukuangfj commented 2 years ago

but you said that the warp_transducer CPU grad is the same as optimized_transducer and torchaudio

Where did you find that?

csukuangfj commented 2 years ago

The README.md says:

Therefore, optimized_transducer produces the same alpha and beta as warp-transducer for the same input.

It only says alpha and beta, not grad.

zh794390558 commented 2 years ago

It borrows the methods of computing alpha and beta from warp-transducer. Therefore, optimized_transducer produces the same alpha and beta as warp-transducer for the same input.

However, warp-transducer produces different gradients for CPU and CUDA when using the same input. See https://github.com/HawkAaron/warp-transducer/issues/93. I also created a [colab notebook](https://colab.research.google.com/drive/1vMkH8LmiCCOiCo4KTTEcv-NU8_OGn0ie?usp=sharing) to reproduce that issue.

This project produces consistent gradient on CPU and CUDA for the same input, just like what torchaudio is doing. (We borrow the gradient computation formula from torchaudio).

Sorry, I got it wrong. So the known conclusion is that torchaudio is aligned with optimized_transducer. Will warp_transducer GPU have the same grad result as optimized_transducer, with only warp_transducer CPU differing because its gradient formula is not right?
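As an aside, the kind of CPU vs CUDA gradient comparison discussed here can be reproduced with torchaudio directly. The following is a minimal sketch with made-up shapes and a made-up blank id, not the actual test case from the colab:

    import torch
    import torchaudio

    torch.manual_seed(0)
    # joint-network logits: (batch, T, U, vocab) with U = target_length + 1
    logits = torch.randn(1, 5, 3, 4)
    targets = torch.tensor([[1, 2]], dtype=torch.int32)
    logit_lengths = torch.tensor([5], dtype=torch.int32)
    target_lengths = torch.tensor([2], dtype=torch.int32)

    def loss_and_grad(device):
        # use a fresh leaf tensor on the target device so .grad is populated
        x = logits.detach().to(device).requires_grad_(True)
        loss = torchaudio.functional.rnnt_loss(
            x, targets.to(device), logit_lengths.to(device),
            target_lengths.to(device), blank=0, reduction="sum",
        )
        loss.backward()
        return loss.item(), x.grad.cpu()

    loss_cpu, grad_cpu = loss_and_grad("cpu")
    if torch.cuda.is_available():
        loss_gpu, grad_gpu = loss_and_grad("cuda")
        print("loss diff:", abs(loss_cpu - loss_gpu))
        print("max grad diff:", (grad_cpu - grad_gpu).abs().max().item())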

zh794390558 commented 2 years ago

[image] Why are the CPU and GPU losses for warp_transducer not equal in the colab?

I think the above wrong conclusion came from here. [image]

csukuangfj commented 2 years ago

Will warp_transducer GPU have the same grad result as optimized_transducer

No. You can find the conclusions in the colab (listed in the README.md).


Why are the CPU and GPU losses for warp_transducer not equal in the colab?

Please ask the author of warp-transducer.

zh794390558 commented 2 years ago

[image]

Using the case from the colab with ESPnet's RNN-T, the results are consistent. Is there a problem with how I am using it?

[image] [image]
csukuangfj commented 2 years ago

I just ran the colab notebook above again and found that I can no longer reproduce the previous results. I am not sure what went wrong.

zh794390558 commented 2 years ago

So does this issue still exist? Could it be a CUDA version problem?

BTW, could the torch version used in the colab be pinned? When I ran it last time, it failed to run.

csukuangfj commented 2 years ago

colab

The colab notebook given in the README.md uses a Tesla K80 GPU.

The colab notebook I ran today was assigned a Tesla T4, so the test environment is different.

If you can reproduce the issue on a Tesla K80 GPU, then it still exists. If you cannot, then it probably no longer does.

(I will try later to see whether I can reproduce it on a local V100 GPU.)


BTW, could the torch version used in the colab be pinned? When I ran it last time, it failed to run.

Yes, that can be done.
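For example, a pinned install cell at the top of the colab could look like the following; the exact versions here are only an illustrative assumption, not necessarily the ones the notebook should use:

    # first cell of the notebook: pin matching torch/torchaudio versions (example versions)
    !pip install torch==1.10.0 torchaudio==0.10.0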