csukuangfj / optimized_transducer

Memory efficient transducer loss computation

transducer grad compute formula #37

Open zh794390558 opened 2 years ago

zh794390558 commented 2 years ago

The formula for the gradient below is from warprnnt_numba and warp_transducer CPU:

    T, U, _ = log_probs.shape
    grads = np.full(log_probs.shape, -float("inf"))
    log_like = betas[0, 0]  # == alphas[T - 1, U - 1] + betas[T - 1, U - 1]

    # grad to the final blank transition, and to blank transitions at t < T - 1
    grads[T - 1, U - 1, blank] = alphas[T - 1, U - 1]
    grads[: T - 1, :, blank] = alphas[: T - 1, :] + betas[1:, :]

    # grad to label transitions
    for u, l in enumerate(labels):
        grads[:, u, l] = alphas[:, u] + betas[:, u + 1]

    grads = -np.exp(grads + log_probs - log_like)

That is not the same as torchaudio, optimized_transducer, and warp_transducer GPU, but you said that the warp_transducer CPU grad is the same as optimized_transducer and torchaudio. How is that achieved?
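For reference, here is a minimal sketch of the standard RNN-T forward-backward recursions that produce the alphas, betas, and log_like used in the snippet above. The variable names follow the snippet; the actual implementations in warprnnt_numba and warp_transducer may differ in details such as batching and length masking.

    import numpy as np

    def forward_backward(log_probs, labels, blank):
        # log_probs: (T, U, V) log-softmax output of the joint network
        # labels: the U - 1 target label ids (without blank)
        T, U, _ = log_probs.shape
        alphas = np.zeros((T, U))
        betas = np.zeros((T, U))

        # alphas[t, u]: log-prob of all partial paths that have consumed
        # t acoustic frames and emitted the first u labels.
        for t in range(1, T):
            alphas[t, 0] = alphas[t - 1, 0] + log_probs[t - 1, 0, blank]
        for u in range(1, U):
            alphas[0, u] = alphas[0, u - 1] + log_probs[0, u - 1, labels[u - 1]]
        for t in range(1, T):
            for u in range(1, U):
                no_emit = alphas[t - 1, u] + log_probs[t - 1, u, blank]
                emit = alphas[t, u - 1] + log_probs[t, u - 1, labels[u - 1]]
                alphas[t, u] = np.logaddexp(no_emit, emit)

        # betas[t, u]: log-prob of completing the alignment from node (t, u).
        betas[T - 1, U - 1] = log_probs[T - 1, U - 1, blank]
        for t in range(T - 2, -1, -1):
            betas[t, U - 1] = betas[t + 1, U - 1] + log_probs[t, U - 1, blank]
        for u in range(U - 2, -1, -1):
            betas[T - 1, u] = betas[T - 1, u + 1] + log_probs[T - 1, u, labels[u]]
        for t in range(T - 2, -1, -1):
            for u in range(U - 2, -1, -1):
                no_emit = betas[t + 1, u] + log_probs[t, u, blank]
                emit = betas[t, u + 1] + log_probs[t, u, labels[u]]
                betas[t, u] = np.logaddexp(no_emit, emit)

        log_like = betas[0, 0]  # total log-likelihood of the label sequence
        return alphas, betas, log_like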

csukuangfj commented 2 years ago

but you said that the warp_transducer CPU grad is the same as optimized_transducer and torchaudio

Where did you find that?

csukuangfj commented 2 years ago

The README.md says:

Therefore, optimized_transducer produces the same alpha and beta as warp-transducer for the same input.

It only says alpha and beta, not grad.

zh794390558 commented 2 years ago

It borrows the methods of computing alpha and beta from warp-transducer. Therefore, optimized_transducer produces the same alpha and beta as warp-transducer for the same input.

However, warp-transducer produces different gradients for CPU and CUDA when using the same input. See https://github.com/HawkAaron/warp-transducer/issues/93. I also created a [colab notebook](https://colab.research.google.com/drive/1vMkH8LmiCCOiCo4KTTEcv-NU8_OGn0ie?usp=sharing) to reproduce that issue.

This project produces consistent gradient on CPU and CUDA for the same input, just like what torchaudio is doing. (We borrow the gradient computation formula from torchaudio).

Sorry, I got it wrong. So the known conclusion is that torchaudio is aligned with optimized_transducer. Will warp_transducer GPU have the same grad result as optimized_transducer, with only warp_transducer CPU differing because its gradient formula is not right?
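As an aside, the kind of CPU vs CUDA gradient comparison discussed here can be reproduced with torchaudio directly. The following is a minimal sketch with made-up shapes and a made-up blank id, not the actual test case from the colab:

    import torch
    import torchaudio

    torch.manual_seed(0)
    # joint-network logits: (batch, T, U, vocab) with U = target_length + 1
    logits = torch.randn(1, 5, 3, 4)
    targets = torch.tensor([[1, 2]], dtype=torch.int32)
    logit_lengths = torch.tensor([5], dtype=torch.int32)
    target_lengths = torch.tensor([2], dtype=torch.int32)

    def loss_and_grad(device):
        # use a fresh leaf tensor on the target device so .grad is populated
        x = logits.detach().to(device).requires_grad_(True)
        loss = torchaudio.functional.rnnt_loss(
            x, targets.to(device), logit_lengths.to(device),
            target_lengths.to(device), blank=0, reduction="sum",
        )
        loss.backward()
        return loss.item(), x.grad.cpu()

    loss_cpu, grad_cpu = loss_and_grad("cpu")
    if torch.cuda.is_available():
        loss_gpu, grad_gpu = loss_and_grad("cuda")
        print("loss diff:", abs(loss_cpu - loss_gpu))
        print("max grad diff:", (grad_cpu - grad_gpu).abs().max().item())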

zh794390558 commented 2 years ago

[image] Why are the CPU and GPU losses for warp_transducer not equal in the colab?

I think the above wrong conclusion came from here. [image]

csukuangfj commented 2 years ago

Will warp_transducer GPU have the same grad result as optimized_transducer

No. You can find the conclusions in the colab (listed in the README.md).


Why are the CPU and GPU losses for warp_transducer not equal in the colab?

Please ask the author of warp-transducer.

zh794390558 commented 2 years ago

[image]

Using the case from the colab with ESPnet's RNN-T, the results are consistent. Is there a problem with how I am using it?

[image] [image]
csukuangfj commented 2 years ago

I just ran the colab notebook above again and found that I can no longer reproduce the previous results. I am not sure what went wrong.

zh794390558 commented 2 years ago

So does this issue still exist? Could it be a CUDA version problem?

BTW, could the torch version used in the colab be pinned? When I ran it last time, it failed to run.

csukuangfj commented 2 years ago

colab

The colab notebook given in the README.md uses a Tesla K80 GPU.

The colab notebook I ran today was assigned a Tesla T4, so the test environment is different.

If you can reproduce the issue on a Tesla K80 GPU, then it still exists. If you cannot, then it probably no longer does.

(I will try later to see whether I can reproduce it on a local V100 GPU.)


BTW, could the torch version used in the colab be pinned? When I ran it last time, it failed to run.

Yes, that can be done.
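For example, a pinned install cell at the top of the colab could look like the following; the exact versions here are only an illustrative assumption, not necessarily the ones the notebook should use:

    # first cell of the notebook: pin matching torch/torchaudio versions (example versions)
    !pip install torch==1.10.0 torchaudio==0.10.0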