Closed: titu1994 closed this issue 2 years ago.
Closing as it's not an issue, just a comment.
Just wanted to note that the gradient is not incorrect for CPU vs GPU: the instructions clearly state that for the CPU you need to provide log_softmax(joint logits), whereas for the GPU you should provide only the joint logits, since the CUDA kernel will efficiently compute the log_softmax internally.
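In plain PyTorch terms, the distinction is just this (toy shapes, no warp-transducer call involved, a sketch rather than the library's actual API):

```python
import torch
import torch.nn.functional as F

# Joint-network output ("joint logits"): un-normalized scores of shape (B, T, U+1, V).
joint_logits = torch.randn(2, 4, 3, 5)

# CPU convention: hand the loss log-probabilities over the vocabulary axis.
cpu_input = F.log_softmax(joint_logits, dim=-1)

# GPU convention: hand the loss the raw logits; the CUDA kernel applies
# log_softmax internally, so normalizing here would be redundant.
gpu_input = joint_logits

# The two tensors differ by the per-frame log-partition term, so mixing up
# the conventions changes both the loss value and the gradients.
diff = cpu_input - (joint_logits - joint_logits.logsumexp(dim=-1, keepdim=True))
print(diff.abs().max())  # ~0: log_softmax(x) == x - logsumexp(x)
```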
No, they are indeed different for the same input, i.e., when using logits as input, not log_softmax.
I know that warp_transducer requires the output of log_softmax as input for its CPU version. The warp_transducer version I am using is from https://github.com/b-flo/warp-transducer/blob/839f7598856ee62178aa2158090cda1c875b6a78/pytorch_binding/warprnnt_pytorch/__init__.py#L70
    if not acts.is_cuda:
        acts = torch.nn.functional.log_softmax(acts, -1)

which is the one used by ESPnet.
I am comparing the gradients for logits.
I just created a Colab notebook to show this. Please see
https://colab.research.google.com/drive/1vMkH8LmiCCOiCo4KTTEcv-NU8_OGn0ie?usp=sharing
The above Colab notebook shows that not only do the gradients differ between CPU and CUDA, but the loss does as well.
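Roughly, the comparison boils down to something like the following (toy shapes; the exact RNNTLoss call and the device placement of the label/length tensors are my assumptions here, the linked notebook is the reference):

```python
import torch
from warprnnt_pytorch import RNNTLoss  # the b-flo binding linked above

torch.manual_seed(0)

# Identical random joint logits for both runs; shape (B, T, U+1, V) is illustrative.
acts_cpu = torch.randn(1, 4, 3, 5, requires_grad=True)
acts_gpu = acts_cpu.detach().clone().cuda().requires_grad_(True)

labels = torch.tensor([[1, 2]], dtype=torch.int32)
act_lens = torch.tensor([4], dtype=torch.int32)
label_lens = torch.tensor([2], dtype=torch.int32)

loss_fn = RNNTLoss(blank=0)

# With this binding, raw logits go in on both devices: the wrapper applies
# log_softmax itself on the CPU path (see the snippet quoted above).
loss_cpu = loss_fn(acts_cpu, labels, act_lens, label_lens)
loss_gpu = loss_fn(acts_gpu, labels.cuda(), act_lens.cuda(), label_lens.cuda())

loss_cpu.backward()
loss_gpu.backward()

print("loss diff:", (loss_cpu - loss_gpu.cpu()).abs().item())
print("max grad diff:", (acts_cpu.grad - acts_gpu.grad.cpu()).abs().max().item())
```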
@csukuangfj Just wanted to note that the gradient is not incorrect for CPU vs GPU: the instructions clearly state that for the CPU you need to provide log_softmax(joint logits), whereas for the GPU you should provide only the joint logits, since the CUDA kernel will efficiently compute the log_softmax internally.
Anyway, yours is also an efficient implementation, also written in C++. Could you benchmark the solutions if you have time? Even a naive benchmark would give some hint about relative speed. Your memory-efficient implementation is very interesting too; it trades some speed for a large saving in memory.
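Something as simple as the sketch below would already be informative (the time_loss helper and its call pattern are hypothetical, just to illustrate what I mean by a naive benchmark):

```python
import time
import torch

def time_loss(loss_fn, acts, *args, n_iters=20):
    """Naive forward+backward timing for an RNN-T loss callable that returns a scalar."""
    for _ in range(3):  # warm-up so kernel launches / caching don't skew the measurement
        acts.grad = None
        loss_fn(acts, *args).backward()
    if acts.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        acts.grad = None
        loss_fn(acts, *args).backward()
    if acts.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters

# Peak GPU memory could be compared alongside speed by calling
# torch.cuda.reset_peak_memory_stats() before a run and
# torch.cuda.max_memory_allocated() after it.
```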