Closed: titu1994 closed this issue 2 years ago.
Closing as it's not an issue, just a comment.
Just wanted to note that the gradient is not incorrect for CPU vs GPU: the instructions clearly state that for the CPU you need to provide log_softmax(joint logits), whereas for the GPU you should provide only the joint logits, since the CUDA kernel will efficiently compute the log_softmax internally.
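In plain PyTorch terms, the distinction is just this (toy shapes, no warp-transducer call involved, a sketch rather than the library's actual API):

```python
import torch
import torch.nn.functional as F

# Joint-network output ("joint logits"): un-normalized scores of shape (B, T, U+1, V).
joint_logits = torch.randn(2, 4, 3, 5)

# CPU convention: hand the loss log-probabilities over the vocabulary axis.
cpu_input = F.log_softmax(joint_logits, dim=-1)

# GPU convention: hand the loss the raw logits; the CUDA kernel applies
# log_softmax internally, so normalizing here would be redundant.
gpu_input = joint_logits

# The two tensors differ by the per-frame log-partition term, so mixing up
# the conventions changes both the loss value and the gradients.
diff = cpu_input - (joint_logits - joint_logits.logsumexp(dim=-1, keepdim=True))
print(diff.abs().max())  # ~0: log_softmax(x) == x - logsumexp(x)
```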
No, they are indeed different for the same input, i.e., when using logits as input, not log_softmax.
I know that warp_transducer requires the output of log_softmax as input for its CPU version. The warp_transducer version I am using is from https://github.com/b-flo/warp-transducer/blob/839f7598856ee62178aa2158090cda1c875b6a78/pytorch_binding/warprnnt_pytorch/__init__.py#L70
    if not acts.is_cuda:
        acts = torch.nn.functional.log_softmax(acts, -1)

which is the one used by ESPnet.
I am comparing the gradients for logits.
I just created a Colab notebook to show this. Please see
https://colab.research.google.com/drive/1vMkH8LmiCCOiCo4KTTEcv-NU8_OGn0ie?usp=sharing
The above Colab notebook shows that not only do the gradients differ between CPU and CUDA, but the loss does as well.
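Roughly, the comparison boils down to something like the following (toy shapes; the exact RNNTLoss call and the device placement of the label/length tensors are my assumptions here, the linked notebook is the reference):

```python
import torch
from warprnnt_pytorch import RNNTLoss  # the b-flo binding linked above

torch.manual_seed(0)

# Identical random joint logits for both runs; shape (B, T, U+1, V) is illustrative.
acts_cpu = torch.randn(1, 4, 3, 5, requires_grad=True)
acts_gpu = acts_cpu.detach().clone().cuda().requires_grad_(True)

labels = torch.tensor([[1, 2]], dtype=torch.int32)
act_lens = torch.tensor([4], dtype=torch.int32)
label_lens = torch.tensor([2], dtype=torch.int32)

loss_fn = RNNTLoss(blank=0)

# With this binding, raw logits go in on both devices: the wrapper applies
# log_softmax itself on the CPU path (see the snippet quoted above).
loss_cpu = loss_fn(acts_cpu, labels, act_lens, label_lens)
loss_gpu = loss_fn(acts_gpu, labels.cuda(), act_lens.cuda(), label_lens.cuda())

loss_cpu.backward()
loss_gpu.backward()

print("loss diff:", (loss_cpu - loss_gpu.cpu()).abs().item())
print("max grad diff:", (acts_cpu.grad - acts_gpu.grad.cpu()).abs().max().item())
```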
@csukuangfj Just wanted to note that the gradient is not incorrect for CPU vs GPU: the instructions clearly state that for the CPU you need to provide log_softmax(joint logits), whereas for the GPU you should provide only the joint logits, since the CUDA kernel will efficiently compute the log_softmax internally.
Anyway, yours is also an efficient implementation, also written in C++. Could you benchmark the solutions if you have time? Even a naive benchmark would give some hint about relative speed. Your memory-efficient implementation is very interesting too; it trades some speed for a large saving in memory.
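Something as simple as the sketch below would already be informative (the time_loss helper and its call pattern are hypothetical, just to illustrate what I mean by a naive benchmark):

```python
import time
import torch

def time_loss(loss_fn, acts, *args, n_iters=20):
    """Naive forward+backward timing for an RNN-T loss callable that returns a scalar."""
    for _ in range(3):  # warm-up so kernel launches / caching don't skew the measurement
        acts.grad = None
        loss_fn(acts, *args).backward()
    if acts.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        acts.grad = None
        loss_fn(acts, *args).backward()
    if acts.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters

# Peak GPU memory could be compared alongside speed by calling
# torch.cuda.reset_peak_memory_stats() before a run and
# torch.cuda.max_memory_allocated() after it.
```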