baidu-research / warp-ctc

Fast parallel CTC.
Apache License 2.0

infinite CTC costs #23

Closed shaobohou closed 8 years ago

shaobohou commented 8 years ago

Apologies if I misunderstood something, but running the following code seems to return infinite CTC costs, though the gradients are fine.

th> require 'warp_ctc'
th> acts = torch.Tensor({{0,-150,0,0,0}}):float()
th> grads = torch.zeros(acts:size()):float()
th> labels = {{1}}
th> sizes = {1}
th> cpu_ctc(acts, grads, labels, sizes)
{
  1 : inf
}
th> print(grads)
 0.2500  0.0000  0.2500  0.2500  0.2500
[torch.FloatTensor of size 1x5]

Is this simply something that we have to guard against in our own Softmax code?

ekelsen commented 8 years ago

The probability of the correct label will be 0 (e^-150 / 4 isn't representable in float32), which causes an infinite cost. There is, in general, no way around this problem. Even if the internal precision were double, it would be easy enough to generate a counterexample by making the activation -746 instead of -150.
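As a quick illustration of the underflow (in the Torch REPL; exact printed formatting aside), the label's probability becomes exactly 0 in float32 and -log(0) is inf:

th> p = torch.exp(torch.FloatTensor({-150})) / 4   -- roughly the softmax probability of the -150 activation
th> print(p[1])                                    -- e^-150 underflows to 0 in float32
0
th> print(-math.log(p[1]))                         -- the cost is -log(prob)
inf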

Are you encountering this problem training an actual network, or is this an artificial example?

We could consider adding an option for enforcing a minimum probability for each label as a possible solution.

shaobohou commented 8 years ago

I came across the problem while training an actual network, which I then narrowed down to this artificial example. I had thought that if the inputs, outputs and calculations were all in log scale, then this wouldn't be a problem, so I was expecting (-150) - log(4) as the result.
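For reference, a minimal plain-Lua sketch (illustrative only, not warp-ctc code) of that log-space computation: with a single frame and a single label, the log-probability of the label is its activation minus the log-sum-exp of all activations, so the cost comes out at roughly 150 + log(4) ≈ 151.39 instead of inf. This assumes class 0 is the blank, as in warp-ctc, so label 1 is the second entry of the activation row.

-- activations for the single frame; the label is class 1, i.e. the second entry
local acts = {0, -150, 0, 0, 0}
local maxA = -math.huge
for _, a in ipairs(acts) do if a > maxA then maxA = a end end
local sum = 0
for _, a in ipairs(acts) do sum = sum + math.exp(a - maxA) end
local logZ = maxA + math.log(sum)   -- log-sum-exp of the activations, ~ log(4)
local logp = acts[2] - logZ         -- ~ (-150) - log(4)
print(-logp)                        -- ~ 151.39: a finite cost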

In my case, this happens very early on during training, so I think I can probably fix it during initialisation.

ekelsen commented 8 years ago

The softmax isn't currently done in log scale - just the calculations inside CTC are. If you can't fix it with initialization, then we can make doing the softmax in log scale a priority.
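To see what doing the softmax in log scale buys you numerically, here is an illustration in stock Torch nn (not the warp-ctc code path itself): taking the log after a float32 softmax gives -inf for the underflowed class, while a fused log-softmax keeps it finite.

th> require 'nn'
th> a = torch.FloatTensor({{0, -150, 0, 0, 0}})
th> print(torch.log(nn.SoftMax():float():forward(a)))   -- second column is -inf: the probability underflowed
th> print(nn.LogSoftMax():float():forward(a))           -- second column is ~ -151.39, still finite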

shaobohou commented 8 years ago

There was definitely something funny going on with my initialisation. I have fixed it for now by falling back to simple initialisation from a uniform distribution, so this is not urgent, and I am happy to close the issue if you are.

It would still be nice to have the softmax in log scale though.

Thanks for your help.

ekelsen commented 8 years ago

Doing the softmax in log scale, or adding it as an option, is something we are considering. Will close this for now.

ekelsen commented 8 years ago

I'm not ready to pull this into the mainline yet, but you can try this branch which does the softmax in logspace: https://github.com/ekelsen/warp-ctc/tree/log_softmax

shaobohou commented 8 years ago

Sorry about the late reply. I didn't have time to do a proper test, but it looks good on the small dataset I tried it on.

CCorfield commented 8 years ago

I have also run into infinite losses using warp-ctc. I added code to print out the max values of the predictions tensor, and sure enough there were (sufficiently) large positive numbers that suggested floating point overflow. However, when I look at the code, the softmax function appears to normalize the passed-in values by subtracting the maximum value it finds on each row -- which ought to prevent floating point overflows. Am I looking at the right version/branch, or is there some other code path I am overlooking that is vulnerable to floating point overflows?

ekelsen commented 8 years ago

All versions perform such normalization. It is not enough to prevent floating point issues. However, as a pedantic point, if the numbers are large but finite, then overflow has not occurred.

Doing the softmax in log space would likely fix these numerical issues, but it can introduce others as it results in less precision. In some of our experiments, using it leads to slightly worse training performance which is why the log_softmax branch has not been merged - but you should try it and see if it solves your particular issue.
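A small illustration of that point (plain Torch, not warp-ctc internals): subtracting the row maximum keeps exp() from overflowing, but a class whose activation sits far below the maximum still underflows to probability 0 in float32, and -log(0) is still inf.

th> x = torch.FloatTensor({1000, 850})
th> shifted = x - x:max()        -- {0, -150}: exp(1000) can no longer overflow
th> p = torch.exp(shifted)
th> print(p / p:sum())           -- {1, 0}: exp(-150) still underflows, so the small probability is exactly 0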

CCorfield commented 8 years ago

Having raised the issue that I too have encountered infinities, my hunch is that it is a little too easy to "blame" the loss function. For example, if the gap between the largest output value and the second largest is too big, that alone can trigger floating point over/underflows, so a subtler touch earlier in the network would be a better strategy. Torch does not yet have a cudnn implementation of a clipped ReLU; the closest thing would be nn.HardTanh(0,20) to mimic the DS II paper's clipped ReLU settings. As I write this, I don't know whether nn.HardTanh() would take advantage of GPU functionality or impact performance by causing data transfer between GPU and CPU (if you know the answer and are kind enough to share -- let me know!).

I'll also give the log version a try and report back (it will take a few days to turn around) -- happy to be a part-time QA engineer for you.
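For anyone trying the same workaround, a minimal sketch of the clipped-ReLU idea with nn.HardTanh; whether it has a dedicated GPU kernel is something I haven't verified, but :cuda() is the usual way to keep a layer on the GPU when cutorch/cunn are installed.

require 'nn'
local clip = nn.HardTanh(0, 20)          -- min(max(x, 0), 20), i.e. the DS2-style clipped ReLU
local x = torch.Tensor({-5, 3, 42})
print(clip:forward(x))                   -- 0, 3, 20
-- clip:cuda() converts the layer for GPU use when cutorch/cunn are available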