Currently, the _CTC.apply function allocates a "grads" tensor the same size as "acts" on every call. For large label alphabets (e.g. East Asian languages, or gram/word-level labels), this causes a significant slowdown, since we repeatedly allocate and copy data to the GPU on every call to CTC loss. Because this function runs inside a training loop and the max batch size, max sequence length, and label length are known beforehand, we can allocate this gradient tensor once in the CTCLoss function. On each forward call we simply zero the tensor and slice into it for the current input's sequence length and batch size. This can yield up to a ~10x speedup in CTCLoss.
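To make the proposal concrete, here is a minimal sketch of what the preallocation could look like, assuming a PyTorch-style module wrapping a warp-ctc-like binding. The class name PreallocCTCLoss, the (max_seq_len, max_batch, alphabet_size) buffer layout, and the forward signature are assumptions for illustration; the actual _CTC.apply interface may take its arguments differently.

```python
import torch
import torch.nn as nn


class PreallocCTCLoss(nn.Module):
    """Sketch of a CTC loss wrapper that reuses a preallocated gradient buffer."""

    def __init__(self, max_seq_len, max_batch, alphabet_size):
        super().__init__()
        # Allocate the "grads" buffer once at the maximum size seen in training.
        # Registering it as a buffer means it moves to the GPU with .cuda()/.to().
        self.register_buffer(
            "grads", torch.zeros(max_seq_len, max_batch, alphabet_size)
        )

    def forward(self, acts, labels, act_lens, label_lens):
        T, B, A = acts.shape
        # Reuse the buffer: zero it and slice out the region matching the
        # current input's sequence length and batch size.
        grads = self.grads[:T, :B, :A]
        grads.zero_()
        # The real implementation would hand `acts`, `grads`, and the label /
        # length tensors to the underlying CTC kernel (e.g. _CTC.apply) and
        # return the loss; omitted here because it depends on the binding.
        raise NotImplementedError("illustrative sketch only")
```

Using register_buffer (rather than a plain attribute) is just one option; it lets the tensor follow the module when it is moved to the GPU, so the one-time allocation ends up on the right device.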
Possible complications:
- If memory is very tight, keeping this tensor permanently allocated could be a problem.
- The CTCLoss() __init__ signature changes, which requires updating existing codebases.