Performance improvements

Create zeros() directly on GPU rather than move from CPU to GPU.

Allow for num_workers > 1 (move .cuda() out of loader)

Don't recompute batch_loss 3x

Use cudnn.benchmark

Use pin_memory

For me, these changes together give a training speed improvement of ~8x.
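For reference, the changes listed above could look roughly like the sketch below. This is not the actual RTNet code: the dataset, the model, and the names `make_loader`, `train_one_epoch`, and `state` are hypothetical stand-ins chosen for illustration, and the snippet falls back to the CPU when no GPU is present.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Let cuDNN autotune its convolution algorithms. This tends to help most
# when input shapes stay the same from batch to batch.
torch.backends.cudnn.benchmark = True

# Allocate directly on the target device instead of torch.zeros(...).cuda().
state = torch.zeros(4, 8, device=device)

# Hypothetical stand-in dataset. It yields CPU tensors only, so the worker
# processes never touch CUDA and num_workers > 1 becomes possible.
dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))


def make_loader(ds, num_workers=2):
    # pin_memory only matters when batches are later copied to a GPU.
    return DataLoader(
        ds,
        batch_size=16,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),
    )


def train_one_epoch(model, loader, optimizer, criterion):
    total = 0.0
    for inputs, targets in loader:
        # The host-to-device copy happens here, in the main process,
        # rather than inside the dataset or loader.
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        optimizer.zero_grad()
        # Compute the loss once and reuse the same tensor for the
        # backward pass and for logging.
        batch_loss = criterion(model(inputs), targets)
        batch_loss.backward()
        optimizer.step()
        total += batch_loss.item()
    return total / len(loader)


model = torch.nn.Linear(8, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()
```

On platforms that start worker processes with spawn (Windows, macOS), the code that iterates over a loader with `num_workers > 0` should sit under an `if __name__ == "__main__":` guard.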

Thanks for your constructive suggestions! There is indeed a lot of room for optimization in the project, and you have really improved it. I am occupied with other work at the moment, but I will carefully revise the code in the near future to make the project easier to use.

Andong-Li-speech / RTNet

Performance improvements #4