Closed by jonashaag 4 years ago
- Create zeros() directly on GPU rather than move from CPU to GPU.
- Allow for num_workers > 1 (move .cuda() out of loader)
- Don't recompute batch_loss 3x
- Use cudnn.benchmark
- Use pin_memory
For me it's a training speed improvement of ~8x
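A minimal sketch of how the changes listed above might fit together in a PyTorch training loop. This is not the actual diff: `make_loader`, `train_epoch`, and the MSE loss are hypothetical stand-ins for the project's own code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

# Let cuDNN autotune its kernels; this helps most when input shapes stay fixed.
torch.backends.cudnn.benchmark = True


def make_loader(dataset, batch_size=16, num_workers=2):
    # The dataset yields CPU tensors only, so worker processes never touch CUDA
    # and num_workers can be raised. Pinned memory speeds up host-to-GPU copies.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),
    )


def train_epoch(model, loader, optimizer, device):
    criterion = nn.MSELoss()  # assumption: the real project uses its own loss
    # Allocate the accumulator on the target device directly,
    # instead of creating it on the CPU and moving it over.
    running_loss = torch.zeros((), device=device)
    model.train()
    for inputs, targets in loader:
        # Move batches to the device in the training loop, not in the loader.
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        optimizer.zero_grad()
        # Compute the batch loss once and reuse it for backward and logging.
        batch_loss = criterion(model(inputs), targets)
        batch_loss.backward()
        optimizer.step()
        running_loss += batch_loss.detach()
    # Single device-to-host sync at the end of the epoch.
    return (running_loss / len(loader)).item()
```

Accumulating the loss as a tensor on the device and calling `.item()` only once per epoch avoids a GPU synchronization on every step, which is one plausible source of the speedup reported above.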
Thanks for your constructive suggestions! There is indeed a lot of room for optimization in this project, and you have really improved it. I am occupied with other work at the moment, but I will carefully revise the code in the near future and make the project easier to use.