Open · tsdalton opened this issue 4 years ago
I have a Seq2Seq network with attention, and when training with Apex at the O1 optimization level, training is more than 3x slower than plain FP32. It seems that bmm (the batched matrix multiply used in the attention step) is the culprit. Any ideas why this is happening?
PyTorch 1.3, CUDA 10.1, cuDNN 7.6.4, NVIDIA Tesla K80
[Profiler screenshots: FP32 vs. Mixed Precision (O1)]
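To check whether bmm itself is the slow operation, it can be isolated from the rest of the network with a standalone timing loop. The sketch below is illustrative only: the tensor shapes are assumptions (the report does not give the attention dimensions), and it simply compares torch.bmm throughput in fp32 and fp16 on the current GPU.

```python
import time
import torch

def time_bmm(dtype, iters=100, warmup=10):
    # Shapes are hypothetical, roughly an attention-score matmul:
    # (batch, tgt_len, hidden) x (batch, hidden, src_len).
    a = torch.randn(64, 128, 512, device="cuda", dtype=dtype)
    b = torch.randn(64, 512, 128, device="cuda", dtype=dtype)
    for _ in range(warmup):
        torch.bmm(a, b)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        torch.bmm(a, b)
    torch.cuda.synchronize()
    return (time.time() - start) / iters

print("fp32: %.6f s/iter" % time_bmm(torch.float32))
print("fp16: %.6f s/iter" % time_bmm(torch.float16))
```

One plausible explanation, worth verifying with the loop above: the Tesla K80 is a Kepler-generation card with no fast fp16 arithmetic (tensor cores only arrived with Volta), so fp16 bmm kernels can genuinely run slower there than their fp32 counterparts.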
Comment:
What version of CUDA are you using? For versions earlier than 9.1, bmm was known to be slow in fp16. The current version of Apex should keep it in fp32 for such versions, though.

tsdalton:
Added versions above; I'm using CUDA 10.1.
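If fp16 bmm does turn out to be the bottleneck on this hardware, Apex exposes a registration hook that forces a given function to run in fp32 under O1. A minimal sketch of that workaround follows; the Linear model is a stand-in for the actual Seq2Seq network, and the registration must happen before amp.initialize():

```python
import torch
from apex import amp

# Tell Apex's O1 patching to always cast torch.bmm inputs to fp32.
# This must be registered before amp.initialize() is called.
amp.register_float_function(torch, "bmm")

model = torch.nn.Linear(512, 512).cuda()  # stand-in for the real Seq2Seq model
optimizer = torch.optim.Adam(model.parameters())
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
```

With this in place, O1 keeps the rest of the network in mixed precision while bmm runs in fp32, which matches what the comment above says Apex already does automatically for CUDA versions older than 9.1.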