Avmb opened this issue 7 years ago
Hi, it might also not be worth it. If I'm not wrong, float16 is artificially capped on gamer hardware, e.g. the GTX 1080, to laughable performance: roughly 30x slower GEMM. Not sure about the Titan X though.
We'll hopefully get access to some subset of Jade (22 DGX-1s, even though everybody lobbied them to buy normal Pascals) and Peta5 (P100s on PCI Express), and Azure has a private beta for Pascals. Totally worth it for those.
Oh. In that case carry on :)
Using a learning rate 10 times smaller prevents the NaN, though I still get that strange warning, but only during training.
In terms of speed, training is slightly faster on our machines; I will try to benchmark on a P100 if I get the chance. I didn't measure accuracy.
Interesting. Thing is, it should not be faster. Float16 arithmetic is severely capped. We benchmarked cuBLAS hgemm vs. sgemm on a GTX 1080 once, and it was slower by a factor of 28x. And from what I read, that's intentional.
Maybe Theano is doing something smart, or accidentally smart (like not using float16 for some ops because they haven't been implemented yet).
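For what it's worth, one way to check what the Theano backend actually does with float16 GEMM (rather than trusting the theoretical hgemm numbers) is to time a bare dot product in both precisions. A rough sketch, assuming the gpuarray backend (run with e.g. THEANO_FLAGS=device=cuda0); the matrix size and iteration count are arbitrary:

```python
import time
import numpy as np
import theano
import theano.tensor as T

def time_gemm(dtype, n=4096, iters=20):
    # Keep everything on the GPU: accumulate the result into a shared
    # variable so host transfers don't dominate the timing.
    a = theano.shared(np.random.rand(n, n).astype(dtype))
    b = theano.shared(np.random.rand(n, n).astype(dtype))
    c = theano.shared(np.zeros((n, n), dtype=dtype))
    f = theano.function([], [], updates=[(c, T.dot(a, b))])
    f()                      # warm-up call
    start = time.time()
    for _ in range(iters):
        f()
    c.get_value()            # force the GPU to finish before stopping the clock
    return (time.time() - start) / iters

for dtype in ('float32', 'float16'):
    print('%s: %.4f s per 4096x4096 GEMM' % (dtype, time_gemm(dtype)))
```

If float16 comes out roughly as fast as float32 on a GTX 1080 / Titan X, that would suggest Theano is not actually running hgemm for those ops.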
Yeah, maybe on the CPU as well? Are float16 operations faster on our CPUs?
Current Intel CPUs have a float16 storage format but not float16 operations. So there's an instruction to read a 16-bit float and expand it to a 32-bit float, then do the usual multiply or add instruction.
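The same storage-vs-compute split shows up in software: if I remember correctly, numpy also treats float16 as a storage type and does the arithmetic by widening to float32 and narrowing back, so a quick sketch like this (sizes arbitrary) shows half the memory but no speedup:

```python
import time
import numpy as np

a32 = np.random.rand(1 << 22).astype(np.float32)
a16 = a32.astype(np.float16)

# float16 halves the storage...
print(a32.nbytes, a16.nbytes)        # 16 MB vs 8 MB

# ...but the arithmetic is emulated (widen, operate, narrow),
# so it is typically no faster than float32, often slower.
for a in (a32, a16):
    start = time.time()
    for _ in range(100):
        _ = a * a
    print(a.dtype, '%.3f s' % (time.time() - start))
```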
Out of interest, do you know if you're likely to get overflows when using fp16, and if you're doing anything about it?
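As an aside on the overflow question: float16 saturates at 65504, so intermediate values that are harmless in float32 can already become inf and then turn into NaN downstream, which may be related to the NaNs reported in this thread. A quick numpy check of the arithmetic:

```python
import numpy as np

# Largest finite float16 value.
print(np.finfo(np.float16).max)      # 65504.0

x = np.float16(300.0)
print(x * x)                         # inf (overflow: 90000 > 65504)
print(x * x - x * x)                 # nan (inf - inf)
```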
I ran more training benchmarks, including some on a Tesla P100 (thanks to Università di Pisa), and the result is that there is no noticeable difference between float32 and float16. Probably the Theano backend still does not properly exploit float16, and it does not even seem to handle it well in terms of numerical stability (I got NaNs for some hyperparameter settings).
As for the difference between the P100 and the TITAN X (Pascal), the TITAN X is actually equal to or slightly faster, except when training with float64 (which is probably not very useful). I've tried full-size models (--dim_word 512 --dim 1024) and batch sizes up to 256 and still got roughly the same speed across the machines.
Feedback from my own work with fp16 in amun: when running on a P100 (Wilkes), it gives about a 20% speedup over fp32. Most of the speedup is in the large matrix multiplication at the output layer.
I'm about to try again to speed up the rest of the code (element-wise operations etc.), which requires much more work.
In this branch, I removed all hardcoded references to float32 and I tried to train with float16, but it does not work:
```
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: TITAN X (Pascal) (0000:02:00.0)
Loading data
Building model
Building sampler
Building f_init... Done
Building f_next.. Done
Building f_log_probs... Done
Computing gradient... Done
Building optimizers... Disabling C code for Elemwise{Cast{float32}} due to unsupported float16
Done
Total compilation time: 198.4s
Optimization
Seen 846 samples
NaN detected
```
I've also tried increasing the epsilon in the Adam optimizer, but it doesn't solve the issue.
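One possible reason raising epsilon alone doesn't help (a guess, since I don't know where Theano keeps the accumulators in this setup): if the Adam state itself is stored in float16, both the usual 1e-8 epsilon and the squared gradients underflow to exactly zero, so the denominator can still hit zero regardless of the epsilon in the config. A small numpy illustration of the arithmetic (not the Nematus code):

```python
import numpy as np

# 1e-8 is below the smallest positive float16 (~6e-8), so it rounds to zero.
eps = np.float16(1e-8)
print(eps)                       # 0.0

grad = np.float16(1e-4)
v = grad * grad                  # 1e-8 also underflows to 0.0 in float16
print(v)                         # 0.0

update = grad / (np.sqrt(v) + eps)
print(update)                    # inf (divide by zero), NaN soon after
```

The usual workaround is to keep the optimizer state (and ideally a float32 master copy of the weights) in float32 and only run the forward/backward pass in float16, but I'm not sure how much of that the current Theano float16 support lets you control.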