EdinburghNLP / nematus

Open-Source Neural Machine Translation in Tensorflow
BSD 3-Clause "New" or "Revised" License

Float16 does not work #43

Open Avmb opened 7 years ago

Avmb commented 7 years ago

In this branch, I removed all hardcoded references to float32 and I tried to train with float16, but it does not work:

```
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: TITAN X (Pascal) (0000:02:00.0)
Loading data
Building model
Building sampler
Building f_init... Done
Building f_next.. Done
Building f_log_probs... Done
Computing gradient... Done
Building optimizers... Disabling C code for Elemwise{Cast{float32}} due to unsupported float16
Done
Total compilation time: 198.4s
Optimization
Seen 846 samples
NaN detected
```

I've also tried increasing the epsilon in the Adam optimizer, but it doesn't solve the issue.
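As an aside, the reason a larger epsilon is needed at all is that the default Adam epsilon of 1e-8 cannot even be represented in half precision. A minimal numpy sketch (illustration only, not nematus code):

```python
import numpy as np

# The usual Adam epsilon (1e-8) is below even float16's smallest subnormal
# (~6e-8), so it rounds to exactly zero when stored in half precision:
print(np.float16(1e-8))           # 0.0
print(np.finfo(np.float16).tiny)  # ~6.1e-05, smallest *normal* float16

# A larger epsilon survives the cast and keeps the Adam denominator
# sqrt(v_hat) + eps away from zero while the second-moment estimate is tiny:
eps = np.float16(1e-4)
v_hat = np.float16(0.0)
update = np.float16(1.0) / (np.sqrt(v_hat) + eps)
print(update)                     # 10000.0 -- finite, rather than inf -> NaN
```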

emjotde commented 7 years ago

Hi, it might also not be worth it. If I am not wrong, float16 is artificially capped on gamer hardware (e.g. the GTX 1080) to laughable performance: GEMM is roughly 30x slower. Not sure about the Titan X though.

kpu commented 7 years ago

We'll hopefully get access to some subset of Jade (22 DGX-1s, even though everybody lobbied them to buy normal Pascals) and Peta5 (P100s on PCI Express), and Azure has a private beta for Pascals. Totally worth it for those.

emjotde commented 7 years ago

Oh. In that case carry on :)

Avmb commented 7 years ago

Using a learning rate 10 times smaller prevents the NaN, though I still get that strange warning, only during training.

In terms of speed, training is slightly faster on our machines; I will try to benchmark on a P100 if I get the chance. I didn't measure accuracy.

emjotde commented 7 years ago

Interesting. Thing is, it should not be faster. Float16 arithmetic is severely capped. We benchmarked cuBLAS hgemm vs sgemm on a GTX 1080 once, and it was slower by a factor of 28x. And from what I read, that's intentional.
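For anyone who wants to reproduce that kind of comparison, here is a rough sketch of an fp16-vs-fp32 GEMM timing, written with CuPy as a stand-in for calling cuBLAS hgemm/sgemm directly (this is not the benchmark referred to above, and the numbers will depend entirely on the GPU):

```python
import time
import cupy as cp

# Time square matrix multiplication in float32 vs float16 on the current GPU.
# On consumer Pascal cards fp16 throughput is heavily capped; on a P100 it
# should be at least on par with fp32.
n = 4096
for dtype in (cp.float32, cp.float16):
    a = cp.random.rand(n, n).astype(dtype)
    b = cp.random.rand(n, n).astype(dtype)
    cp.matmul(a, b)                      # warm-up
    cp.cuda.Device().synchronize()
    start = time.time()
    for _ in range(10):
        cp.matmul(a, b)
    cp.cuda.Device().synchronize()
    print(dtype.__name__, "avg seconds:", (time.time() - start) / 10)
```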

Avmb commented 7 years ago

Maybe Theano is doing something smart, or accidentally smart (like not using float16 for some Ops because they haven't been implemented yet).

emjotde commented 7 years ago

Yeah, maybe on the CPU as well? Are float16 operations faster on our CPUs?

kpu commented 7 years ago

Current Intel CPUs have a float16 storage format but no float16 operations. So there's an instruction to read a 16-bit float and expand it to a 32-bit float, then do the usual multiply or add instruction.
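That pattern (half-precision storage, single-precision arithmetic) is easy to illustrate in numpy, independently of the CPU conversion instructions; a small sketch, not anything nematus currently does:

```python
import numpy as np

# float16 as a storage-only format: keep the parameters in half precision
# (half the memory / bandwidth), but widen to float32 before doing any
# arithmetic -- the same idea as the convert-on-load CPU instructions above.
W = np.random.rand(1024, 1024).astype(np.float16)   # stored in 16 bits
x = np.random.rand(1024).astype(np.float32)

y = W.astype(np.float32) @ x    # expand to 32 bits, then multiply-add as usual
print(W.nbytes, y.dtype)        # ~2 MB of weights, float32 result
```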

hieuhoang commented 7 years ago

Out of interest, do you know if you're likely to get overflows when using fp16, and if so, whether you're doing anything about it?
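(For reference, the range limits that make fp16 overflow a real concern; a minimal numpy illustration, not amun or nematus code:)

```python
import numpy as np

info = np.finfo(np.float16)
print(info.max)    # 65504.0 -- largest finite float16
print(info.tiny)   # ~6.1e-05 -- smallest normal float16

x = np.float16(60000)
y = x * np.float16(2)    # overflows to inf (numpy also emits a RuntimeWarning)
print(y)                 # inf
print(y - y)             # nan -- once an inf appears, NaNs tend to follow
```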

Avmb commented 7 years ago

I ran more training benchmarks, including some on a Tesla P100 (thanks to Università di Pisa), and the result is that there is no noticeable difference between float32 and float16. The Theano backend probably still does not properly exploit float16, and it does not even seem to handle it well in terms of numerical stability (I got NaNs for some hyperparameter settings).

As for the difference between the P100 and the TITAN X (Pascal), the TITAN X is actually equal to or slightly faster than the P100, except when training with float64 (which is probably not very useful). I've tried full-size models (--dim_word 512 --dim 1024) and batch sizes up to 256, and still got roughly the same speed on the different machines.

hieuhoang commented 6 years ago

Some feedback from my own work with fp16 in amun: when running on a P100 (wilkes), it gives about a 20% speedup over fp32. Most of the speedup is in the large matrix multiplication at the output layer.

I'm about to try again to speed up the rest of the code (element-wise operations, etc.), which requires much more work.