Mixed precision for rocm

As I work through this example, I soon discovered the following error message:

This fp16_optimizer is designed to only work with apex.contrib.optimizers.* To update, use updated optimizers with AMP. I figured out that this makes sense, as I am using AMD HW with a ROCm implementation of Pytorch. Still, the training times I get from using one node with 8 GPUs is no where near the 10 hours reported in the configuration for 50k steps, using the same yaml file (without fp16).

This raises both broad and specific questions. To begin with the latter:

How can I use mixed precision on ROCm? (And, what kind of speedup should I expect?)
Broadly speaking, are there special considerations using ROCm in performance terms, which affects the choices of optimizer, batch sizes, or parallelization type?

OpenNMT / OpenNMT-py

Mixed precision for rocm #2402