Optimizer initialization should set rescale_grad appropriately

This could arguably be addressed at the Module API level, but the problem is more obvious with the Keras integration.

When creating an instance of mx.optimizer.Optimizer, if rescale_grad is not specified, it defaults to 1.0, which has a significant impact on training. In fact, a warning in the logs points this out when the optimizer is initialized:

/usr/local/lib/python2.7/site-packages/mxnet/module/bucketing_module.py:408: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (1.0 vs. 0.0078125). Is this intended?
  force_init=force_init)
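
For reference, the value in the warning can be reproduced by hand. The following is a minimal pure-Python sketch, not MXNet code: it assumes a batch size of 128 on a single worker (one combination consistent with 0.0078125) and assumes gradients arrive summed over the batch. The helper names `normalized_rescale_grad` and `sgd_step` are hypothetical and only illustrate the arithmetic.

```python
def normalized_rescale_grad(batch_size, num_workers=1):
    """Return 1.0 / batch_size / num_workers, the value named in the warning."""
    if batch_size <= 0 or num_workers <= 0:
        raise ValueError("batch_size and num_workers must be positive")
    return 1.0 / batch_size / num_workers


def sgd_step(weight, summed_grad, learning_rate, rescale_grad):
    """Plain SGD update where the gradient is multiplied by rescale_grad first."""
    return weight - learning_rate * rescale_grad * summed_grad


batch_size = 128  # assumption: consistent with 0.0078125 on one worker
scale = normalized_rescale_grad(batch_size)

# Assumption: every sample contributes a gradient of 0.5 and the
# gradients are summed (not averaged) over the batch.
summed_grad = 0.5 * batch_size

# With rescale_grad=1.0 the step is batch_size times larger than intended.
unscaled = sgd_step(1.0, summed_grad, learning_rate=0.1, rescale_grad=1.0)
scaled = sgd_step(1.0, summed_grad, learning_rate=0.1, rescale_grad=scale)

print("rescale_grad:", scale)
print("weight after step, rescale_grad=1.0:", unscaled)
print("weight after step, normalized:", scaled)
```

Under these assumptions, leaving rescale_grad at 1.0 acts like multiplying the learning rate by the batch size.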

Since the MXNet implementations of the Keras optimizers essentially delegate to the Module versions, this parameter should probably be set to the normalized value by default, because the need for it is not obvious from the Keras API. It is possible to pass rescale_grad as an additional argument, but that requires the user to know some of the details of both frameworks.
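
One possible shape for the requested behavior is sketched below. It is an illustration only, not code from keras-apache-mxnet: `with_default_rescale_grad` is a hypothetical helper, and it assumes the integration knows the batch size and worker count at the point where the optimizer arguments are assembled. An explicit rescale_grad from the user is left untouched.

```python
def with_default_rescale_grad(optimizer_kwargs, batch_size, num_workers=1):
    """Return a copy of optimizer_kwargs with rescale_grad filled in if absent.

    Hypothetical helper: the default is the normalized value from the
    warning, 1.0 / batch_size / num_workers.
    """
    if batch_size <= 0 or num_workers <= 0:
        raise ValueError("batch_size and num_workers must be positive")
    kwargs = dict(optimizer_kwargs)  # do not mutate the caller's dict
    kwargs.setdefault("rescale_grad", 1.0 / batch_size / num_workers)
    return kwargs


# User did not set rescale_grad: the normalized value is supplied.
defaulted = with_default_rescale_grad({"learning_rate": 0.01}, batch_size=128)

# User set rescale_grad explicitly: their value is kept.
explicit = with_default_rescale_grad(
    {"learning_rate": 0.01, "rescale_grad": 1.0}, batch_size=128
)

print(defaulted)
print(explicit)
```

Using `setdefault` keeps the change backward compatible for users who already pass rescale_grad themselves.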

awslabs / keras-apache-mxnet

Optimizer initialization should set rescale_grad appropriately #210