It is a bug: MXNet saves index_update_counts by device_id, so when the device_id changes, index_update_counts will be reset.
https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/optimizer/optimizer.py#L408
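For context, the linked code keeps the update counts in a dict keyed by device id. The standalone sketch below is only a paraphrase for illustration (not the actual MXNet source) of why an unseen device_id starts from an empty count table even after trainer states are restored:

```python
# Minimal sketch (not the actual MXNet source) of the per-device counts.
all_index_update_counts = {}  # device_id -> {param_index: update_count}

def set_current_context(device_id):
    # An unseen device_id gets a fresh, empty count table.
    if device_id not in all_index_update_counts:
        all_index_update_counts[device_id] = {}
    return all_index_update_counts[device_id]

# Counts restored from states that were written on device 2:
all_index_update_counts[2] = {0: 1}

print(set_current_context(2))  # {0: 1} -> num_update can continue from 1
print(set_current_context(3))  # {}     -> num_update effectively restarts
```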
A temporary solution:
Use the environment variable CUDA_VISIBLE_DEVICES to select the GPU devices, and keep the device id fixed in the code:
CUDA_VISIBLE_DEVICES=1,2 python train.py
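With CUDA_VISIBLE_DEVICES=1,2, the two physical GPUs are exposed to the process as device 0 and device 1, so the script can always use the same fixed context no matter which physical GPUs were granted. A small sketch of what "fix the device id in the code" means, assuming a single-machine Gluon script:

```python
import mxnet as mx

# Physical GPUs 1 and 2 appear as gpu(0) and gpu(1) inside this process,
# so the device ids seen by the optimizer stay stable across runs.
ctx = [mx.gpu(0), mx.gpu(1)]
```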
The same problem also arises with distributed training frameworks, since the context info (device_id) differs among the distributed processes and is not saved as part of the checkpoint.
Description
As far as I know, the optimizer derives num_update from its _index_update_count, which is saved per device. This means that if the trainer states are saved on one GPU device and loaded onto another, the behavior of an lr_scheduler that relies on num_update will be different.
This is a little confusing, because in our case the GPUs are shared by the whole lab, so when I want to restore the trainer states, the available GPUs may be different. At least when the number of GPUs is the same, the behavior should be the same, or a warning should be shown; and if the GPUs are different, an error or warning should be raised.
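To make the dependence concrete: the lr scheduler is evaluated at the optimizer's num_update, so a reset counter means the schedule is evaluated at an earlier point than intended. A toy, stateless step-decay schedule (not an MXNet API, illustration only) shows the effect:

```python
# Toy step-decay schedule evaluated at num_update (illustration only).
def toy_schedule(num_update, base_lr=0.1, step=1, factor=0.5):
    return base_lr * (factor ** (num_update // step))

print(toy_schedule(2))  # lr expected after restoring states and one more step
print(toy_schedule(1))  # lr actually used when the per-device count restarts at 1
```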
To Reproduce
If I save the states on GPU 2 and also load them on GPU 2, the lr scheduler works as I expect, and num_update ends up at 2.
If I save the states on GPU 2 but load them on GPU 3, the lr scheduler does not work as I expect; num_update ends up at 1 instead.
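A rough, untested sketch of the two scenarios above (the helper run_step, the file name trainer.states, and the GPU indices are only for illustration; num_update is read from the private trainer._optimizer attribute purely for inspection):

```python
import mxnet as mx
from mxnet import gluon, autograd

def run_step(ctx, state_file=None):
    # Build a tiny model and trainer on `ctx`, optionally restore saved
    # trainer states, then take one optimizer step.
    net = gluon.nn.Dense(1, in_units=4)
    net.initialize(ctx=ctx)
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
    if state_file is not None:
        trainer.load_states(state_file)
    x = mx.nd.ones((1, 4), ctx=ctx)
    with autograd.record():
        loss = net(x).sum()
    loss.backward()
    trainer.step(1)
    mx.nd.waitall()
    return trainer

# One step on gpu(2), then save the trainer states.
saved = run_step(mx.gpu(2))
saved.save_states('trainer.states')
print(saved._optimizer.num_update)   # 1

# Restore on the same device and take one more step: the counter continues.
same = run_step(mx.gpu(2), 'trainer.states')
print(same._optimizer.num_update)    # 2, as expected

# Restore on a different device: the per-device count starts over.
other = run_step(mx.gpu(3), 'trainer.states')
print(other._optimizer.num_update)   # 1 instead of 2
```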