**Closed** — Tilps closed this 2 years ago
Multi-GPU training appears to cache the result of the optimizer's LR-requesting function across steps, so the first LR retrieved gets used for all subsequent steps. Changing `active_lr` to be a variable causes things to be processed correctly.
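The caching behavior described above is consistent with how `tf.function` (which the multi-GPU distributed training path compiles the step through) handles plain Python values: they are baked into the traced graph as constants, while a `tf.Variable` is read at run time on every call. A minimal sketch of the difference, assuming the issue's `active_lr` started life as a Python scalar (the function names here are hypothetical, not from the actual codebase):

```python
import tensorflow as tf

lr_py = 0.1                                  # plain Python float (the buggy setup)
lr_var = tf.Variable(0.1, trainable=False)   # the fix: hold the LR in a Variable

@tf.function
def frozen_lr():
    # A Python scalar is captured as a constant when the function is
    # first traced; later changes to the global are never seen.
    return tf.constant(lr_py)

@tf.function
def live_lr():
    # A tf.Variable read happens at execution time, so updates
    # made between steps are picked up.
    return lr_var.read_value()

print(float(frozen_lr()))  # 0.1 (first call traces and bakes in 0.1)
print(float(live_lr()))    # 0.1

lr_py = 0.01
lr_var.assign(0.01)

print(float(frozen_lr()))  # still 0.1 -- the cached, traced constant
print(float(live_lr()))    # 0.01 -- the Variable reflects the new LR
```

This also explains why the bug only shows up under multi-GPU: the single-GPU eager path re-evaluates the Python value each step, while the distributed path runs the once-traced graph.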