Open varunrajk opened 5 years ago
@mxnet-label-bot [KVStore, Thread Safety]
Hi @szha @frankfliu, is there any way to get around this? I have 2 sets of parameters in the same model, each needing a different optimizer and optimizer_params. When I create 2 Gluon trainers with the same kvstore, I run into this issue.
Any updates here? Got the same problem.
Description
The Gluon Trainer step method uses enumerations as keys to push and pull gradients/parameters from kvstore. Using two trainers within a single worker script (in a distributed learning setting) can cause an issue because each trainer uses the same set of keys on the distributed KVStore.
Minimum reproducible example
Here is a simple example script that demonstrates this issue:
The trainer_test method tests a single step update on a simplified regression problem. It takes as inputs
Steps to reproduce
Execute the script by using mxnet's launch.py tool with 2 or more workers and a local launcher as follows:
The script execution freezes after evaluating the first trainer_test call.
What have you tried to solve it?
Replacing the enumerated kvstore keys in the trainer with unique parameter names (for example, param.name) solves this issue.
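To make the key overlap concrete, here is a minimal pure-Python sketch. It is not MXNet code and does not reproduce the hang: ToyKVStore, register, and the parameter names are hypothetical stand-ins, assumed only for illustration. It shows that two trainers which both number their parameters from 0 end up registering the same keys on a shared store, while name-based keys stay disjoint.

```python
from typing import Dict, List


class ToyKVStore:
    """Hypothetical stand-in for a shared key-value store (not MXNet's KVStore).

    It only records which trainer/parameter pair registered each key.
    """

    def __init__(self) -> None:
        self.owners: Dict[object, List[str]] = {}

    def init(self, key, owner: str) -> None:
        self.owners.setdefault(key, []).append(owner)

    def collisions(self) -> Dict[object, List[str]]:
        # Keys that were registered by more than one trainer/parameter pair.
        return {k: v for k, v in self.owners.items() if len(v) > 1}


def register(store: ToyKVStore, trainer: str, param_names: List[str],
             use_names: bool) -> None:
    """Register each parameter under its enumeration index or its unique name."""
    for index, name in enumerate(param_names):
        key = name if use_names else index
        store.init(key, f"{trainer}:{name}")


# Two disjoint parameter sets from one model (names are made up for the sketch).
params_a = ["encoder_weight", "encoder_bias"]
params_b = ["decoder_weight", "decoder_bias"]

# Scheme 1: enumeration indices as keys, so both trainers use keys 0 and 1.
indexed = ToyKVStore()
register(indexed, "trainer_a", params_a, use_names=False)
register(indexed, "trainer_b", params_b, use_names=False)

# Scheme 2: unique parameter names as keys, so the key sets do not overlap.
named = ToyKVStore()
register(named, "trainer_a", params_a, use_names=True)
register(named, "trainer_b", params_b, use_names=True)

print("index keys shared by both trainers:", sorted(indexed.collisions()))
print("name keys shared by both trainers:", sorted(named.collisions()))
```

Under the index scheme both trainers claim keys 0 and 1, which matches the overlap described in this issue; with names as keys nothing is shared. The name-based scheme assumes every parameter name is unique across the trainers that share the store.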