Essentially, a DDP module wraps a set of parameters and a function `f` (as the `forward()` of the DDP module). After the wrapping, when `forward()` (i.e. `f`) is called, every result and intermediate result that depends on the parameters is marked, and autograd hooks are injected for those results. Later, when `backward()` is called on the results, those hooks invoke the reducer (synchronization between subprocesses). Note that `backward()` on the results can be called either directly or indirectly (i.e. as a result of calling `backward()` on values that depend on them). This means that as long as all the to-be-updated parameters in `loss(f(x))` are used inside `f`, we only need to wrap `f` (as opposed to having to wrap `loss()`).
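Below is a minimal sketch of that conclusion, assuming `torch.distributed` has already been initialized with one process per GPU: only `f` is wrapped, the loss is computed by a plain function outside DDP, and gradients are still synchronized when `backward()` runs.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes dist.init_process_group(...) was already called in each process.
rank = dist.get_rank()

# Only f (here a toy linear layer) is wrapped; the loss stays outside DDP.
f = DDP(nn.Linear(4, 2).to(rank), device_ids=[rank])

x = torch.randn(8, 4, device=rank)
loss = (f(x) ** 2).mean()  # loss is computed outside of the DDP wrapper
loss.backward()            # the injected hooks fire here and run the reducer
```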
The answer is yes. Theoretically it works, and I coded an experiment to verify it. It is worth noting that if you have more than one DDP-wrapped module, the order in which they are called needs to be exactly the same in every subprocess. Because of how DDP works, if the order differs, the reducer of module A in process 1 might be waiting for its counterpart in process 2, while in process 2 the reducer of module B is waiting for its counterpart in process 1 - effectively a textbook example of deadlock.
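A minimal sketch of that failure mode (the module names and sizes are made up for illustration): if the two ranks run their backward passes through the wrapped modules in different orders, the collectives issued by the reducers no longer match up across ranks and the processes can hang.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

rank = dist.get_rank()  # assumes the process group is initialized

# Two independently DDP-wrapped modules.
actor = DDP(nn.Linear(8, 2).to(rank), device_ids=[rank])
critic = DDP(nn.Linear(8, 1).to(rank), device_ids=[rank])

x = torch.randn(4, 8, device=rank)

# WRONG: rank-dependent call order. Rank 0 waits on actor's all-reduce
# while rank 1 waits on critic's all-reduce - neither can complete.
if rank == 0:
    actor(x).sum().backward()
    critic(x).sum().backward()
else:
    critic(x).sum().backward()
    actor(x).sum().backward()
```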
While working on enabling the `@data_distributed` decorator for the off-policy branch, I hit a blocker: at the initial sync there is an exception complaining "Tensors must be CUDA and dense".
After some digging I found that the problem comes from the fact that when DDP starts to sync (reduce), it syncs the buffers of the wrapped module as well. All the offending buffers are within the replay buffer. I am working on a generic way to rule them out before the module is wrapped by DDP.
One of the problems is that the `replay_buffer` can be found in `named_buffers()` but not in the `state_dict()` of the wrapped module. Investigating the reason now.
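One plausible explanation (my assumption, not confirmed in this thread) is that the replay buffer tensors are registered as non-persistent buffers: `register_buffer(..., persistent=False)` keeps a buffer visible to `named_buffers()` but excludes it from `state_dict()`.

```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        # A non-persistent buffer: tracked by the module, but not saved.
        self.register_buffer("scratch", torch.zeros(3), persistent=False)

m = M()
print([name for name, _ in m.named_buffers()])  # ['scratch']
print(list(m.state_dict().keys()))              # [] - absent from state_dict
```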
After explicitly filtering out `_replay_buffer` from `named_buffers()`, I was able to successfully train `ppo_cart_pole` with DDP.
With pretty much half the training time (although training time does not account for much of each iteration under this setting and in this project).
(Dark blue is DDP, with 2x GPU)
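A minimal sketch of the kind of filtering described above, assuming DDP discovers buffers through the wrapped module's `named_buffers()`; the wrapper class and `exclude_prefix` are hypothetical names for illustration. (Alternatively, constructing DDP with `broadcast_buffers=False` skips buffer syncing entirely, at the cost of not syncing any buffers.)

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class BufferFilteredWrapper(nn.Module):
    """Hypothetical wrapper that hides replay-buffer buffers from DDP."""

    def __init__(self, module, exclude_prefix="_replay_buffer"):
        super().__init__()
        self._module = module
        self._exclude_prefix = exclude_prefix

    def forward(self, *args, **kwargs):
        return self._module(*args, **kwargs)

    def named_buffers(self, prefix="", recurse=True, **kwargs):
        # Only expose buffers that do not belong to the replay buffer.
        for name, buf in super().named_buffers(prefix, recurse, **kwargs):
            if self._exclude_prefix not in name:
                yield name, buf

# ddp_module = DDP(BufferFilteredWrapper(algorithm).to(rank), device_ids=[rank])
```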
With some hacks I was able to run PPG with DDP on two 3080s. Below is a comparison of the same setup trained on two 3080s (with DDP) and on a single 3090 (without DDP):
Note that the DDP version did better when looking at the by-env-steps graph:
Also, the time consumed is less than on a single 3090:
It is actually not 2x but roughly 1.5x faster. I think one of the factors is that a single 3090 performs better than a single 3080. Another reason could be that in this hacky version I had to let DDP figure out which parameters are "unused", which adds overhead. I am still working on removing those hacks.
I got stuck on how `find_unused_parameters` works for DDP, which is crucial to PPG. I suspect there are bugs in `find_unused_parameters`, or hidden assumptions that I am not aware of. Will need more experiments.
The reason we need it for PPG is that the auxiliary output of PPG's network is not used in the policy-phase update, only in the auxiliary-phase update. The corresponding parameters therefore become "unused", and DDP does not like that, since it waits for hooks to be called on all parameters.
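A minimal sketch of this situation (the network and head names are made up for illustration): during the policy phase only the policy head contributes to the loss, so the auxiliary head's parameters never get their hooks fired unless DDP is told to look for unused parameters.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

rank = dist.get_rank()  # assumes the process group is initialized

class PPGLikeNet(nn.Module):
    """Illustrative two-headed network: a policy head and an auxiliary head."""

    def __init__(self):
        super().__init__()
        self.trunk = nn.Linear(8, 16)
        self.policy_head = nn.Linear(16, 4)
        self.aux_head = nn.Linear(16, 1)

    def forward(self, x):
        h = torch.relu(self.trunk(x))
        return self.policy_head(h), self.aux_head(h)

net = DDP(PPGLikeNet().to(rank), device_ids=[rank],
          find_unused_parameters=True)  # needed: aux_head is unused below

logits, aux = net(torch.randn(4, 8, device=rank))
policy_loss = logits.sum()  # policy phase: the auxiliary output is ignored
policy_loss.backward()      # without find_unused_parameters this hangs/errors
```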
The above issues can be resolved by #1114 and #1117.
When turning on DDP, PPG + MetaDrive can arbitrarily get stuck after several iterations (or several hundred iterations). To make sure that it is DDP causing the problem, I also ran another training without DDP, and the result looks good. See below for the comparison.
The `TimeLimit` wrapper may help.
See the red line below: when the auxiliary phase is turned off (effectively PPO), the getting-stuck problem did not reproduce and the training dynamics looked normal (it is not as efficient as PPG, which is expected).
```
...
INFO:absl:[rank = 0] Run _update() of b945/960, u = 0
Perform _compute_train_info_and_loss_info with [rank 0]
INFO:absl:[rank = 1] Run _update() of b945/960, u = 0
Perform _compute_train_info_and_loss_info with [rank 1]
INFO:absl:[rank = 0] End u = 0, b = 945
INFO:absl:[rank = 1] End u = 0, b = 945
INFO:absl:[rank = 0] Run _update() of b0/960, u = 1
Perform _compute_train_info_and_loss_info with [rank 0]
```
See below for an explanation of the above debugging log.
Further debugging shows that when it gets stuck, it is inside the `_update()` of `PPGAuxAlgorithm` (i.e. the auxiliary-phase update). There are supposed to be 6 updates (`u` goes from 0 to 5) per process (there are 2 processes, rank 0 and rank 1). Everything was fine until the first update completed in both processes. In the next update (`u = 1`), only the process with `rank = 0` called `_compute_train_info_and_loss_info`, but not the one with `rank = 1`. Because DDP needs to synchronize once both processes have finished calling this function, it waits forever.
Was able to pinpoint the problem at
```python
experience = alf.nest.map_structure(lambda x: x[indices],
                                    experience)
```
where one of the processes gets stuck. This is outside the DDP-wrapped code, and it is consistently reproducible on 2 different machines.
The above code comes from https://github.com/HorizonRobotics/alf/blob/pytorch/alf/algorithms/algorithm.py#L1372
Both `experience` and `indices` are on CPU, so this is a CPU operation.
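For context, a minimal sketch of what that line does, using a plain dict in place of ALF's nested experience structure (the field names and sizes are made up): the same indices select a mini-batch from every tensor in the nest, entirely on CPU.

```python
import torch

# Stand-in for a nested experience structure (all tensors on CPU).
experience = {
    "observation": torch.randn(960, 4),
    "action": torch.randint(0, 2, (960,)),
    "reward": torch.randn(960),
}
indices = torch.randperm(960)[:64]  # mini-batch selection

# Equivalent in spirit to:
#   experience = alf.nest.map_structure(lambda x: x[indices], experience)
mini_batch = {name: x[indices] for name, x in experience.items()}
```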
This one seems related: https://discuss.pytorch.org/t/training-get-stuck-at-some-iteration-step/48329 There does not seem to be a solution yet.
Latest experiment result: after moving the shuffle to be per mini-batch, it worked around the previous stuck point. However, it then freezes when calling the DDP-wrapped function `_compute_train_info_and_loss_info`.
Can rule out `find_unused_parameters` as the cause. I tried a hack that works around `find_unused_parameters`, and the problem persists.
This is a follow-up to #913
Motivation
Add full support for multi-process and multi-GPU training in alf with PyTorch's DDP.
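For reference, a minimal sketch of the standard PyTorch multi-process setup this builds on (the entry point, port, and world size below are illustrative, not ALF's actual launcher):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # One process per GPU; NCCL backend for GPU collectives.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(8, 2).to(rank), device_ids=[rank])
    # ... training loop: forward, loss, backward (gradients are all-reduced) ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # e.g. two GPUs
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```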
Goals
- `forward` of the wrapped DDP module is equivalent to calling the original function, with distributed hooks added to the result #1098
- `unroll()` should not go through DDP in the off-policy branch #1114

While achieving the main goals above, we should also make sure that the following specific use cases are considered:

- Parameters updated outside of `backward` and `optimizer` (e.g. the target updater in SAC): make sure that the behavior is consistent with the non-distributed version.
- `num_env_steps`
- Upon `SIGINT`, there are defunct zombie processes left

Blockers and Issues: