breakds commented 2 years ago

This is a follow-up to #913

Motivation

Add full support for multi-process and multi-GPU training in alf with pytorch's DDP.

Goals

[x] A general DDP wrapper that can convert a function to DDP module, where calling the forward of the wrapped DDP module is equivalent to calling the original function, with distributed hooks added to the result #1098
[x] Update single machine multiple GPU training for on policy branch (e.g. standard Actor Critic) #1098
[x] Support single machine multiple GPU training on off policy branch (e.g. PPO) #1099
[ ] unroll() should not go through DDP in off-policy branch #1114
[ ] Investigate how reducer's bucket size affects the performance
[ ] Compare the performance with ZeroRedundancyOptimizer turned on and off

While achieving the main goals above, we should also make sure that the following specific use cases are considered.

[x] In a composite algorithm such PPG, all the networks involved in training should be properly distributed.
[x] If evaluation is turned on, it should only be turned on for process with rank = 0
[ ] There can be parameters that are not updated via backward and optimizer (e.g. target updater in SAC). Make sure that the behavior is consistent with the non-distributed version.
[ ] There can be other variables that can affect training but not synchronized via DDP (e.g. batch normalization). This can introduce inconsistency with the non-distributed version. Investigate whether such inconsistency can be a problem.
[x] Find a reasonable way to adapt the metrics
[x] Find a reasonable way to reinterpret the termination conditions such as num_env_steps
[ ] Resolve the problem when terminated by SIGINT, there are defunct zombie processes left
[ ] Run on 4 - 8 cards on Cluster

Blockers and Issues:

[ ] DDP + PPG + MetaDrive with default configuration may get stuck, and the performance is really bad. However, it is verified that DDP + PPG + Procgen reaches the performance parity (and even slightly better).

breakds commented 2 years ago

What exactly should DDP wrap?

Essentially, a DDP module wraps a set of parameters, and a function f (as the forward() of the DDP module). After the wrapping, when forward() (i.e. f) is called, all the result and intermediate result that depends on the parameters will be marked and autograd hooks are injected for those results.

Later when the results' backward() is called, those hooks will invoke reducer (synchronization between subprocesses).

Note that the results' backward() can be either directly called or indirectly (i.e. as a result of calling backward() on values that depends on them) called. This means that as long as in loss(f(x)) all the to-be-updated parameters are used inf, we only need to wrap f (as opposite to having to wrap loss()).

breakds commented 2 years ago

Can we have more than 1 DDP wrapped modules in a distributed training?

The answer is yes. Theoretically it works and I coded an experiment to verify that. It is worth noting that if you have more than 1 DDP wrapped modules, the order of calling in different subprocesses needs to be exactly the same. Because of how DDP works, if the order is different, the reducer of module A in process 1 might be waiting for its counterpart in process 2, while in the process 2 the reducer of module B is waiting for its counterpart in process 1 - effectively a textbook example of deadlock.

breakds commented 2 years ago

While working on enabling @data_distributed decorator for the off-policy branch, I hit a blocker that at the initial sync, there will be exception complaining: "Tensors must be CUDA and dense".

After some digging I found that the problem comes from the fact that when DDP start to sync (reduce), it will sync the buffers of the wrapped module as well. All the offending buffers are within the replay buffer. I am working on a generic way to rule them out before being wrapped by the DDP.

One of the problems is that the replay_buffer can be found in named_buffers() but not in state_dict() of the wrapped module. Investigating the reason now.

breakds commented 2 years ago

After explicit filtering out _replay_buffer in named_buffers(), I was able to successfully train ppo_cart_pole with DDP.

ddp_ppo_cart_pole

With pretty much half the training time (although it does not account for much in each training iteration under this setting and this project).

ddp_ppo_cart_pole_time

(Dark blue is DDP, with 2x GPU)

breakds commented 2 years ago

With some hack I was able to run PPG with DDP on two 3080s. Below is the comparison of the same setup trained on

Dark Blue - single 3090 with Intel CPU
Light Blue - DDP on dual 3080 with AMD threadripper

ddp_ppg_procgen_return

Note that the DDP version did better when looking at the by env steps graph:

ddp_ppg_return_by_env_steps

Also, the time consumed is less than single 3090:

ddp_ppg_return_time

It is actually not 2x but 1.5x faster. I think one of the factor is that 3090 has a better performance than a single 3080.

Another reason could be in this hacky version I had to let DDP figure out what parameters are "unused" which adds overhead. I am still working on remove those hacks.

breakds commented 2 years ago

I got stuck on how find_unused_parameters is working for DDP, which is crucial to PPG. I suspect there are bugs in find_unused_parameters or hidden assumptions that I am not aware of. Will need to have more experiments.

The reason we need it for PPG is that PPG's network's auxiliary output is not used for policy phase update, but only in auxiliary phase update. Therefore corresponding parameters becomes "unused", and DDP does not like that as it is waiting for hooks to be called on all parameters.

breakds commented 2 years ago

The above issues can be resolved by #1114 and #1117

breakds commented 2 years ago

When turning on DDP, PPG + Metadrive can get stuck after several iterations (or several hundreds of iterations) arbitrarily. To make sure that it is DDP causing the problem, I also ran another training without DDP, and the result looks good. See below for the comparison.

ddp_ppg_metadrive_issue

It occurs to me that maybe the "getting stuck" problem is caused by "Insanely long episode", so that MetaDrive simulator get stuck somehow. Adding TimeLimit wrapper may help.
Also, we expect the same or similar training dynamics with DDP turned on/off. Such dramatic difference indicates that there are some inconsistency and I need to find out why.

breakds commented 2 years ago

See the red line below, when the auxiliary phase is turned off (effectively PPO), the getting stuck problem did no reproduce and the training dynamics seemed normal (it is not as efficient as PPG which is in general a fact we know).

ddp_ppg_metadrive_issue_1

breakds commented 2 years ago

....
INFO:absl:[rank = 0] Run _update() of b945/960, u = 0
Perform _compute_train_info_and_loss_info with [rank 0]
INFO:absl:[rank = 1] Run _update() of b945/960, u = 0
Perform _compute_train_info_and_loss_info with [rank 1]
INFO:absl:[rank = 0] End u = 0, b = 945
INFO:absl:[rank = 1] End u = 0, b = 945
INFO:absl:[rank = 0] Run _update() of b0/960, u = 1
Perform _compute_train_info_and_loss_info with [rank 0]

Explanation of the above debugging log, see below.

Further debugging shows that when getting stuck, it is inside the _update() of the PPGAuxAlgorithm (i.e. auxiliary phase update). We are about to complete 6 updates (u goes from 0 to 5) per process (there are 2 processes, rank 0 and rank 1). Everything was fine until the first update of both processes complete. In the next update (u = 1), only the process with rank = 0 called _compute_train_info_and_loss_info, but not the one with rank = 1. Because of DDP needs to synchronize when both process has finished calling this function, it will wait forever.

breakds commented 2 years ago

Was able to pin point the problem at

                experience = alf.nest.map_structure(lambda x: x[indices],
                                                    experience)

Where one of the process got stuck here, which is outside the DDP-wrapped code. This is consistently reproducible on 2 different machines.

The above code comes from https://github.com/HorizonRobotics/alf/blob/pytorch/alf/algorithms/algorithm.py#L1372

Both experience and indices are on cpu, so this is a CPU operation.

breakds commented 2 years ago

This one seems related: https://discuss.pytorch.org/t/training-get-stuck-at-some-iteration-step/48329 There does not seem to be a solution yet.

breakds commented 2 years ago

Latest experiment result - after moving shuffle into per mini_batch, it worked around the previous stuck point. However, it then freezes at calling the DDP wrapped function _compute_train_info_and_loss_info.

breakds commented 2 years ago

Can rule out find_ununsed_parameters as the cause. I tried a hack that worked around find_unused_parameters and the problem persists.

HorizonRobotics / alf

Multi-GPU Training with DDP #1096

Motivation

Goals

Blockers and Issues:

What exactly should DDP wrap?

Can we have more than 1 DDP wrapped modules in a distributed training?