Baylus / 2048I

Teaching ML to play 2048 better
MIT License

Implement true DQN parallel training #13

Open Baylus opened 1 month ago

Baylus commented 1 month ago

So our DQN training is exceptionally slow. My current projection for 10,000 episodes is 550 days, and that's extrapolated from timing only a sample of 50 episodes. Later in training we should be reaching longer games and will need more time to process each individual game. I am not sure how many episodes our use case will need, but just running 50 takes multiple days.

So, we may have to bend the rules a bit and try to get a parallel implementation going.

Baylus commented 1 month ago

One method would be to train multiple DQN agents at the same time and then periodically share the model weights each agent has trained, following some sharing plan (e.g. only share during the last 25% of training, or sync weights weighted toward the better-performing agents). A rough sketch of the blending step is below, after the pros/cons.

Pros:

Cons:
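
To make the sharing plan concrete, here is a minimal sketch of a performance-weighted blend step. It assumes each agent exposes Keras-style `get_weights()`/`set_weights()` and that we track a recent average score per agent; those names and the score bookkeeping are assumptions, not code from this repo.

```python
import numpy as np

def blend_weights(agent_weights, scores):
    """Average each layer across agents, weighted by recent performance.

    agent_weights: one weight list per agent (each a list of numpy arrays).
    scores: recent average game score per agent, used as the blend weight.
    """
    mix = np.asarray(scores, dtype=np.float64)
    mix = mix / mix.sum()  # favor the better-performing agents
    n_layers = len(agent_weights[0])
    return [sum(m * w[i] for m, w in zip(mix, agent_weights))
            for i in range(n_layers)]

def sync_agents(agents, scores):
    """One round of the sharing plan: push the blended weights to every agent."""
    blended = blend_weights([a.get_weights() for a in agents], scores)
    for agent in agents:
        agent.set_weights(blended)
```

Calling `sync_agents` every N episodes, or only once we are in the last 25% of training, would cover both of the sharing plans mentioned above.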

Baylus commented 1 month ago

Create process-based parallelism to spread out the work needed for each turn during replay training. This would require IPC not only between the sibling processes but also with the parent, specifically for sharing the replay buffer and the current model. This may not be that helpful, considering the processes would need to hold locks on the shared model whenever they access it, so at some point adding more processes wouldn't help. (See the sketch of the contention pattern after the pros/cons.)

Pros:

Cons:
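
For reference, a rough sketch of what the shared-model, shared-buffer layout could look like with `multiprocessing`. The buffer contents, the weight container, and `train_step` are all stand-ins rather than the real code; the point is that every worker serializes on one lock.

```python
import multiprocessing as mp
import random

def train_step(weights, batch):
    # stand-in for the real gradient update on one minibatch
    return weights

def worker(shared_weights, replay_buffer, lock, n_steps):
    for _ in range(n_steps):
        batch = random.sample(list(replay_buffer), k=32)
        with lock:  # every worker contends for this one lock,
                    # so past some point extra processes stop helping
            weights = list(shared_weights)
            shared_weights[:] = train_step(weights, batch)

if __name__ == "__main__":
    with mp.Manager() as mgr:
        replay_buffer = mgr.list(range(1000))  # stand-in for stored transitions
        shared_weights = mgr.list([0.0] * 8)   # stand-in for model weights
        lock = mgr.Lock()
        procs = [mp.Process(target=worker,
                            args=(shared_weights, replay_buffer, lock, 5))
                 for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
```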

The locking issue could be mitigated if we instead decoupled the models: give each process a slice of the replay minibatch to train its own copy of the model on, have it return its weights once its training is done, and then average the weights from each of the child processes (see the sketch after the pros/cons below).

This seems like it would be a hybrid between having separate DQN agents train different models and a single-process training method with no parallelism.

Pros:

Cons:
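
A minimal sketch of that decoupled flow, assuming the weights travel between processes as a picklable list of numpy arrays; `train_on_slice` is a placeholder for the real per-slice replay training, and the caller would need to run this from under an `if __name__ == "__main__":` guard.

```python
import numpy as np
from multiprocessing import Pool

def train_on_slice(args):
    weights, batch_slice = args
    # placeholder: load weights into a private model copy, run the replay
    # training on batch_slice, then return that copy's updated weights
    return weights

def parallel_replay_step(weights, replay_sample, n_procs=4):
    # give each process its own slice of the minibatch (simple striding split)
    slices = [replay_sample[i::n_procs] for i in range(n_procs)]
    with Pool(processes=n_procs) as pool:
        results = pool.map(train_on_slice, [(weights, s) for s in slices])
    # element-wise average of the weights returned by each child process
    return [np.mean([r[i] for r in results], axis=0)
            for i in range(len(weights))]
```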

Going even further, we may be able to do something really interesting: instead of training the model on a single batch each turn, we could spin up entirely separate processes that each grab a different minibatch, train a copy of the input model on it, and return that copy's weights. We would then average the trained copies and use the result as the new model going into the next turn. (Sketch after the pros/cons below.)

Pros:

Cons:
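
Roughly, the per-turn version could look like the following, where each worker samples its own minibatch instead of sharing one. Again hedged: `train_copy` and the replay-buffer shape are assumptions, not the project's actual API.

```python
import random
from multiprocessing import Pool

import numpy as np

def train_copy(args):
    weights, minibatch = args
    # stand-in for: build a fresh model from weights, fit it on minibatch,
    # and return that model's updated weights
    return weights

def turn_update(weights, replay_buffer, n_procs=4, batch_size=32):
    # each process gets its own independently sampled minibatch
    jobs = [(weights, random.sample(replay_buffer, batch_size))
            for _ in range(n_procs)]
    with Pool(processes=n_procs) as pool:
        trained = pool.map(train_copy, jobs)
    # average the trained copies; this becomes the model for the next turn
    return [np.mean([t[i] for t in trained], axis=0)
            for i in range(len(weights))]
```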