Baylus / 2048I

Teaching ML to play 2048 better
MIT License

Implement true DQN parallel training #13

Open Baylus opened 1 month ago

Baylus commented 1 month ago

So our DQN training is exceptionally slow. My current projection for 10,000 episodes is 550 days, and that's extrapolated from timing only a sample of 50 episodes. Later in training we should be reaching longer games and will need more time to process each individual game. I am not sure how many episodes our use case will need, but just running 50 takes multiple days.

So, we may have to bend the rules a bit and try to get a parallel implementation going.

Baylus commented 1 month ago

One method would be to train multiple DQN agents at the same time and then periodically share the model weights each agent has trained, following some sharing plan (e.g. only share during the last 25% of training, or sync weights weighted toward the better-performing agents). A rough sketch of the blending step is below, after the pros/cons.

Pros:

Cons:
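
To make the sharing plan concrete, here is a minimal sketch of a performance-weighted blend step. It assumes each agent exposes Keras-style `get_weights()`/`set_weights()` and that we track a recent average score per agent; those names and the score bookkeeping are assumptions, not code from this repo.

```python
import numpy as np

def blend_weights(agent_weights, scores):
    """Average each layer across agents, weighted by recent performance.

    agent_weights: one weight list per agent (each a list of numpy arrays).
    scores: recent average game score per agent, used as the blend weight.
    """
    mix = np.asarray(scores, dtype=np.float64)
    mix = mix / mix.sum()  # favor the better-performing agents
    n_layers = len(agent_weights[0])
    return [sum(m * w[i] for m, w in zip(mix, agent_weights))
            for i in range(n_layers)]

def sync_agents(agents, scores):
    """One round of the sharing plan: push the blended weights to every agent."""
    blended = blend_weights([a.get_weights() for a in agents], scores)
    for agent in agents:
        agent.set_weights(blended)
```

Calling `sync_agents` every N episodes, or only once we are in the last 25% of training, would cover both of the sharing plans mentioned above.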

Baylus commented 1 month ago

Create process-based parallelism to spread out the work needed for each turn during replay training. This would require IPC not only between the sibling processes but also with the parent, specifically for sharing the replay buffer and the current model. This may not be that helpful, considering the processes would need to hold locks on the shared model whenever they access it, so at some point adding more processes wouldn't help. (See the sketch of the contention pattern after the pros/cons.)

Pros:

Cons:
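
For reference, a rough sketch of what the shared-model, shared-buffer layout could look like with `multiprocessing`. The buffer contents, the weight container, and `train_step` are all stand-ins rather than the real code; the point is that every worker serializes on one lock.

```python
import multiprocessing as mp
import random

def train_step(weights, batch):
    # stand-in for the real gradient update on one minibatch
    return weights

def worker(shared_weights, replay_buffer, lock, n_steps):
    for _ in range(n_steps):
        batch = random.sample(list(replay_buffer), k=32)
        with lock:  # every worker contends for this one lock,
                    # so past some point extra processes stop helping
            weights = list(shared_weights)
            shared_weights[:] = train_step(weights, batch)

if __name__ == "__main__":
    with mp.Manager() as mgr:
        replay_buffer = mgr.list(range(1000))  # stand-in for stored transitions
        shared_weights = mgr.list([0.0] * 8)   # stand-in for model weights
        lock = mgr.Lock()
        procs = [mp.Process(target=worker,
                            args=(shared_weights, replay_buffer, lock, 5))
                 for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
```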

The locking issue could be mitigated if we instead decoupled the models: give each process a slice of the replay minibatch to train its own copy of the model on, have it return its weights once its training is done, and then average the weights from each of the child processes (see the sketch after the pros/cons below).

This seems like it would be a hybrid between having separate DQN agents train different models and a single-process training method with no parallelism.

Pros:

Cons:
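
A minimal sketch of that decoupled flow, assuming the weights travel between processes as a picklable list of numpy arrays; `train_on_slice` is a placeholder for the real per-slice replay training, and the caller would need to run this from under an `if __name__ == "__main__":` guard.

```python
import numpy as np
from multiprocessing import Pool

def train_on_slice(args):
    weights, batch_slice = args
    # placeholder: load weights into a private model copy, run the replay
    # training on batch_slice, then return that copy's updated weights
    return weights

def parallel_replay_step(weights, replay_sample, n_procs=4):
    # give each process its own slice of the minibatch (simple striding split)
    slices = [replay_sample[i::n_procs] for i in range(n_procs)]
    with Pool(processes=n_procs) as pool:
        results = pool.map(train_on_slice, [(weights, s) for s in slices])
    # element-wise average of the weights returned by each child process
    return [np.mean([r[i] for r in results], axis=0)
            for i in range(len(weights))]
```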

Going even further, we may be able to do something really interesting: instead of training the model on a single batch each turn, we could spin up entirely separate processes that each grab a different minibatch, train a copy of the input model on it, and return that copy's weights. We would then average the trained copies and use the result as the new model going into the next turn. (Sketch after the pros/cons below.)

Pros:

Cons:
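
Roughly, the per-turn version could look like the following, where each worker samples its own minibatch instead of sharing one. Again hedged: `train_copy` and the replay-buffer shape are assumptions, not the project's actual API.

```python
import random
from multiprocessing import Pool

import numpy as np

def train_copy(args):
    weights, minibatch = args
    # stand-in for: build a fresh model from weights, fit it on minibatch,
    # and return that model's updated weights
    return weights

def turn_update(weights, replay_buffer, n_procs=4, batch_size=32):
    # each process gets its own independently sampled minibatch
    jobs = [(weights, random.sample(replay_buffer, batch_size))
            for _ in range(n_procs)]
    with Pool(processes=n_procs) as pool:
        trained = pool.map(train_copy, jobs)
    # average the trained copies; this becomes the model for the next turn
    return [np.mean([t[i] for t in trained], axis=0)
            for i in range(len(weights))]
```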