MatheusMRFM / A3C-LSTM-with-Tensorflow

An implementation of the A3C deep reinforcement learning method using an LSTM layer. Created with Tensorflow.

Problems with training #3

Closed Palkos83 closed 6 years ago

Palkos83 commented 6 years ago

Hi

Firstly, really nice code. It helped me understand the A3C fundamentals. However, I do struggle to get it to converge. I have tried at least four different implementations, and most of them have this same issue. Yesterday I tried your branch and kicked off the Pong training out of the box. The only change I made was to use 8 workers, as my rig has 8 cores. Sadly, even after 1.8M epochs it is not converging. The agent barely moves towards the ball.

Here is a screenshot from Tensorboard: [image]

I came across your notes on Stack Overflow, where you seemed to have sorted the issue by adding a small number to the policy output to avoid NaNs: https://stackoverflow.com/questions/44926583/cant-get-my-a3c-with-lstm-layer-using-tensorflow-to-work However, that does not seem to match your current code.

Any ideas what might be the problem? Or how I could even start debugging it to find out? Many thanks in advance for your help.

MatheusMRFM commented 6 years ago

Hello there! First of all, thanks for using my code! I'm glad it helped you better understand the A3C method.

Regarding your problem, I think the issue here is that you left it running for only a small amount of time. Before I uploaded this code, I remember using it to train an agent for Pong (both Pong-v4 and PongDeterministic-v4) and for Breakout-v4. Both achieved good results. For instance, here is the reward curve for PongDeterministic-v4:

[reward curve plot: 7e-4 30 84]

and here is the curve for Pong-v4:

[reward curve plot: 7e-4 30 84]

Notice that the agent only starts to increase its score somewhere around 3M frames. Therefore, try leaving the agent training for longer and see what happens.

But just to make sure, I will try running my GitHub version to see if there is any problem. I'll get back to you when I have new results.

As for my thread on Stack Overflow, I ended up using "tf.clip_by_value" to avoid zeros in the policy, which in turn avoids NaN when passing the policy to the log function (line 196 of Network.py).
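For illustration, here is a minimal TF1-style sketch of that idea (the placeholder names below are hypothetical, not the exact code from Network.py):

```python
import tensorflow as tf

# Sketch: clip the softmax output away from 0 before taking the log, so a
# zero probability never reaches tf.log and produces NaN in the policy loss.
num_actions = 6
logits = tf.placeholder(tf.float32, [None, num_actions])
actions_onehot = tf.placeholder(tf.float32, [None, num_actions])
advantages = tf.placeholder(tf.float32, [None])

policy = tf.nn.softmax(logits)
safe_policy = tf.clip_by_value(policy, 1e-10, 1.0)   # keep probabilities away from 0

responsible = tf.reduce_sum(safe_policy * actions_onehot, axis=1)
policy_loss = -tf.reduce_sum(tf.log(responsible) * advantages)
```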

Palkos83 commented 6 years ago

Many thanks for the super fast answer. I will try leaving it running for longer then. Just out of curiosity, how long does it take you to train 1M steps? I was running the code on an i7 7700K + GTX 1080 GPU (I am not sure if your code benefits from it, though), and 1.8M steps took nearly a day. That was with 8 workers running as well.

Is that expected?

Once again, many thanks for your help.

MatheusMRFM commented 6 years ago

I left the code running right after I read your first question (about 3 hours ago) and, at the moment, I'm already at 2,695,268 steps (the score still remains at approximately -21). Therefore, I think that perhaps there is some problem with your Tensorflow installation. My guess is the number of workers you are using: I did a quick search and it seems that your CPU in fact has 4 physical cores, not 8. In this case, I recommend using only 4 workers, so that you run one worker per core (although your processor does have hyper-threading). My CPU is also a quad-core i7 (I don't remember the exact model), and using 4 workers usually results in faster runs.
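As a rough sketch of how you might pick the worker count from the machine itself (the Worker class below is a trivial stand-in, not the repo's Worker class):

```python
import multiprocessing
import threading

# cpu_count() reports logical cores (8 on a 4-core CPU with hyper-threading);
# halve it if you want one worker per physical core.
num_workers = multiprocessing.cpu_count()

class Worker:
    """Trivial stand-in for an A3C worker, just to show the launch pattern."""
    def __init__(self, idx):
        self.idx = idx

    def work(self):
        print("worker", self.idx, "running")

threads = [threading.Thread(target=Worker(i).work) for i in range(num_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```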

Try this experiment and see if it works. As a final comment, note that my code doesn't support GPU, because I have no CUDA device in my PC :(

Palkos83 commented 6 years ago

Many thanks for the response. You are right, it is 4 physical cores and 8 threads. I reduced the number of workers back to 4 and turned off rendering for the workers as well (I presume that made most of the difference, since it started going through the steps much faster, about 2M steps per hour).

And the best news is that it trained as you predicted; all I had to do was be patient :)

[image]

That is great, it is working really nicely. I will try Breakout and other games now as well.

Great job, nice code, very helpful!

MatheusMRFM commented 6 years ago

Glad to hear that this had a happy ending :)

I forgot to ask if you had left the 'render' flag set to True. Rendering does use a lot of extra processing compared to running without it.
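For anyone else reading, here is a generic sketch of how such a render toggle looks in a worker loop (RENDER and the env name are illustrative; this is not claiming to be the repo's exact flag, and Pong-v4 requires the Atari dependencies for gym):

```python
import gym

RENDER = False  # keep False during training: rendering every frame slows the workers considerably

env = gym.make("Pong-v4")
obs = env.reset()
for _ in range(100):
    if RENDER:
        env.render()
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        obs = env.reset()
env.close()
```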

Feel free to ask if you have any other questions.