arnomoonens / yarll

Combining deep learning and reinforcement learning.
MIT License

Actor Critic for Mountain Car #3

Closed zencoding closed 7 years ago

zencoding commented 7 years ago

Hi, thanks a lot for the wonderful work. The code is well written and modular, and it helped me a lot to play with RL in general. I want to know if you have an opinion on using actor-critic methods on control problems such as Mountain Car. I added Mountain Car as an environment and ran A2C and A3C; neither of them converged to a solution. I changed the discount factor hyperparameter to 0.9 and got both the actor loss and the critic loss to zero, but the episode reward never got above -200. I then tried adding an entropy bonus to improve exploration, but that leads to exploding gradients. Looking deeper, it seems that actor-critic methods are not good at exploring the space when the return is constant.
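
To be concrete about the entropy part: what I tried is roughly the usual entropy bonus added to the actor loss. A minimal sketch of the idea in TensorFlow (placeholder names and sizes, not the exact code I ran):

```python
import tensorflow as tf

# Sketch of an entropy bonus on the actor loss (placeholder names and sizes).
logits = tf.placeholder(tf.float32, [None, 3])      # actor output, 3 MountainCar actions
actions = tf.placeholder(tf.int32, [None])          # actions that were taken
advantages = tf.placeholder(tf.float32, [None])     # returns minus critic baseline

probs = tf.nn.softmax(logits)
log_probs = tf.nn.log_softmax(logits)
chosen_log_probs = tf.reduce_sum(log_probs * tf.one_hot(actions, 3), axis=1)

# Entropy of the policy: higher entropy means more exploration.
entropy = -tf.reduce_sum(probs * log_probs, axis=1)

beta = 0.01  # entropy coefficient; a large value can blow up the gradients
actor_loss = -tf.reduce_mean(chosen_log_probs * advantages + beta * entropy)
```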

I looked at other implementations that solve Mountain Car, and it is solved either with function approximation (tile coding) or with DPG; I found no one who had used actor-critic.

arnomoonens commented 7 years ago

Hello,

Thank you for the feedback about my code! I'm glad my repository can help other people.

I have experienced the same issue with the MountainCar-v0 environment. The problem is that we are applying on-policy methods (A2C and A3C) to an environment that rarely gives useful rewards (i.e. only at the end).

I have only used Sarsa with function approximation (not DPG), and I believe this algorithm works quite well on the MountainCar-v0 environment because it favors actions that haven't been tried yet in the current state. This happens because the thetas are initialized uniformly at random: whenever a reward (for this environment, -1) is received, it only changes the thetas for the previous state and action. I haven't studied or implemented DPG yet, so I am interested in how that algorithm is able to "solve" this environment.
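
To make that concrete, here is a rough sketch (in numpy, not the actual code in this repository) of the Sarsa + linear function approximation update I have in mind, where `features` stands in for the tile-coded state:

```python
import numpy as np

n_features = 2048   # size of the tile-coded feature vector (placeholder value)
n_actions = 3
alpha = 0.1         # learning rate
gamma = 0.99        # discount factor

# Thetas initialized uniformly at random: actions that haven't been tried yet
# keep their random value, while tried ones are pushed down by the -1 rewards.
theta = np.random.uniform(size=(n_actions, n_features))

def q_value(features, action):
    return theta[action].dot(features)

def sarsa_update(features, action, reward, next_features, next_action, done):
    target = reward if done else reward + gamma * q_value(next_features, next_action)
    td_error = target - q_value(features, action)
    # Only the thetas of the taken action (and the active features) are updated.
    theta[action] += alpha * td_error * features
```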

In contrast to Sarsa + function approximation, an A3C update can influence all the parameters (in this case the neural network weights), and thus the output for every state (the input to the actor's neural network) can change. I ran an experiment, and the network always seems to output the same probabilities, since the feedback to the network is also always the same. Thus, you can only reach the finish by luck. Once the agent has "discovered" the finish, performance should improve. In fact, some people report having successfully learned this environment using A3C.

I hope my explanation is clear; otherwise, feel free to ask more questions. I also don't fully understand it yet, and unfortunately I don't have enough time right now to investigate the problem more thoroughly.


By the way, the weights of the networks in my A2C and A3C implementations weren't initialized properly. The standard deviation was 1, which is too big and can lead to big differences between the probabilities of the actions: sometimes an action had a probability of only 0.5%, for example. As I explained, the probabilities never change much, so such an action is rarely ever selected. I have now changed this (commit 6a0d879dd254dcf5ccac86ac9958e203ddb4e1c9) by using tf.truncated_normal_initializer(mean=0.0, stddev=0.02) as the weight initializer.
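
For example, the initializer can be passed to the layers roughly like this (just a sketch; the layer sizes and names here are placeholders, not the exact code in the repository):

```python
import tensorflow as tf

initializer = tf.truncated_normal_initializer(mean=0.0, stddev=0.02)

# Sketch of an actor network for MountainCar-v0 (2 state inputs, 3 actions).
states = tf.placeholder(tf.float32, [None, 2])
hidden = tf.layers.dense(states, 20, activation=tf.tanh,
                         kernel_initializer=initializer)
logits = tf.layers.dense(hidden, 3, kernel_initializer=initializer)
probs = tf.nn.softmax(logits)
```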

zencoding commented 7 years ago

Thanks for your explanation, that helps. It seems that on-policy methods have worse exploration than off-policy ones, so in situations where the rewards do not change with the state, it is better to use off-policy methods.

BTW, I tried various things to make A2C work, such as adding a reward for movement:

```python
for _ in range(self.config["repeat_nactions"]):
    state, rew, done, _ = self.step_env(action)
    stateDelta = np.mean(np.square(state - old_state))
    # Good reward if the agent moved the car
    if stateDelta > 0.0001:
        rew = 0
    if done:  # Don't continue if episode has already ended
        break
```

and experience replay and epsilon-greedy action selection:

```python
if np.random.rand() <= self.config["epsilon"]:
    action = np.random.randint(0, 3, size=1)[0]
else:
    action = self.choose_action(state)
```

but the network still won't converge to fewer than 200 steps. I don't know why, but I will investigate.

Thanks again for your help in understanding this.