Issues with learning in custom environment

faustomilletari commented 7 years ago

Hello,

I'm writing you to discuss a problem that is not directly related with your code and application but is affecting my own efforts to apply RL on a more custom type of problem where we have an environment which is not atari-like.

In my custom environment I have noisy, difficult to interpret images which are my observations, and i can take a bunch of actions. For each image between 1 and 4 actions can be considered correct and between 8 to 5 actions are always incorrect.

This problem can be also formulated in a fully supervised manner, as a classification problem, where we ignore the fact that more than one action can be considered correct at a time and that these actions are related to each other over time and define a "trajectory" of actions. When we use the supervised approach in the way i just described the system works well, meaning that there is no struggle interpreting those noisy images that are difficult to understand by humans.

When these images are organized in a structured manner, in an environment one can play with, it's possible to use a RL algorithm to solve the problem. We have tried with satisfactory results DQN, which works okay. In that case the reward signal is provided continuously, for an action that goes in the right direction we assign a reward of 0.05, for an action that goes in the wrong one -0.15, for a "done" action issued correctly +1 and for a "done" actions issued incorrectly -0.25. A "done" action doesn't terminate the episode (it is terminated after 50 steps). DQN in these settings very slowly converges and shows nice validation results.

When we employ A3C, the behavior is either:

Reward goes up for a bit, then stabilizes to a very poor value, never moves again (we tried also dampening beta as we optimize and we got an identical behaviour)
Reward fluctuates and then drift in a way that for 50 steps of the episode a correct action is NEVER done. meaning the -0.15 reward is always the one that is obtained for all the steps, like the network would have learned perfectly how not to do things.

I am very puzzled by this behavior. I have checked and re-checked every moving piece. the environment, the inputs to the network in terms of images, the rewards, the distribution of different actions over time (such that i see if the network is just learning to issue always the same action for some reason). All these problems seem to be absent. I thought it was a problem of exploration vs. exploitation and therefore i reduced first and then increased beta. i have also dampened beta over time to see what happens, but the most i obtained was a sinusoidal kind of behavior of the reward.

I also have tried to use a epsilon greedy strategy (just for debug purposes) instead of sampling from the policy distribution, with no success (network converges to the worst possible scenario rather quickly).

I tried reducing learning rate with no success.

Now, the policy gradient loss is not exactly the same as the cross-entropy loss but it resembles it quite a bit. With an epsilon greedy policy i would expect that for each image (we have a limited number of images (observations) that are re-proposed when the environment reaches a similar state) all the possible actions are actually explored and therefore the policy is learned in a way that is not so far away from the supervised case. If i set discount factor to zero (which i have tried), the value part of the network does not really play a role (i might be mistaken though) and if i give a reward for each step i take i should kind of converge to something that resembles my classification approach that i described above.

Maybe the fact that multiple actions can generate the same reward or penalty is the problem ?!

I would immensely appreciate any help or thought coming from you. Despite I'm really motivated about applying RL to my specific problem I really don't know what to do to improve the situation.

Thanks a lot,

Fausto

ifrosio commented 7 years ago

It's hard to comment as you are doing the experiments in first person, but I'll try anyway. The possible experiments that come to my mind are, not necessarily in the proper order:

Try giving a positive / negative reward only when done, avoid partial rewards; overall RL was designed to work with sparse rewards, although I agree that dense rewards may help. Also terminating the episode when done may be a good idea - I guess data coming after "done" may be very noisy and meaningless.
try modify tMax - if tMax is as long as the entire episode, the problem should me much closer to a supervised problem with noiseless data (at least for the estimation of the V value), and you should converge.
In our experience, fitting the correct V value is in general easy - is this the case for your application too?
It seems that you convergence towards a very bad local minimum - I would have suggested changing the learning rate, as we have seen that this has a big influence on the convergence, but it seems that you already tried. Anyway, some more try with very small or very large learning rate may give some more hints.
If you haven't done it yet, increase the minimum training batch size for increasing the stability of the algorithm.

faustomilletari commented 7 years ago

first of all, i would like to thank you for the interest in the problem i'm facing and for your kind answers. I am right now trying the delayed reward scheme.

I'm using reward +1 for correct "done" and reward -1 for wrong "done".

When I don't use epsilon greedy, but I rely on the usual exploration strategy, I quickly converge to the situation where the agent wonders endlessly (50 steps) without ever issuing "done". this is understandable. The same thing happens when i try with epsilon greedy (just for fun and also because it's difficult to find a good beta parameter actually).

Now i'm trying with 1 for correct "done" and -0.1 for wrong "done" hopefully encouraging "done" to be issued more often. I will update the thread here in order to also provide potentially useful information to future readers.

faustomilletari commented 7 years ago

Ok. Funny enough now it issues just "done" because it's more convenient now in average.

So it seems that the policy is collapsing to just one action when this action is the only one that gets rewarded (and sometimes penalized a little). This happens with both epsilon greedy and the policy distribution based exploration strategy.

This should not happen. We have the mechanism of the delayed reward in place exactly to address these kind of situations.

This can be due to three things:

The value function is not learned correctly. The mechanism of the delayed reward strongly depends on the value function, as in policy gradient in a3c we have a loss where we try to promote high ri + gamma*R - v(si). Possible causes: network has too few parameters (unlikely). Learning rate too big. There is some problem with the discounted reward somehow...
the network forgets. Even if there is an exploration strategy it's sufficient to see a few episodes in a row exhibiting a certain behavior and training with it, and suddenly the networks thinks that the best thing to do is that action. I don't believe that lowering the learning rate would help.
something I'm unable to see right now

faustomilletari commented 7 years ago

Also my losses do not look smooth or descending at all. Have no clue why. My activation histograms look like this when beta = 0.01

beta 0 01

and like this when beta = 0.1

beta 0 1

I will try to use tensor board a bit more effectively and get a more info about the behaviors i'm seeing. hopefully somebody more expert than me (i consider myself still a novice in RL) will have some idea.

ifrosio commented 7 years ago

Based on the previous comment, your agent seems to be short-sighted. It cannot see rewards in a far future... Maybe because he is receiving rewards at every frame, and fall into a local minimum where collecting low-term reward is enough? If this is the case, both reducing the learning rate and increasing tMax may help. I am not sure if also changing gamma may have some effect - in theory a small gamma should favor a short-sighted agent, on the other hand a small gamma may provide a larger gradient for a high, occasional big reward (like the one you provide when it's done). BTW, providing the right reward is a problem per se - why not providing just +1 or -1 for any reward or, as I suggested before, providing the reward only when it's done?

As for the histogram, beta = 0.1 seems to perfectly regularize the output (as far as I understand you have 9 action, and the histogram here is concentrated around p = 0.11, which is coherent with this). In other words, I guess this agent is completely random, and therefore useless. The first histogram seems much better.

As for the smoothness of the curves, RL is really unstable and I only occasionally see a smooth convergence curve. Nonetheless, you can "regularize" this curves by increasing the "minimum training batch size" in the config file, and/or increasing tMax (again in the config file). Be careful, however: changing these parameters may also require changing the learning rate.

NVlabs / GA3C

Issues with learning in custom environment #26