cvlab-stonybrook / Scanpath_Prediction

Predicting Goal-directed Human Attention Using Inverse Reinforcement Learning (CVPR2020)
MIT License

Do strictly negative rewards lead to forced efficiency? #17

Closed Doch88 closed 3 years ago

Doch88 commented 3 years ago

You are using a log-sigmoid as the reward activation for the PPO algorithm that trains the generator. That function has a maximum of zero (which is never actually attained), so all rewards are strictly negative. Does the generator therefore try to minimize the number of steps it takes (for example, finding the target item in 2 steps) in order to achieve a higher total reward? And if so, is it good that the generator pursues such efficiency instead of imitating the efficiency of the expert? With this reward function, the generator may learn to find an object in an image efficiently rather than learning to imitate a human scanpath.
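For concreteness, here is a minimal sketch of the reward shape I mean (my own code with hypothetical names, not taken from this repository): the discriminator's logit passed through a log-sigmoid is always negative, so every extra fixation strictly decreases the return.

```python
import torch
import torch.nn.functional as F

def gail_reward(disc_logits: torch.Tensor) -> torch.Tensor:
    """Reward = log(sigmoid(logit)) = log D(s, a).

    log-sigmoid maps the reals to (-inf, 0), so the reward is strictly
    negative for any finite logit; longer scanpaths accumulate more
    negative reward.
    """
    return F.logsigmoid(disc_logits)

# Even a very confident "real" logit yields a (slightly) negative reward.
logits = torch.tensor([-2.0, 0.0, 2.0, 10.0])
print(gail_reward(logits))  # all values < 0
```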

ouyangzhibo commented 3 years ago

@Doch88 Great question, and sorry for the late reply. Note that the generator stops when it hits the target. As far as I have observed, the generator does not tend to produce short scanpaths; in fact, the predicted scanpaths are usually longer than human ones. So from this point of view, the model is not being rewarded for making shorter scanpaths. This is understandable: if the generator went directly to the target when humans do not, that fixation (on the target) would look fake to the discriminator and receive a very large negative reward. Imitating humans takes more steps, but each step's negative reward is small, so their sum can still be greater than the return of the shorter scanpath.

Doch88 commented 3 years ago

As you can see in the cumulative probability plot in your paper, the value at step 1 is higher for your model than for humans, so in some cases your method is more efficient than a human. As you pointed out, this is not always the case, because taking fewer steps is not always advantageous.

Let's take, for example, a discriminator output of 0.6 for real-looking fixations and 0.4 for fake-looking ones.

log(0.4) ≈ -0.397
2 * log(0.6) ≈ -0.443

(base-10 logarithms; the comparison goes the same way in any base)

So, in this case, making fewer steps is better for the generator. More generally, let x be the discriminator output (after the sigmoid) for real-looking fixations and assume fake-looking ones score 1 - x. Fewer steps win whenever log(1 - x) > 2 * log(x), i.e. 1 - x > x^2, which holds for every x less than (sqrt(5) - 1)/2 ≈ 0.618.

With discount factors of 0.9 and 0.99, the corresponding threshold values (solving 1 - x = x^(1 + gamma)) are ≈ 0.609 and ≈ 0.617.
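A quick numerical check of these thresholds (a self-contained sketch; it just solves 1 - x = x^(1 + gamma) by bisection, nothing from the repository):

```python
def threshold(gamma: float, lo: float = 1e-9, hi: float = 1.0 - 1e-9) -> float:
    """Solve 1 - x = x**(1 + gamma) by bisection on (0, 1).

    Below this x, a 1-step fake-looking scanpath (reward log(1 - x))
    out-scores a 2-step real-looking one (return (1 + gamma) * log(x)).
    """
    f = lambda x: (1.0 - x) - x ** (1.0 + gamma)  # strictly decreasing on (0, 1)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for gamma in (1.0, 0.99, 0.9):
    print(f"gamma={gamma}: x* = {threshold(gamma):.3f}")
# gamma=1.0:  x* ≈ 0.618  (the golden-ratio conjugate)
# gamma=0.99: x* ≈ 0.617
# gamma=0.9:  x* ≈ 0.609
```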

ouyangzhibo commented 3 years ago

Agree! This is indeed one problem of the current method. We are also working to address this issue by designing a termination criterion that lets the model learn when it should stop searching. Ideally, this would prevent the generator from being biased toward shorter scanpaths.
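For example (just a hypothetical sketch, not our actual design), one common way to implement such a criterion is to add an explicit STOP action to the policy's action space, so that the decision to end the scanpath is itself learned and judged by the discriminator:

```python
import torch
import torch.nn as nn

class PolicyWithStop(nn.Module):
    """Hypothetical policy head: N fixation locations + 1 STOP action.

    The rollout ends when the policy samples STOP, so scanpath length
    is learned rather than forced by an external step limit.
    """
    def __init__(self, feat_dim: int, n_locations: int):
        super().__init__()
        self.head = nn.Linear(feat_dim, n_locations + 1)  # last index = STOP
        self.stop_idx = n_locations

    def forward(self, state_feat: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.head(state_feat))

# Rollout sketch: sample fixations until STOP is chosen.
policy = PolicyWithStop(feat_dim=128, n_locations=16 * 16)
state = torch.randn(1, 128)  # placeholder state features
scanpath = []
for _ in range(20):  # hard cap as a safety net
    action = policy(state).sample().item()
    if action == policy.stop_idx:
        break
    scanpath.append(action)
    # state would be updated from the new fixation here
```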