Closed miracle24 closed 4 years ago
But I will consider changing it as you said. Probably that will be a better solution
Alright, about to release a commit to this.
The solution is simple: substitute the historical action with the one generated by the beta network.
Why do we need it? Because we need the probabilities. P.S. The historical a_t is just a number, a movie_id if you like. For everything else a plain number is not enough; we need at least something like softmax probs or a log_prob. You could translate that historical number into a one-hot, but I think beta is less hassle and just better. Anyway, the idea is the same:
Can we do that? Since the entire purpose of the beta is to emulate the historical action, it can give us the probabilities we need. As well as that historical accurate action you were talking about.
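To make the "a plain number is not enough" point concrete, here is a minimal pure-Python sketch (the `beta_logits` values and the 4-item catalogue are made up for illustration): a raw movie_id alone gives you nothing to weight by, whereas a distribution produced by the beta head lets you look up the (log-)probability of that historical id.

```python
import math

def softmax(logits):
    # numerically stable softmax over a plain list of floats
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical output of the beta head for a tiny 4-movie catalogue
beta_logits = [0.1, 2.0, -1.0, 0.5]
beta_probs = softmax(beta_logits)

historical_action = 1  # the logged movie_id: on its own, just an integer

# with the beta distribution we can recover what the correction needs:
beta_prob = beta_probs[historical_action]
beta_log_prob = math.log(beta_prob)
```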
I have added a parameter called 'action_source', and so far I came up with:
```python
class DiscreteActor(nn.Module):
    def __init__(self, input_dim, action_dim, hidden_size, init_w=0):
        super(DiscreteActor, self).__init__()
        ...
        # What's action_source? See this issue: https://github.com/awarebayes/RecNN/issues/7
        # by default {pi: pi, beta: beta}
        # you can change it to be like {pi: beta, beta: beta} as miracle24 suggested
        self.action_source = {'pi': 'pi', 'beta': 'beta'}

    def pi_beta_sample(self, state, beta, action, **kwargs):
        ...
        # 3. sample the actions
        # See this issue: https://github.com/awarebayes/RecNN/issues/7
        # usually it works like:
        # pi_action = pi_categorical.sample(); beta_action = beta_categorical.sample()
        # but changing action_source to {pi: beta, beta: beta} turns it into:
        # pi_action = beta_categorical.sample(); beta_action = beta_categorical.sample()
        available_actions = {
            'pi': pi_categorical.sample(),
            'beta': beta_categorical.sample(),
        }
        pi_action = available_actions[self.action_source['pi']]
        beta_action = available_actions[self.action_source['beta']]
```
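The action_source switch can be illustrated in isolation. Here is a minimal mock (the two sampler stand-ins just return fixed ids, no network involved) showing that `{'pi': 'beta', 'beta': 'beta'}` makes both heads evaluate the same beta-sampled action:

```python
def pick_actions(action_source, pi_sample, beta_sample):
    # mirror of the dict lookup in pi_beta_sample: each head's action is
    # taken from whichever sampler the config names
    available_actions = {'pi': pi_sample, 'beta': beta_sample}
    pi_action = available_actions[action_source['pi']]
    beta_action = available_actions[action_source['beta']]
    return pi_action, beta_action

# default: each distribution samples for itself
print(pick_actions({'pi': 'pi', 'beta': 'beta'}, pi_sample=7, beta_sample=42))    # (7, 42)
# miracle24's suggestion: both heads score the beta-sampled (historical-like) action
print(pick_actions({'pi': 'beta', 'beta': 'beta'}, pi_sample=7, beta_sample=42))  # (42, 42)
```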
```python
policy_net = recnn.nn.DiscreteActor(1290, num_items, 2048).to(cuda)
# as miracle24 suggested in https://github.com/awarebayes/RecNN/issues/7
# you can enable this to make it closer to what the paper's authors intended:
# policy_net.action_source = {'pi': 'beta', 'beta': 'beta'}
```
I have released a commit for this; you can look at it here.
Write me back and let me know whether I can close the issue.
Actually, I do not know which one is correct or better. Maybe I will try both on my task. Thanks a lot.
Hello. Thanks for your great work, from which I learned a lot about reinforcement learning. I am confused about the computation of the correction weight in the importance sampling.
According to the paper "Top-K Off-Policy Correction for a REINFORCE Recommender System", the correction weight is π_θ(a_t | s_t) / β(a_t | s_t), in which a_t is, I think, an action sampled from the behavior policy β. Thus the reward of a sequence is corrected by dividing the likelihood of action a_t given s_t under the updated policy π_θ by the likelihood of the same action a_t given s_t under β. Note that the same (s_t, a_t) pair should be input into both π_θ and β. However, in your implementation I found that the actions are not the same for π_θ and β — see the function "pi_beta_sample" here.
Am I wrong about this?
Thanks!
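For what it's worth, the weight described above only takes a few lines once both policies are evaluated at the same logged action. A minimal pure-Python sketch, with made-up probabilities (0.30 and 0.10 are illustrative, not from the repo):

```python
import math

# hypothetical probabilities of the SAME logged action a_t under both policies
log_pi_a = math.log(0.30)    # pi_theta(a_t | s_t), the updated policy
log_beta_a = math.log(0.10)  # beta(a_t | s_t), the behavior policy

# importance-sampling correction weight pi_theta(a_t|s_t) / beta(a_t|s_t),
# computed in log-space for numerical stability
weight = math.exp(log_pi_a - log_beta_a)
print(round(weight, 6))  # 3.0
```

The whole point of the question is that both log-probs here must come from the same a_t, which is what the `{'pi': 'beta', 'beta': 'beta'}` configuration enforces.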