awarebayes / RecNN

Reinforced Recommendation toolkit built around PyTorch 1.7
Apache License 2.0

Question about the weight for correction in the importance sampling #7

Closed: miracle24 closed this issue 4 years ago

miracle24 commented 4 years ago

Hello. Thanks for your great work, from which I learned a lot about reinforcement learning. I am confused about the computation of the correction weight in the importance sampling.

According to the paper "Top-K Off-Policy Correction for a REINFORCE Recommender System", the correction weight is $\frac{\pi_\theta(a_t \mid s_t)}{\beta(a_t \mid s_t)}$, in which $a_t$, I think, is an action sampled from the behavior policy, i.e. $a_t \sim \beta(\cdot \mid s_t)$. The reward of a sequence is thus corrected by dividing the likelihood of action $a_t$ given $s_t$ under the updated policy $\pi_\theta$ by the likelihood of the same action $a_t$ given $s_t$ under $\beta$. Note that the same $a_t$ is fed into both $\pi_\theta$ and $\beta$. However, in your implementation I found that the actions are not the same for $\pi_\theta$ and $\beta$; see the function "pi_beta_sample" here.
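
For concreteness, here is a minimal sketch of the weight computation as I understand it from the paper (the function and variable names below are just for illustration, they are not from RecNN):

import torch
from torch.distributions import Categorical

def correction_weight(pi_logits, beta_logits, action):
    # The same logged action a_t (sampled from the behavior policy beta)
    # is scored under both pi_theta and beta; only the probabilities differ.
    pi_log_prob = Categorical(logits=pi_logits).log_prob(action)
    beta_log_prob = Categorical(logits=beta_logits).log_prob(action)
    # w_t = pi_theta(a_t | s_t) / beta(a_t | s_t)
    return torch.exp(pi_log_prob - beta_log_prob)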

Am I wrong about this?

Thanks!

awarebayes commented 4 years ago

I will consider changing it as you said; that will probably be a better solution.

awarebayes commented 4 years ago

Alright, I'm about to release a commit for this.

The solution is simple: substitute the historical action with the one generated by the beta network.

I have added a parameter called 'action_source', and so far this is what I have come up with:

class DiscreteActor(nn.Module):
    def __init__(self, input_dim, action_dim, hidden_size, init_w=0):
        super(DiscreteActor, self).__init__()
        ....
        # What's action source? See this issue: https://github.com/awarebayes/RecNN/issues/7
        # by default {pi: pi, beta: beta}
        # you can change it to be like {pi: beta, beta: beta} as miracle24 suggested
        self.action_source = {'pi': 'pi', 'beta': 'beta'}

    def pi_beta_sample(self, state, beta, action, **kwargs):
        ....

        # 3. sample the actions
        # See this issue: https://github.com/awarebayes/RecNN/issues/7
        # usually it works like:
        # pi_action = pi_categorical.sample(); beta_action = beta_categorical.sample()
        # but with action_source set to {pi: beta, beta: beta} it becomes:
        # pi_action = beta_categorical.sample(); beta_action = beta_categorical.sample()
        available_actions = {'pi': pi_categorical.sample(), 'beta': beta_categorical.sample()}
        pi_action = available_actions[self.action_source['pi']]
        beta_action = available_actions[self.action_source['beta']]

policy_net = recnn.nn.DiscreteActor(1290, num_items, 2048).to(cuda)
# as miracle24 has suggested https://github.com/awarebayes/RecNN/issues/7
# uncomment the next line to make it behave more like the paper's authors intended
# policy_net.action_source = {'pi': 'beta', 'beta': 'beta'}
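
For context, here is a rough sketch of how the two log-probabilities would then typically enter the off-policy-corrected REINFORCE loss (variable names are illustrative, not the exact RecNN API):

import torch

def corrected_reinforce_loss(pi_log_prob, beta_log_prob, discounted_reward):
    # Importance-sampling weight w = pi(a|s) / beta(a|s), detached so gradients
    # only flow through the pi log-probability term.
    correction = torch.exp(pi_log_prob - beta_log_prob).detach()
    # Off-policy-corrected REINFORCE objective (negated because we minimize).
    return -(correction * discounted_reward * pi_log_prob).mean()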
awarebayes commented 4 years ago

I have released a commit for this; you can look at it here.

Write back and let me know if I can close the issue.

miracle24 commented 4 years ago

Actually, I do not know which one is correct or better. Maybe I can try both in my task. Thanks a lot.