Closed miracle24 closed 4 years ago
But I will consider changing it as you said. Probably that will be a better solution
Alright, about to release a commit to this.
The solution is simple: substitute the historical action with the one generated by the beta network.
Why do we need it? Because we need the probabilities. P.S. The historical a_t is just a number, a movie_id if you like. For everything else a plain number is not enough; we need at least something like softmax probs or a log_prob. You could translate that historical number into a one-hot, but I think beta is less hassle and just better. Anyway, the idea is the same:
Can we do that? Since the entire purpose of the beta is to emulate the historical action, it can give us the probabilities we need. As well as that historical accurate action you were talking about.
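To make the "a plain number is not enough" point concrete, here is a minimal pure-Python sketch (the `beta_logits` values and the 4-item catalogue are made up for illustration): a raw movie_id alone gives you nothing to weight by, whereas a distribution produced by the beta head lets you look up the (log-)probability of that historical id.

```python
import math

def softmax(logits):
    # numerically stable softmax over a plain list of floats
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical output of the beta head for a tiny 4-movie catalogue
beta_logits = [0.1, 2.0, -1.0, 0.5]
beta_probs = softmax(beta_logits)

historical_action = 1  # the logged movie_id: on its own, just an integer

# with the beta distribution we can recover what the correction needs:
beta_prob = beta_probs[historical_action]
beta_log_prob = math.log(beta_prob)
```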
I have added a parameter called 'action_source', and so far I came up with:
```python
class DiscreteActor(nn.Module):
    def __init__(self, input_dim, action_dim, hidden_size, init_w=0):
        super(DiscreteActor, self).__init__()
        ...
        # What's action_source? See this issue: https://github.com/awarebayes/RecNN/issues/7
        # by default {pi: pi, beta: beta}
        # you can change it to be like {pi: beta, beta: beta} as miracle24 suggested
        self.action_source = {'pi': 'pi', 'beta': 'beta'}

    def pi_beta_sample(self, state, beta, action, **kwargs):
        ...
        # 3. sample the actions
        # See this issue: https://github.com/awarebayes/RecNN/issues/7
        # usually it works like:
        # pi_action = pi_categorical.sample(); beta_action = beta_categorical.sample()
        # but changing action_source to {pi: beta, beta: beta} turns it into:
        # pi_action = beta_categorical.sample(); beta_action = beta_categorical.sample()
        available_actions = {
            'pi': pi_categorical.sample(),
            'beta': beta_categorical.sample(),
        }
        pi_action = available_actions[self.action_source['pi']]
        beta_action = available_actions[self.action_source['beta']]
```
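The action_source switch can be illustrated in isolation. Here is a minimal mock (the two sampler stand-ins just return fixed ids, no network involved) showing that `{'pi': 'beta', 'beta': 'beta'}` makes both heads evaluate the same beta-sampled action:

```python
def pick_actions(action_source, pi_sample, beta_sample):
    # mirror of the dict lookup in pi_beta_sample: each head's action is
    # taken from whichever sampler the config names
    available_actions = {'pi': pi_sample, 'beta': beta_sample}
    pi_action = available_actions[action_source['pi']]
    beta_action = available_actions[action_source['beta']]
    return pi_action, beta_action

# default: each distribution samples for itself
print(pick_actions({'pi': 'pi', 'beta': 'beta'}, pi_sample=7, beta_sample=42))    # (7, 42)
# miracle24's suggestion: both heads score the beta-sampled (historical-like) action
print(pick_actions({'pi': 'beta', 'beta': 'beta'}, pi_sample=7, beta_sample=42))  # (42, 42)
```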
```python
policy_net = recnn.nn.DiscreteActor(1290, num_items, 2048).to(cuda)
# as miracle24 suggested in https://github.com/awarebayes/RecNN/issues/7
# you can enable this to make it closer to what the paper's authors intended:
# policy_net.action_source = {'pi': 'beta', 'beta': 'beta'}
```
I have released a commit for this; you can look at it here.
Write me back and let me know whether I can close the issue.
Actually, I do not know which one is correct or better. Maybe I will try both on my task. Thanks a lot.
Hello. Thanks for your great work, from which I learned a lot about reinforcement learning. I am confused about the computation of the correction weight in the importance sampling.
According to the paper "Top-K Off-Policy Correction for a REINFORCE Recommender System", the correction weight is π_θ(a_t | s_t) / β(a_t | s_t), in which a_t is, I think, an action sampled from the behavior policy β. Thus the reward of a sequence is corrected by dividing the likelihood of action a_t given s_t under the updated policy π_θ by the likelihood of the same action a_t given s_t under β. Note that the same (s_t, a_t) pair should be input into both π_θ and β. However, in your implementation I found that the actions are not the same for π_θ and β — see the function "pi_beta_sample" here.
Am I wrong about this?
Thanks!
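For what it's worth, the weight described above only takes a few lines once both policies are evaluated at the same logged action. A minimal pure-Python sketch, with made-up probabilities (0.30 and 0.10 are illustrative, not from the repo):

```python
import math

# hypothetical probabilities of the SAME logged action a_t under both policies
log_pi_a = math.log(0.30)    # pi_theta(a_t | s_t), the updated policy
log_beta_a = math.log(0.10)  # beta(a_t | s_t), the behavior policy

# importance-sampling correction weight pi_theta(a_t|s_t) / beta(a_t|s_t),
# computed in log-space for numerical stability
weight = math.exp(log_pi_a - log_beta_a)
print(round(weight, 6))  # 3.0
```

The whole point of the question is that both log-probs here must come from the same a_t, which is what the `{'pi': 'beta', 'beta': 'beta'}` configuration enforces.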