awarebayes / RecNN

Reinforced Recommendation toolkit built around pytorch 1.7
Apache License 2.0
574 stars 113 forks

Questions about Topk REINFORCE #8

Closed wwwangzhch closed 4 years ago

wwwangzhch commented 4 years ago

Hello, thanks for sharing! I have some questions about pi_beta_sample in models.py. You use this function in _select_action_with_TopK_correction, but it seems to only sample one item at a time? I am also confused by Equation 6 in the original paper, [equation image]: since we want to sample a set of top-K items, shouldn't it be [equation image]? Here a_{t, i} represents the i-th item at time t. I would appreciate any comments on my question, since it has been bothering me for a long time.
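For reference, this is roughly how I understand the per-item correction factor from the paper (just a sketch with made-up names, not the repo's pi_beta_sample):

    import torch
    import torch.nn.functional as F

    def topk_corrected_weight(pi_logits, beta_logits, actions, K):
        # pi_logits / beta_logits: [batch, num_items]; actions: [batch, k] LongTensor of sampled item ids
        pi = F.softmax(pi_logits, dim=-1).gather(-1, actions)
        beta = F.softmax(beta_logits, dim=-1).gather(-1, actions)
        # lambda_K(s, a) = K * (1 - pi(a|s))^(K - 1), the top-K correction from the paper
        lambda_k = K * (1.0 - pi).pow(K - 1)
        # per-item importance weight with the top-K correction applied
        return (pi / beta) * lambda_k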

awarebayes commented 4 years ago

I honestly don't know; I tried to make it the way the paper's authors suggested (with one little tweak discussed here: https://github.com/awarebayes/RecNN/issues/7). Your version does seem more logical, though. Have you tried to implement it? I can also test it to make sure it's doing the right thing. I've also recently found that the algorithm really lacks a TopK normalizing term in prediction, so it often gets stuck recommending only one or a few items. Adding some diversity penalty or taking the top k recommendations could be an improvement.
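Something along these lines is what I have in mind for prediction (a rough sketch; the penalty and names are hypothetical, not code from the repo):

    import torch

    def diverse_topk(scores, k, already_shown=None, penalty=0.1):
        # instead of always returning a single argmax (which tends to collapse onto
        # the same few items), take the k highest-scoring candidates, optionally
        # down-weighting items that were already recommended
        adjusted = scores.clone()
        if already_shown is not None:
            adjusted[already_shown] -= penalty
        _, top_idx = torch.topk(adjusted, k)
        return top_idx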

If you happen to implement this (perhaps as a separate function), feel free to open a pull request here.

wwwangzhch commented 4 years ago

I haven't implemented this paper, but the definition of the action in my research is similar to it. More specifically, I also need to get a set of items according to the item scores produced by the neural network. When selecting items, I use

        if deterministic:   # testing: take the K highest-scoring items
            w_p, w_idx = torch.topk(scores, K)
        else:               # training: sample K distinct items in proportion to their scores
            w_idx = torch.multinomial(scores, K)

and then I use torch.gather to get the corresponding log-probability of each selected item. After that, I sum these log-probabilities at each step and multiply by R_t when updating the net. I found this paper because the example implementations of policy gradient all choose only one item per step, so I wondered whether my method is correct or not. I don't add any correction factor in my code, so my gradient is $\sum_t R_t \sum_i \nabla_\theta \log \pi_\theta(a_{t,i} \mid s_t)$, where a_{t, i} represents a selected item at time step t. I tried to email the authors about my question, but so far there has been no response.
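Put together, my update looks roughly like this (a sketch with illustrative names; it assumes scores are already valid probabilities for multinomial):

    import torch

    def slate_reinforce_loss(scores, returns, K, deterministic=False):
        # scores: [T, num_items] per-step item probabilities from the policy net
        # returns: [T] return R_t for each step
        log_probs = torch.log(scores + 1e-8)
        if deterministic:                        # testing: take the K best items
            _, w_idx = torch.topk(scores, K, dim=-1)
        else:                                    # training: sample K items without replacement
            w_idx = torch.multinomial(scores, K)
        # sum of log-probabilities of the selected items at each step
        selected = torch.gather(log_probs, -1, w_idx).sum(dim=-1)   # [T]
        # REINFORCE without correction: maximize sum_t R_t * sum_i log pi(a_{t,i} | s_t)
        return -(returns * selected).sum()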

awarebayes commented 4 years ago

Yes, they also didn't respond to my email, where I notified them about my repo. Thanks for sharing; I will work on making the algorithm more stable and will make sure to try this approach.

wwwangzhch commented 4 years ago

OK, I will keep watching this repository. Please let me know if you have any new thoughts, and thanks for sharing too.

almajo commented 4 years ago

In case someone else comes back to this at some point: I was wondering the same thing, and I implemented it for the scenario where only one action per slate can/will be clicked anyway, so when we receive feedback we know which item that feedback corresponds to.

I guess the authors did the same thing, because this passage sounds like it:

(2) While the main policy head π_θ is trained using only items on the trajectory with non-zero reward [3], the behavior policy β_θ′ is trained using all of the items on the trajectory to avoid introducing bias in the β estimate.

with footnote 3 saying:

We ignore them in the user state update as users are unlikely to notice them and as a result, we assume the user state are not influenced by these actions
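For what it's worth, here is how I read that split in code form (purely illustrative names and shapes, not the paper's or this repo's actual implementation):

    import torch
    import torch.nn.functional as F

    def pi_beta_losses(pi_logits, beta_logits, slate_items, reward):
        # pi_logits / beta_logits: [T, num_items] outputs of the two heads
        # slate_items: [T, K] item ids shown at each step
        # reward: [T, K] per-item reward, non-zero only for the clicked item (if any)
        log_pi = F.log_softmax(pi_logits, dim=-1).gather(-1, slate_items)
        log_beta = F.log_softmax(beta_logits, dim=-1).gather(-1, slate_items)
        # main policy head: only items with non-zero reward contribute
        pi_loss = -(reward * log_pi).sum()
        # behavior head: plain log-likelihood over every item on the trajectory
        beta_loss = -log_beta.sum()
        return pi_loss, beta_loss

If I remember correctly, the paper also blocks the β head's gradient from flowing back into the shared state representation, which this sketch omits.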