dawenl / vae_cf

Variational autoencoders for collaborative filtering
Apache License 2.0

A question about Negative sampling? #3

Open ConanCui opened 6 years ago

ConanCui commented 6 years ago

Hi, I have two questions about the paper that I could not work out on my own, and I would sincerely appreciate any detail or explanation you can give.

The first concerns the assumption that the multinomial distribution is better suited to ranking metrics. Viewed another way, a multinomial distribution implies a limited budget of probability mass, so purchases of different goods are treated as exclusive. But in some situations purchases are not exclusive: for example, buying a mobile phone and buying a phone case are not mutually exclusive.

The second concerns the experiment in Table 4, which compares the performance of different likelihood functions. As far as I know, most collaborative filtering methods use Gaussian or logistic likelihoods together with negative sampling or weighting. Equation (3) gives the Gaussian likelihood with c_{ui}, which I read as caring only about the entries equal to 1, and the same goes for equation (4). But I cannot find negative sampling, one of the most important tricks in recommendation, in the Gaussian, logistic, or multinomial likelihoods. As far as I know, NCF [1], CVAE [2], and many other methods use negative sampling (with both Gaussian and logistic likelihoods), and it boosts their performance. My concern is that the multinomial likelihood cannot accommodate negative sampling because of its mathematical form. So I wonder whether the multinomial likelihood can beat the logistic likelihood with negative sampling. Also, did you run NCF with negative sampling?

[1] Neural Collaborative Filtering, Xiangnan He et al., WWW 2017.
[2] Collaborative Variational Autoencoder for Recommender Systems, KDD 2017.

dawenl commented 6 years ago

1) Competing for probability mass doesn't necessarily mean exclusive. A multinomial allows multiple non-zero entries. If two items tend to co-occur, the model can certainly learn to give both of them probability mass.

2) In Eq. (3), the purpose of c_{ui} is to down-weight the 0's (since c_{ui} != 0 even when x_{ui} = 0), which is equivalent to negative sampling. I am not sure why you think it only cares about the 1's. I didn't do logistic with negative sampling because I didn't find it much more helpful. If you can get better results from logistic with negative sampling (all the necessary code should be available to you), please let me know and I will be happy to include that in an updated version of the paper on arXiv. Whatever NCF used in its public source code is what I used.
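To make point 1 concrete, here is a minimal NumPy sketch (not taken from the repository's code) of a multinomial log-likelihood over softmax probabilities. The four-item catalogue and the logit values are made up for illustration. It compares a model that puts mass on both co-occurring items with one that puts nearly all mass on a single item.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of item logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def multinomial_log_likelihood(x, logits):
    """sum_i x_i * log pi_i, where pi = softmax(logits)."""
    z = logits - logits.max()
    log_pi = z - np.log(np.exp(z).sum())
    return float((x * log_pi).sum())

# Hypothetical catalogue: [phone, phone case, blender, tent].
x = np.array([1.0, 1.0, 0.0, 0.0])      # the user interacted with phone and case

both = np.array([3.0, 3.0, 0.0, 0.0])   # mass shared by the two co-occurring items
one = np.array([6.0, 0.0, 0.0, 0.0])    # almost all mass on the phone only

pi_both = softmax(both)
ll_both = multinomial_log_likelihood(x, both)
ll_one = multinomial_log_likelihood(x, one)
```

Under these made-up logits, `ll_both` is higher than `ll_one`: the likelihood rewards spreading mass over every item the user interacted with, so the shared budget does not force the items to be exclusive.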

ConanCui commented 6 years ago

I understand the first point now.

For the second point, I understand the case c_{ui} != 0. I took your code and replaced the loss function with a logistic likelihood with negative sampling, as shown below: [image]

Here omega^{+}_{u} is the set of all observed positive samples of user u, and omega^{-}_{u} is a set of negative samples drawn at random from user u's entries excluding the positives. I write K for the negative sampling ratio, i.e. K = |omega^{-}_{u}| / |omega^{+}_{u}|.

I ran experiments with the logistic likelihood and negative sampling for different values of K. I found that performance improves as K grows, and that the logistic likelihood performs best when all zero entries are taken as negative samples. But even the best performance of the logistic likelihood is still worse than the multinomial.

I also found a paper that uses a variational autoencoder with a logistic likelihood [1], similar to your work, and negative sampling improves its results a lot.
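Since the image of the loss is not available, the following is only a sketch of what a logistic loss with negative sampling at ratio K might look like, assuming the usual form: -sum over omega^{+}_{u} of log sigma(f_ui) minus sum over omega^{-}_{u} of log(1 - sigma(f_uj)). The function names and the sampling scheme (uniform, without replacement) are assumptions, not the code used in the experiments.

```python
import numpy as np

def sample_negatives(x_u, K, rng):
    """Return (positives, negatives) index arrays for one user.

    Negatives are drawn uniformly without replacement from the zero
    entries of x_u; their number is K * |positives|, capped at the
    number of zeros available.
    """
    pos = np.flatnonzero(x_u > 0)
    zeros = np.flatnonzero(x_u == 0)
    n_neg = min(len(zeros), K * len(pos))
    neg = rng.choice(zeros, size=n_neg, replace=False)
    return pos, neg

def logistic_nll_with_negatives(logits_u, pos, neg):
    """Negative log-likelihood over positives and sampled negatives.

    Uses log sigma(f) = -logaddexp(0, -f) and
    log(1 - sigma(f)) = -logaddexp(0, f) for numerical stability.
    """
    pos_term = np.logaddexp(0.0, -logits_u[pos]).sum()
    neg_term = np.logaddexp(0.0, logits_u[neg]).sum()
    return float(pos_term + neg_term)
```

When K is large enough that every zero entry is sampled, this reduces to the ordinary logistic loss over the full row.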

[1] Augmented Variational Autoencoders for Collaborative Filtering with Auxiliary Information, CIKM 2017. http://aai.kaist.ac.kr/xe2/module=file&act=procFileDownload&file_srl=18019&sid=4be19b9d0134a4aeacb9ef1ecd81c784&module_srl=1379

dawenl commented 6 years ago

I am not sure I follow. For logistic, isn't what I did exactly using all the 0's as negatives?

dawenl commented 6 years ago

I think I understand now, and maybe you misunderstood what I did -- for both Gaussian and logistic, I used all the 0's in the training. With Gaussian, I applied the c_{ui} weight which is in effect down-weighting all the negatives. With logistic, I simply used all the 0's, which I think corresponds to what you mean by setting K to the largest possible.
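A rough sketch of the two setups described here, for a single user row: a Gaussian log-likelihood with confidence weights c_{ui}, and a logistic log-likelihood over every entry, zeros included. The weight values `c1` and `c0` are placeholder assumptions, not the values used in the paper's experiments.

```python
import numpy as np

def gaussian_log_likelihood(x_u, f_u, c1=10.0, c0=1.0):
    """-(1/2) * sum_i c_ui * (x_ui - f_ui)^2, up to an additive constant.

    c_ui = c1 where x_ui = 1 and c0 where x_ui = 0; with c1 > c0 the
    zeros still contribute, but with less weight (assumed values).
    """
    c = np.where(x_u > 0, c1, c0)
    return float(-0.5 * (c * (x_u - f_u) ** 2).sum())

def logistic_log_likelihood(x_u, f_u):
    """sum_i x_ui * log sigma(f_ui) + (1 - x_ui) * log(1 - sigma(f_ui)).

    Every zero entry is used as a negative, with no sampling or weighting.
    """
    log_sig = -np.logaddexp(0.0, -f_u)          # log sigma(f)
    log_one_minus_sig = -np.logaddexp(0.0, f_u)  # log(1 - sigma(f))
    return float((x_u * log_sig + (1.0 - x_u) * log_one_minus_sig).sum())
```

With these placeholder weights, a unit error on a zero entry costs a tenth of what the same error costs on a positive entry, which is the down-weighting effect mentioned above.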

ConanCui commented 6 years ago

Hi, I have a question about how to apply your data split to the WMF baseline. As I understand it, your split looks like the diagram below: [image] Each row of the matrix holds one user's interaction data over all items. The data in the blue rectangle is used for training, the data in the red rectangle is used to obtain the necessary representation for test users, and the data in the green rectangle is used to compute NDCG. As far as I know, WMF needs to see all users during training. How do you use this split with WMF as a baseline?

dawenl commented 6 years ago

Your diagram looks correct. (One minor detail, just to be clear: the split between red and green is random for each test user. It is not the case that certain items appear only in red or only in green for all test users.)
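As an illustration of the per-user random split, a small sketch follows. The function name and the 20% held-out fraction are assumptions made for this example, not a quote of the repository's preprocessing code.

```python
import numpy as np

def split_fold_in(item_ids, test_fraction=0.2, rng=None):
    """Randomly split one held-out user's items into two disjoint parts.

    Returns (fold_in, held_out): fold_in plays the role of the red
    rectangle (used to build the user's representation), held_out the
    green rectangle (used to compute ranking metrics). The permutation
    is drawn separately for each call, i.e. for each user.
    """
    rng = np.random.default_rng() if rng is None else rng
    item_ids = np.asarray(item_ids)
    n_held_out = int(test_fraction * len(item_ids))
    perm = rng.permutation(len(item_ids))
    held_out = item_ids[perm[:n_held_out]]
    fold_in = item_ids[perm[n_held_out:]]
    return fold_in, held_out
```

Calling this once per test user, with a shared `rng`, gives each user a different random partition, so no item is tied to the red or green side globally.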

I think there is only one sensible way to do it. Rather than me directly feeding you the answer, maybe you can think about it first and tell me how you would do it?

ConanCui commented 6 years ago

You are right, the split is random; I drew the diagram that way only to keep it simple. I have tried training WMF on the data in the blue and red rectangles together, because between them they cover all the users, and then predicting the green rectangle. But I wonder whether something is wrong with this. With this training strategy, the test users' interaction data (the red rectangle) influences the learnable parameters of WMF (the user and item latent embeddings), which means I leak test data into the training process. For the VAE, although the data in the red rectangle is used to obtain the necessary representation for a test user, it has no influence on the VAE's learnable parameters.

That is my thinking, but training WMF this way seems to have the problem described above. Is there anything wrong with my reasoning, and how do you do it?

dawenl commented 6 years ago

Yes, you are right that this would leak the validation data for WMF. A simple fix (this is how I did it) is to train WMF only on the blue box and keep only the item factors. Then, during evaluation, keep the item factors fixed, learn the validation users' factors from the red box (which corresponds to one ALS update), and make predictions for the green box. This is known as strong generalization.
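A minimal sketch of that fold-in step, assuming the standard weighted ridge-regression form of the WMF user update. The hyperparameter values (`c1`, `c0`, `lam`) and function names are placeholders, not the settings used for the paper's baseline.

```python
import numpy as np

def fold_in_user(x_u, item_factors, c1=10.0, c0=1.0, lam=0.1):
    """One ALS update for a held-out user with item factors frozen.

    Solves (V^T C_u V + lam * I) theta_u = V^T C_u x_u, where V holds the
    item factors learned on the training users only and C_u is the
    diagonal matrix of confidence weights for this user's fold-in row.
    """
    V = item_factors                              # shape (n_items, k)
    c = np.where(x_u > 0, c1, c0)                 # per-item confidence
    A = V.T @ (c[:, None] * V) + lam * np.eye(V.shape[1])
    b = V.T @ (c * x_u)
    return np.linalg.solve(A, b)

def score_items(theta_u, item_factors, x_u):
    """Predicted scores for ranking, with fold-in items masked out."""
    scores = item_factors @ theta_u
    scores[x_u > 0] = -np.inf                     # never re-recommend fold-in items
    return scores
```

Here `x_u` is the test user's fold-in row (red box). The ranking produced from `score_items` is then compared against the held-out items (green box), and no parameter learned from training users changes in the process.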

JoaoLages commented 5 years ago

I wonder why you didn't use binary cross-entropy rather than cross-entropy, since this is a multi-label problem. I also wonder why negative sampling or another such technique wasn't applied, since your vocabulary is very large.

JoaoLages commented 5 years ago

Also, in production, how do you represent new videos with this architecture?