chyikwei / recommend

recommendation system with python
310 stars 118 forks source link

Running bmf with my data #8

Closed ghost closed 7 years ago

ghost commented 7 years ago

I have a 31x9 matrix and I want to run BPMF on it with your code. First, I read the matrix in sparse format (180x3), as in your example. Then I take the max of the first and second columns and run your code:

print n_user    # 31
print n_item    # 9
print n_feat    # 15
print ratings   # numpy array

[[ 1  1 11]
 [ 1  5  7]
 [ 1  6 12]
...
 [31  5  7]
 [31  6  9]
 [31  8  9]]

#fit model
bpmf = BPMF(n_user=n_user, n_item=n_item, n_feature=n_feat,
                max_rating=15., min_rating=0., seed=0).fit(ratings, n_iters=20)
print RMSE(bpmf.predict(ratings[:, :2]), ratings[:,2]) # training RMSE

And I get the following error:

raise ValueError("max user_id >= %d", n_user)
ValueError: ('max user_id >= %d', 31)

What am I doing wrong? It actually works if I set n_user = 32 and n_item = 10, but does that make any sense? Furthermore, bpmf.predict(ratings) only returns the approximated values for my initial ratings. What about the rest of the values?

chyikwei commented 7 years ago

Hi, in my code I assume both user and item indices start from 0, and it looks like your ids start from 1. Shifting the ids by 1 should work, like this:

ratings[:, (0, 1)] -= 1
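For example, applying the shift to a few of the rows shown above (a quick sketch to verify the ids end up in range):

```python
import numpy as np

# A few rows from the ratings array above (1-based user/item ids).
ratings = np.array([[ 1, 1, 11],
                    [ 1, 5,  7],
                    [31, 8,  9]])

ratings[:, (0, 1)] -= 1  # shift user/item ids to be 0-based

print(ratings[:, 0].max())  # 30, now strictly less than n_user = 31
print(ratings[:, 1].max())  # 7, strictly less than n_item = 9
```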
ghost commented 7 years ago

Hey,

Thanks a lot, I really appreciate the help.

I have one more question. The result of the factorization is a vector with the same size as the initial ratings (my initial sparse matrix was 160x3, and the three columns are user-item-rating). bpmf.predict(ratings) returns a vector of size 160x1, i.e. the approximated initial values. How can I see what happens for the rest of the values? There are 31 users and 9 items in total, so there are 279 possible ratings. How can I see the approximated values for all possible ratings?

Kind regards, and thanks again,

Christos


chyikwei commented 7 years ago

Hi, you can list all the pairs you want to predict.

For example, if you want to predict user_id = 10 with all items, you can do:

>>> user_id = 10
>>> n_item = 9
>>> ratings = np.stack((np.repeat(user_id, n_item), np.arange(n_item)), axis=1)
>>> ratings
array([[10,  0],
       [10,  1],
       [10,  2],
       [10,  3],
       [10,  4],
       [10,  5],
       [10,  6],
       [10,  7],
       [10,  8]])
>>> bpmf.predict(ratings)
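To score all 31 x 9 = 279 possible pairs at once, the same idea extends to the full Cartesian product of user and item ids (a sketch; `bpmf` is assumed to be the fitted model from earlier in the thread):

```python
import numpy as np

n_user, n_item = 31, 9

# Every (user_id, item_id) combination: 31 * 9 = 279 rows.
user_ids = np.repeat(np.arange(n_user), n_item)  # 0,0,...,0,1,1,...
item_ids = np.tile(np.arange(n_item), n_user)    # 0,1,...,8,0,1,...
all_pairs = np.stack((user_ids, item_ids), axis=1)

print(all_pairs.shape)  # (279, 2)

# preds = bpmf.predict(all_pairs)                # 279 predicted ratings
# rating_matrix = preds.reshape(n_user, n_item)  # full 31x9 matrix
```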
xyanggu commented 7 years ago

I set eval_iters = 50 and encountered a problem (RuntimeWarning: overflow encountered in multiply). As eval_iters increases, I only want to obtain the minimized RMSE, but I don't know how to solve this. Here is the output:

recommend-0.1.0-py2.7.egg/recommend/pmf.py:86: RuntimeWarning: overflow encountered in multiply
recommend-0.1.0-py2.7.egg/recommend/pmf.py:88: RuntimeWarning: overflow encountered in multiply
recommend-0.1.0-py2.7.egg/recommend/pmf.py:89: RuntimeWarning: overflow encountered in multiply
recommend-0.1.0-py2.7.egg/recommend/pmf.py:90: RuntimeWarning: overflow encountered in multiply
recommend-0.1.0-py2.7.egg/recommend/pmf.py:97: RuntimeWarning: invalid value encountered in add
recommend-0.1.0-py2.7.egg/recommend/pmf.py:104: RuntimeWarning: invalid value encountered in add
recommend-0.1.0-py2.7.egg/recommend/pmf.py:133: RuntimeWarning: invalid value encountered in greater
site-packages/recommend-0.1.0-py2.7.egg/recommend/pmf.py:136: RuntimeWarning: invalid value encountered in less
INFO: iter: 24, train RMSE: nan
INFO: iter: 25, train RMSE: nan
INFO: iter: 26, train RMSE: nan
INFO: iter: 27, train RMSE: nan
...

chyikwei commented 7 years ago

Closing this issue since #9 has been created.

kristosh commented 7 years ago

Thanks a lot for the toolbox; it's easy to use and very nice. I have a question though: could you explain how exactly the predict functionality works for the validation samples (both for BPMF and PMF)? You split the data into train and validation sets, but when I use a small train size (for example 5) compared to the validation set, the validation RMSE is still small. Does that make sense?

chyikwei commented 7 years ago

Hi,

To predict the rating for user i and item j, it is simply the dot product user_features[i] · item_features[j] + mean_rating (user_features and item_features are the latent variables we learned during training). (source)
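In NumPy terms the prediction is just an inner product plus the global mean. A minimal sketch with random stand-ins for the learned factors (toy values, not the library's internals):

```python
import numpy as np

rng = np.random.RandomState(0)
n_user, n_item, n_feature = 31, 9, 15

# Toy stand-ins for the latent factors learned during training.
user_features = rng.rand(n_user, n_feature)
item_features = rng.rand(n_item, n_feature)
mean_rating = 7.5

# Predicted rating for user i and item j.
i, j = 4, 2
pred = user_features[i].dot(item_features[j]) + mean_rating
```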

And when the train size is small, the validation RMSE should be large.

kristosh commented 7 years ago

I am trying to use your example. I set train_pct = 0.001 and n_feature = 30 and got the following results: after 10 iterations, train RMSE: 1.120622, validation RMSE: 1.148221. Shouldn't the validation RMSE be higher?

chyikwei commented 7 years ago

I changed the pmf example with train_pct = 0.001 and n_feature = 30 and got this result:

n_user: 6040, n_item: 3952, n_feature: 30, training size: 1000, validation size: 999209
INFO: iter: 0, train RMSE: 1.117883
INFO: iter: 1, train RMSE: 1.097334
INFO: iter: 2, train RMSE: 1.070012
INFO: iter: 3, train RMSE: 1.034619
INFO: iter: 4, train RMSE: 0.987241
INFO: iter: 5, train RMSE: 0.923363
INFO: iter: 6, train RMSE: 0.839857
INFO: iter: 7, train RMSE: 0.733179
INFO: iter: 8, train RMSE: 0.604912
INFO: iter: 9, train RMSE: 0.474990
after 10 iterations, train RMSE: 0.474990, validation RMSE: 1.113232

Validation RMSE is much higher than training RMSE.

chyikwei commented 7 years ago

For BPMF, I can get a similar result by increasing the beta-related parameters. Using beta=10., beta_user=10., beta_item=10. in the example, I get:

n_user: 6040, n_item: 3952, n_feature: 30, training size: 1000, validation size: 999209
INFO: iter: 0, train RMSE: 1.152316
INFO: iter: 1, train RMSE: 1.142085
INFO: iter: 2, train RMSE: 1.120228
INFO: iter: 3, train RMSE: 1.088892
INFO: iter: 4, train RMSE: 1.064241
INFO: iter: 5, train RMSE: 1.028266
INFO: iter: 6, train RMSE: 0.968250
INFO: iter: 7, train RMSE: 0.890073
INFO: iter: 8, train RMSE: 0.772950
INFO: iter: 9, train RMSE: 0.654719
after 10 iteration, train RMSE: 0.654719, validation RMSE: 1.253363
kristosh commented 7 years ago

Even when I set train_pct = 0.00001, so the train size is ten, the validation RMSE is 1.310131. Somehow the RMSE always seems to be approximately the same.

chyikweiyau commented 7 years ago

Why do you think the RMSE should be higher? 1.31 is very large considering the rating values are between 1 and 5.

If you use 3.0 to predict every data point in the dataset, you only get 1.259.

>>> RMSE(np.repeat(3.0, 1000209), ratings[:, 2])
1.2594181530018158
kristosh commented 7 years ago

My main issue is that I ran NMF with the sklearn implementation and then BPMF, and the results are much better with BPMF. So I am trying to see whether something is not working properly here. Thanks for the help and the information anyway.

chyikwei commented 7 years ago

Did you check the max/min values of your NMF predictions? In BPMF, I clip predictions to the min/max rating in the predict function; you might need to do the same for NMF before comparing the results.
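For example, with NumPy the clipping is one line (a sketch using the 1-5 MovieLens range from the example above; the prediction values are made up):

```python
import numpy as np

min_rating, max_rating = 1.0, 5.0  # MovieLens rating range

# Toy NMF reconstruction values, some outside the valid range.
nmf_pred = np.array([-0.3, 2.7, 6.1, 4.4])
nmf_pred = np.clip(nmf_pred, min_rating, max_rating)
# nmf_pred is now [1.0, 2.7, 5.0, 4.4]
```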

chyikwei commented 7 years ago

Closing this now.