apple / turicreate

Turi Create simplifies the development of custom machine learning models.
BSD 3-Clause "New" or "Revised" License

Adding user_data in recommender breaks the exclude_known parameter #3387

Open azzelena opened 3 years ago

azzelena commented 3 years ago

When I add user_data (side information for the user) in turicreate.recommender.ranking_factorization_recommender.create, the parameter exclude_known== True (exclude all user-item interactions previously seen in the training data) in turicreate.recommender.factorization_recommender.FactorizationRecommender.recommend doesn't work: products that were in the training data are still recommended.

What could be the problem?

TobyRoseman commented 3 years ago

You want exclude_known = True, not exclude_known== True. You only want one equals sign, not two.

azzelena commented 3 years ago

Sorry, there is one =, not two. But it doesn't work anyway :(

TobyRoseman commented 3 years ago

I cannot reproduce this issue. Excluding known items seems to work just fine when using user side information.

The following code:

import turicreate

sf = turicreate.SFrame({
   'user_id': ["0", "0", "0", "1", "1", "2", "2", "2"],
   'item_id': ["a", "b", "c", "a", "b", "b", "c", "d"],
   'rating': [1, 3, 2, 5, 4, 1, 4, 3]
})

user_info = turicreate.SFrame({
   'user_id': ["0", "1", "2"],
   'numeric_feature': [0.1, 12, 22]
})

m = turicreate.factorization_recommender.create(
   sf, target='rating', user_data=user_info)
print(m.recommend([0,1,2], exclude_known=True))

prints:

+---------+---------+--------------------+------+
| user_id | item_id |       score        | rank |
+---------+---------+--------------------+------+
|    0    |    d    | 1.7338812722585795 |  1   |
|    1    |    c    | 3.8376356229560438 |  1   |
|    1    |    d    | 3.0225044876711427 |  2   |
|    2    |    a    | 2.613139470909195  |  1   |
+---------+---------+--------------------+------+
[4 rows x 4 columns]

None of these (user_id, item_id) pairs are in sf.
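
One way to check this mechanically is to compare the two sets of pairs. The following is a minimal sketch in plain Python, with the pairs copied by hand from the snippet and output above. The helper name `known_pairs` is hypothetical and not part of Turi Create.

```python
def known_pairs(train_pairs, recommended_pairs):
    """Return the recommended (user, item) pairs that already occur in training."""
    seen = set(train_pairs)
    return [pair for pair in recommended_pairs if pair in seen]


# (user_id, item_id) pairs from the training SFrame above.
train = list(zip(
    ["0", "0", "0", "1", "1", "2", "2", "2"],
    ["a", "b", "c", "a", "b", "b", "c", "d"],
))

# (user_id, item_id) pairs from the printed recommendations above.
recommended = [("0", "d"), ("1", "c"), ("1", "d"), ("2", "a")]

leaked = known_pairs(train, recommended)
print(leaked)  # an empty list means exclude_known held for this data
```

With real SFrames the same comparison could presumably be done with a join, but that is not shown here.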

azzelena commented 3 years ago

That's right. But if user_info contains user_ids that are not in sf, it breaks.


import turicreate as tc

sf = tc.SFrame({
   'person_id': ["10055", "10055", "10055","200","200"],
   'product_id': ["a", "b", "c","y","u"],
})

item_data = tc.SFrame({
   'product_id': ["a", "b","g"],
   'category': ["10", "20","j"],
})

user_info = tc.SFrame({
   'person_id': ["1", "2", "10055", "200"],
   'fav_category': ["sample", "sample3", "sample", "sample"],
})

model = tc.recommender.ranking_factorization_recommender.create(
   sf,
   user_id="person_id",
   item_id="product_id",
   item_data=item_data,
   user_data=user_info,
   max_iterations=120,
   solver='adagrad',
   random_seed=None,
   verbose=True)

recommended = model.recommend(sf["person_id"].unique(), k=3, exclude_known=True)

print(recommended)
print(sf)

prints:

+-----------+------------+-----------------------+------+
| person_id | product_id |         score         | rank |
+-----------+------------+-----------------------+------+
|    200    |     y      |   0.9996631073397743  |  1   |
|    200    |     u      |   0.9993447021791567  |  2   |
|    200    |     g      | 0.0022382634943607206 |  3   |
|   10055   |     a      |   0.9997131845839977  |  1   |
|   10055   |     b      |   0.9997102391956959  |  2   |
|   10055   |     c      |   0.999197361763928   |  3   |
+-----------+------------+-----------------------+------+
[6 rows x 4 columns]

+-----------+------------+
| person_id | product_id |
+-----------+------------+
|   10055   |     a      |
|   10055   |     b      |
|   10055   |     c      |
|    200    |     y      |
|    200    |     u      |
+-----------+------------+

TobyRoseman commented 3 years ago

You're right. Having user data for users that are not present in the observation data does break exclude_known=True.

I've verified your results and also verified that removing the first two rows of user_info causes things to work as expected.

@hoytak - This is very strange. Any idea what's going on here?

I guess the workaround here is simple. Run:

user_info = user_info.filter_by(sf["person_id"].unique(), 'person_id')

before calling tc.recommender.ranking_factorization_recommender.create.
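
To see which side-data rows that filter keeps, here is a plain-Python stand-in using the IDs from the example above. It only mimics the effect of the filter on lists of dicts and does not call Turi Create. `keep_observed_users` is a hypothetical helper name.

```python
def keep_observed_users(user_rows, observed_ids, key="person_id"):
    """Split side-data rows into users with and without observations."""
    observed = set(observed_ids)
    kept = [row for row in user_rows if row[key] in observed]
    dropped = [row for row in user_rows if row[key] not in observed]
    return kept, dropped


# person_id column of the observation data in the example.
observations = ["10055", "10055", "10055", "200", "200"]

# Rows of user_info in the example.
user_rows = [
    {"person_id": "1", "fav_category": "sample"},
    {"person_id": "2", "fav_category": "sample3"},
    {"person_id": "10055", "fav_category": "sample"},
    {"person_id": "200", "fav_category": "sample"},
]

kept, dropped = keep_observed_users(user_rows, observations)
print([row["person_id"] for row in kept])     # ['10055', '200']
print([row["person_id"] for row in dropped])  # ['1', '2']
```

The trade-off is that side data for users without any observations (here "1" and "2") is dropped before training.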