NicolasHug / Surprise

A Python scikit for building and analyzing recommender systems
http://surpriselib.com
BSD 3-Clause "New" or "Revised" License

A bug when importing data from DataFrame #455

YiranZhang1014 closed 1 year ago

YiranZhang1014 commented 1 year ago

Description

When data is imported from a DataFrame, every estimated rating equals the global mean instead of an actual prediction. Importing the same data set from a file works as expected.

Steps/Code to Reproduce

import pandas as pd
from surprise import SVD
from surprise import Dataset
from surprise import Reader

# Creation of the dataframe. Column names are irrelevant.
ratings_dict = {'itemID': [1, 1, 1, 2, 2],
                'userID': [9, 32, 2, 45, 'user_foo'],
                'rating': [3, 2, 4, 3, 1]}
df = pd.DataFrame(ratings_dict)

# A reader is still needed, but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)

# Fit SVD on the full trainset, then predict one rating (raw ids are passed as str here).
svd_model = SVD()
svd_model.fit(trainset=data.build_full_trainset())
test_case = svd_model.predict(str(1),str(2),verbose=True)

Expected Results

Predictions should differ depending on the user and item.

Actual Results

Every predicted rating equals the mean rating (2.6).

Versions

Windows-10-10.0.22621-SP0
Python 3.8.13 (default, Mar 28 2022, 06:59:08) [MSC v.1916 64 bit (AMD64)]
surprise 1.1.3

YiranZhang1014 commented 1 year ago

I have found out more about this issue. The problem is caused by the type of the parameters passed to predict. When I imported the data from files, I needed to use str, for example:

model.predict(str(1), str(2))

However, when I imported the data from a DataFrame, I needed to use int:

model.predict(1, 2)

My guess is that when the program reads a file, it treats all attributes as str, whereas reading a DataFrame keeps the attributes' original types.
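
To illustrate that guess, here is a minimal standard-library sketch. It is not Surprise's actual code. The sample text, `raw_user_ids_from_file` and `build_id_map` are hypothetical names made up for this example. It only shows why an id parsed from text and the same id kept in memory do not match as lookup keys.

```python
import csv
import io

# Hypothetical ratings in "user item rating" order, as they might appear in a file.
FILE_TEXT = "9 1 3\n32 1 2\n2 1 4\n45 2 3\n"


def raw_user_ids_from_file(text):
    """Parse space-separated lines; every field comes back as a str."""
    return [row[0] for row in csv.reader(io.StringIO(text), delimiter=" ")]


def build_id_map(raw_ids):
    """Toy mapping from each distinct raw id to a consecutive integer index."""
    id_map = {}
    for raw in raw_ids:
        id_map.setdefault(raw, len(id_map))
    return id_map


file_map = build_id_map(raw_user_ids_from_file(FILE_TEXT))  # keys are str
memory_map = build_id_map([9, 32, 2, 45])                   # keys are int

print("9" in file_map, 9 in file_map)      # True False
print("9" in memory_map, 9 in memory_map)  # False True
```

In Surprise itself, `trainset.to_inner_uid(raw_id)` raises a `ValueError` when the raw id is not in the trainset. That gives a quick way to check whether an id of a given type is known before calling `predict`.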

NicolasHug commented 1 year ago

when the program reads a file, it treats all attributes as str, whereas reading a DataFrame keeps the attributes' original types

You are correct @Alaskyed. There are more details in https://surprise.readthedocs.io/en/stable/FAQ.html#what-are-raw-and-inner-ids