NicolasHug / Surprise

A Python scikit for building and analyzing recommender systems
http://surpriselib.com
BSD 3-Clause "New" or "Revised" License

Raw IDs read from csv file are no longer strings?? #434

Techie5879 closed this issue 1 year ago

Techie5879 commented 1 year ago

I'm using the MovieLens small-latest dataset (https://grouplens.org/datasets/movielens/latest/), and reading the "ratings.csv" file into a pandas dataframe, then converting it into a Dataset, then making a trainset, and fitting the algorithm on it.

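import pandas as pd
from surprise import Dataset

# `reader` (a surprise Reader) and `final_algo` (the algorithm) are defined earlier.
df = pd.read_csv("ratings.csv").drop("timestamp", axis=1)
data = Dataset.load_from_df(df, reader)
data = data.build_full_trainset()
final_algo.fit(data)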

Now, inspecting data._raw2inner_id_users, I should get a dict whose keys are the raw user ids and whose values are the inner user ids. Instead I get:

{1: 0,
 2: 1,
 3: 2,
 4: 3,
 5: 4,
 6: 5,
 7: 6,
 8: 7,
 9: 8,
 10: 9,
 11: 10,
 12: 11,
 13: 12,
 14: 13, 
...
}

As can be seen, the keys are not strings. However, from the docs,

Raw ids are ids as defined in a rating file or in a pandas dataframe. They can be strings or numbers. Note though that if the ratings were read from a file which is the standard scenario, they are represented as strings.

But the raw ids aren't strings here? Why so?

NicolasHug commented 1 year ago

That's because ids are already integers in the dataframe:

print(df.dtypes)

userId       int64
movieId      int64
rating     float64
dtype: object

Surprise will use the same types as pandas here
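
If string raw ids are wanted, one option is to make the id columns strings on the pandas side before handing the dataframe over. The following is a minimal sketch, not taken from the Surprise docs: it uses a tiny in-memory stand-in for ratings.csv (the rows are hypothetical) and only shows the pandas part.

```python
import io

import pandas as pd

# Tiny stand-in for ratings.csv (hypothetical rows, same column names as MovieLens).
csv_text = (
    "userId,movieId,rating,timestamp\n"
    "1,31,2.5,1260759144\n"
    "2,10,4.0,835355493\n"
)

# Default read: pandas infers integer dtypes for the id columns.
df_int = pd.read_csv(io.StringIO(csv_text)).drop("timestamp", axis=1)

# Option 1: ask pandas to parse the id columns as strings while reading.
df_str = pd.read_csv(
    io.StringIO(csv_text), dtype={"userId": str, "movieId": str}
).drop("timestamp", axis=1)

# Option 2: cast the id columns after loading.
df_cast = df_int.astype({"userId": str, "movieId": str})

raw_user_ids_int = df_int["userId"].tolist()  # Python ints
raw_user_ids_str = df_str["userId"].tolist()  # Python strings
```

Passing df_str or df_cast to Dataset.load_from_df in place of df should then give string keys in the raw-to-inner id mappings, since the ids keep whatever type they have in the dataframe.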

Techie5879 commented 1 year ago

Surprise will use the same types as pandas here

Thanks. It would be helpful if that were in the documentation, though. I think the documentation says that raw ids are strings if the Dataset is loaded from an external file such as a csv.