NicolasHug / Surprise

A Python scikit for building and analyzing recommender systems
http://surpriselib.com
BSD 3-Clause "New" or "Revised" License

Cross-validation kNN wrong results on custom dataset #462

Open julx134 opened 1 year ago

julx134 commented 1 year ago

Description

I am working on a capstone project that fits an item-based kNN model to a custom Amazon appliance 100K dataset. I want the cross-validation metrics for this dataset, but the results I get are wildly incorrect. To rule out a mistake in my code, I ran the built-in MovieLens 100k dataset through the same function, and it returned valid results.

I've attached both datasets for your reference: amazon_appliance_100k.csv and ml_100k.csv.
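Since the same function behaves differently on the two files, it may help to compare their basic shape first. The following is only a minimal sketch, assuming the rows have already been parsed into (user, item, rating) tuples; the helper name `dataset_summary` and the toy rows are hypothetical and are not taken from either attached file.

```python
from collections import Counter
from statistics import median


def dataset_summary(rows):
    """Summarize a list of (user_id, item_id, rating) tuples."""
    rows = list(rows)
    ratings_per_user = Counter(user for user, _, _ in rows)
    ratings_per_item = Counter(item for _, item, _ in rows)
    n_cells = len(ratings_per_user) * len(ratings_per_item)
    return {
        "n_ratings": len(rows),
        "n_users": len(ratings_per_user),
        "n_items": len(ratings_per_item),
        # fraction of the user x item matrix that actually holds a rating
        "density": len(rows) / n_cells if n_cells else 0.0,
        "median_ratings_per_item": (
            median(ratings_per_item.values()) if ratings_per_item else 0
        ),
    }


# Hypothetical toy rows, used only to show the call
toy = [("u1", "i1", 5.0), ("u1", "i2", 3.0), ("u2", "i1", 4.0)]
summary = dataset_summary(toy)
print(summary)
```

Running this on both datasets would show whether they differ sharply in density or in ratings per item, which is one plausible (but unconfirmed) reason for an item-based neighbourhood model to behave differently.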

Steps/Code to Reproduce

Here is the code I use to load and cross-validate a custom dataset on Google Colab:

import csv
import os

import pandas as pd
from surprise import Dataset, KNNBasic, Reader
from surprise.model_selection import cross_validate


def trainCustomDataset(path, num_folds):
  # path to custom dataset
  file_path = os.path.expanduser(path)

  # convert csv to dictionary (columns 0, 2 and 4 hold user, item and rating)
  rating_dict = {'user_id': [], 'item_id': [], 'rating': []}
  with open(file_path, 'r', newline='') as dataset:
      for line in csv.reader(dataset):
          rating_dict['user_id'].append(line[0])
          rating_dict['item_id'].append(line[2])
          rating_dict['rating'].append(line[4])

  # convert dictionary to dataframe
  rating_df = pd.DataFrame.from_dict(rating_dict)

  # csv.reader yields strings, so cast ratings to float before averaging
  # (assumes the file has no header row)
  rating_df['rating'] = rating_df['rating'].astype(float)

  # group duplicate (user, item) pairs into one mean rating
  rating_df = rating_df.groupby(['user_id', 'item_id']).agg({'rating': 'mean'}).reset_index()

  # define surprise reader object
  reader = Reader(rating_scale=(1, 5))

  # convert dataframe into surprise dataset object
  data = Dataset.load_from_df(rating_df[['user_id', 'item_id', 'rating']], reader)

  # We'll use the item-based collaborative filtering algorithm
  sim_options = {
      "name": "cosine",
      "user_based": False,  # compute similarities between items
  }
  # define IBCFRS
  # no explicit fit() call is needed: cross_validate fits algo on each training fold
  algo = KNNBasic(sim_options=sim_options)

  # Run num_folds-fold cross-validation and print results
  print(cross_validate(algo, data, measures=["RMSE", "MAE"], cv=num_folds, verbose=True))
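The duplicate-averaging step in the function above can be checked in isolation. The sketch below is illustrative only: the helper name `average_duplicate_ratings` and the sample rows are hypothetical. Because `csv.reader` returns every field as a string, the sketch converts ratings to numbers before taking the mean, so that any non-numeric value raises an error instead of passing through silently.

```python
import pandas as pd


def average_duplicate_ratings(rows):
    """Collapse repeated (user_id, item_id) pairs into one mean rating.

    rows: iterable of (user_id, item_id, rating); rating may be a string.
    """
    df = pd.DataFrame(list(rows), columns=["user_id", "item_id", "rating"])
    # pd.to_numeric raises ValueError on values that are not numbers
    df["rating"] = pd.to_numeric(df["rating"])
    return df.groupby(["user_id", "item_id"], as_index=False)["rating"].mean()


# Hypothetical sample: user u1 rated item i1 twice
sample = [("u1", "i1", "5"), ("u1", "i1", "3"), ("u2", "i1", "4")]
deduped = average_duplicate_ratings(sample)
print(deduped)
```

Comparing the `rating` column's dtype and value range before and after this step is a cheap way to confirm that the data handed to `Dataset.load_from_df` actually lies within the declared `rating_scale`.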

Expected Results

I expected results similar to these: ML_100k_results

Actual Results

Here are my actual results: amazon_100k_result

Versions

Linux-5.10.147+-x86_64-with-glibc2.29
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0]
surprise 1.1.3