Leavingseason / OpenLearning4DeepRecsys

Some deep learning based recsys for open learning.

Could you please provide the tool for converting MovieLens data to a pickle dump file? #3

Closed: xray1111 closed this issue 7 years ago

xray1111 commented 7 years ago

I'd like to evaluate CCCFNet's performance on MovieLens-1M. It would save me a lot of time if you could upload the data conversion tool. Thanks a lot!

Leavingseason commented 7 years ago

Sure. The data was provided by my colleague; I will ask him when he comes in tomorrow.
I don't think it is complicated: it just reads the original MovieLens data with pandas and then writes the result to a .pkl file.
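For MovieLens-1M in particular, `ratings.dat` has no header row and uses `::` as the field separator (`UserID::MovieID::Rating::Timestamp`), so the read step differs slightly from reading a CSV. Below is a minimal sketch of the read, remap, and pickle idea. The function name `ratings_to_frame`, the column names, and the three made-up sample rows are assumptions for illustration only, not the tool used for this repo.

```python
import io
import pickle

import numpy as np
import pandas as pd


def ratings_to_frame(source):
    """Read ratings laid out as UserID::MovieID::Rating::Timestamp."""
    df = pd.read_csv(
        source,
        sep="::",
        engine="python",  # multi-character separators need the python engine
        header=None,
        names=["user", "item", "rate", "timestamp"],
    )
    # Remap raw IDs to contiguous 0-based indices, in order of first appearance.
    df["user"] = pd.factorize(df["user"])[0]
    df["item"] = pd.factorize(df["item"])[0]
    df["rate"] = df["rate"].astype(np.float32)
    return df


# Made-up rows in the ML-1M layout stand in for ratings.dat here.
sample = io.StringIO("1::10::5::1000\n1::20::3::1001\n2::10::4::1002\n")
ratings = ratings_to_frame(sample)

# Serialize and restore to confirm the frame survives a pickle round trip.
payload = pickle.dumps(ratings, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)
```

With the real dataset, `source` would be the path to `ratings.dat`, and `pickle.dump` to an open binary file would replace `pickle.dumps`.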

xray1111 commented 7 years ago

@Leavingseason Thanks! That would be a great help.

Leavingseason commented 7 years ago

`import time
import numpy as np
from six import next
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import scipy
import pickle
import _pickle as cPickle
import codecs


def get_100k_data():
    # Input paths are specific to the original environment; point them at
    # your own copy of ml-latest-100k.
    df = pd.read_csv(r"\e$\Users\v-fuz\Dataset\FlatFile\Recommendation_Dataset\MovieLens\ml-latest-100k\ratings.csv"
                     , sep=',', engine='python')
    df["rating"] = df["rating"].astype(np.float32)

    # Map raw user and movie IDs to contiguous 0-based indices.
    user_mapping = {}
    movie_mapping = {}
    index = 0
    for x in list(df["userId"].unique()):
        user_mapping[x] = index
        index += 1
    index = 0
    for x in list(df["movieId"].unique()):
        movie_mapping[x] = index
        index += 1
    df["userId"] = df["userId"].map(user_mapping)
    df["movieId"] = df["movieId"].map(movie_mapping)
    #for col in ("userId", "movieId"):
    #    df[col] = df[col].astype(np.int32)

    # Item content: binary bag-of-words over each movie's genres.
    movies = pd.read_csv(r"\e$\Users\v-fuz\Dataset\FlatFile\Recommendation_Dataset\MovieLens\ml-latest-100k\movies.csv"
                     , sep=',', engine='python')
    movies["movieId"] = movies["movieId"].map(movie_mapping)
    movies = movies.set_index('movieId')
    movies["genres"] = movies["genres"].map(lambda x: x.replace('|', ' ').lower())
    #vectorizer = CountVectorizer(binary = True)
    #vectorizer = vectorizer.fit(list(movies["genres"]))
    #movies["genres"]= movies["genres"].map(lambda x: vectorizer.transform([x]))
    movie_content = []
    index_set = set(movies.index)
    for i in range(len(movie_mapping)):
        if i in index_set:
            movie_content.append(movies.loc[[i]].iloc[0]["genres"])
        else:
            movie_content.append('')
    vectorizer = CountVectorizer(binary = True)
    movie_content = vectorizer.fit_transform(movie_content)
    movie_content = movie_content.astype(np.float32)

    # User content: binary bag-of-words over all tags each user has applied.
    users = pd.read_csv(r"\\e$\Users\v-fuz\Dataset\FlatFile\Recommendation_Dataset\MovieLens\ml-latest-100k\tags.csv"
                     , sep=',', engine='python')
    users["userId"] = users["userId"].map(user_mapping)
    users = users.set_index('userId')
    user_content = []
    index_set = set(users.index)
    for i in range(len(user_mapping)):
        if i in index_set:
            user_content.append(' '.join(list(users.loc[[i]]["tag"])))
        else:
            user_content.append('')
    # The same vectorizer object is refit here, this time on the tag vocabulary.
    user_content = vectorizer.fit_transform(user_content)
    user_content = user_content.astype(np.float32)

    #users = pd.DataFrame(users.groupby('userId')['tag'].agg(lambda x: ' '.join(x)).reset_index(name = "tags"))
    #vectorizer = CountVectorizer(binary = True)
    #vectorizer = vectorizer.fit(list(users["tags"]))
    #users["tags"]= users["tags"].map(lambda x: vectorizer.transform([x]))

    # Shuffle the ratings and split them 90% train / 10% test.
    df = df.rename(columns={"userId":"user", "movieId":"item", "rating":"rate"})
    rows = len(df)
    df = df.iloc[np.random.permutation(rows)].reset_index(drop=True)
    split_index = int(rows * 0.9)
    df_train = df[0:split_index]
    df_test = df[split_index:].reset_index(drop=True)

    with codecs.open('movielens_100k.pkl', 'wb') as outfile:
        pickle.dump((df_train,df_test,user_content,movie_content), outfile, pickle.HIGHEST_PROTOCOL)


if __name__ == '__main__':
    get_100k_data()
    print("Done!")`
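To use the resulting file, the four objects can be unpickled in the same order in which they were dumped. The sketch below assumes the tuple layout `(df_train, df_test, user_content, movie_content)` from the `pickle.dump` call in the script. `load_movielens_pickle` is a hypothetical helper name, and the tiny synthetic dataset exists only so the example runs without the real MovieLens files.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from scipy import sparse


def load_movielens_pickle(path):
    """Return (df_train, df_test, user_content, movie_content) from a pickle file."""
    with open(path, "rb") as infile:
        df_train, df_test, user_content, movie_content = pickle.load(infile)
    return df_train, df_test, user_content, movie_content


# Synthetic stand-in data with the same layout as the script's output.
demo_train = pd.DataFrame({
    "user": [0, 1],
    "item": [0, 1],
    "rate": np.array([4.0, 3.5], dtype=np.float32),
})
demo_test = pd.DataFrame({
    "user": [1],
    "item": [0],
    "rate": np.array([5.0], dtype=np.float32),
})
demo_user_content = sparse.csr_matrix(np.eye(2, 3, dtype=np.float32))
demo_movie_content = sparse.csr_matrix(np.eye(2, 4, dtype=np.float32))

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.pkl")
    with open(demo_path, "wb") as outfile:
        pickle.dump(
            (demo_train, demo_test, demo_user_content, demo_movie_content),
            outfile,
            pickle.HIGHEST_PROTOCOL,
        )
    train, test, user_content, movie_content = load_movielens_pickle(demo_path)

# The content matrices hold one row per user and one row per item.
print(train.shape, test.shape, user_content.shape, movie_content.shape)
```

Since unpickling can execute arbitrary code, this should only be done with files from a trusted source.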

xray1111 commented 7 years ago

Wow! Thanks a lot! @Leavingseason