Texera / texera

Collaborative Machine-Learning-Centric Data Analytics Using Workflows
https://texera.github.io
Apache License 2.0
163 stars, 72 forks

Movie Recommender System #1291

drewli815 closed this issue 3 years ago

drewli815 commented 3 years ago

Using the production server, we are conducting a data science project with the goal of building a working movie recommendation system.

Make0913 commented 3 years ago

The objective of our system is to predict the genres of movies whose exact genres are unknown, using other features of those movies such as posters or overviews. We then recommend the movies with predicted genres to different users according to the user preference table.

To build our system, our collaborative project consists of three main steps.

  1. Collect the movie image dataset and the text dataset separately, build the user preference table by analyzing the relationships between the tables in the text dataset, and save the user preference table.
  2. Build two models separately so that prediction can run in parallel. Andrew builds the text classification model and trains it on the text dataset; Make builds the image classification model and trains it on the image dataset. After training, we each save our trained model to disk.
  3. Load the trained models and use each model to predict genres for the test movies. Then load the user preference table from step 1, look up each user's preferences, and recommend the predicted movies to the users who like those genres. This gives each of us a separate set of recommendation results. Finally, we combine the two result sets with a hash join to get the comprehensive results.
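
The recommend-then-combine part of step 3 can be sketched roughly as follows. This is a minimal illustration in plain Python, not the actual Texera workflow: the table layouts, the sample rows, the join key `(user, movie)`, and the helper names `recommend` and `hash_join` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical user preference table from step 1: user id -> genres the user likes.
preferences = {
    "u1": {"Comedy", "Drama"},
    "u2": {"Horror"},
}

# Hypothetical genre predictions from the two models in step 2: movie id -> genre.
text_predictions = {"m1": "Comedy", "m2": "Horror", "m3": "Drama"}
image_predictions = {"m1": "Comedy", "m2": "Drama", "m3": "Drama"}


def recommend(predictions, preferences):
    """Return (user, movie, genre) rows where the predicted genre is one the user likes."""
    rows = []
    for movie, genre in sorted(predictions.items()):
        for user, liked in sorted(preferences.items()):
            if genre in liked:
                rows.append((user, movie, genre))
    return rows


def hash_join(left, right):
    """Inner hash join of two recommendation lists on the (user, movie) key (assumed key)."""
    # Build phase: index the left rows by join key.
    table = defaultdict(list)
    for user, movie, genre in left:
        table[(user, movie)].append(genre)
    # Probe phase: look up each right row and emit the matching pairs.
    joined = []
    for user, movie, genre in right:
        for left_genre in table.get((user, movie), []):
            joined.append((user, movie, left_genre, genre))
    return joined


text_recs = recommend(text_predictions, preferences)
image_recs = recommend(image_predictions, preferences)
combined = hash_join(text_recs, image_recs)
```

With an inner join, only the user and movie pairs recommended by both models survive; a different join type would be needed to keep recommendations made by just one model.
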
Make0913 commented 3 years ago

Step 1: Use the text dataset to build the user preference table and save the output file. URL: https://texera.ics.uci.edu/workflow/151
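
One way such a preference table could be derived is sketched below. The thread does not say which tables or columns the workflow joins, so the `ratings` and `movie_genres` tables, the `min_rating` threshold, and the function names `build_preference_table` and `to_csv` are assumptions made purely for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input tables; the real dataset's schema is not given in this thread.
ratings = [  # (user_id, movie_id, rating)
    ("u1", "m1", 4.5),
    ("u1", "m2", 2.0),
    ("u2", "m2", 5.0),
    ("u2", "m3", 4.0),
]
movie_genres = {"m1": ["Comedy"], "m2": ["Horror", "Thriller"], "m3": ["Horror"]}


def build_preference_table(ratings, movie_genres, min_rating=4.0):
    """Map each user to the sorted genres of the movies they rated >= min_rating."""
    liked = defaultdict(set)
    for user, movie, rating in ratings:
        if rating >= min_rating:
            liked[user].update(movie_genres.get(movie, []))
    return {user: sorted(genres) for user, genres in sorted(liked.items())}


def to_csv(table):
    """Serialize the table as CSV text so it can be saved and reloaded in a later step."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["user_id", "genres"])
    for user, genres in table.items():
        writer.writerow([user, "|".join(genres)])
    return buffer.getvalue()


preference_table = build_preference_table(ratings, movie_genres)
csv_text = to_csv(preference_table)
```
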

Screenshot 2021-09-03 9.44.51 PM

Make0913 commented 3 years ago

Step 2 (Make): Build a CNN model for image classification in a PyTorch environment and save the trained model. URL: https://texera.ics.uci.edu/workflow/176

Screenshot 2021-09-03 10.59.09 PM

Make0913 commented 3 years ago

Step 3: Predict genres for the test movies with the trained models, recommend the movies to different users, then combine our results. URL: https://texera.ics.uci.edu/workflow/164

Screenshot 2021-09-03 11.06.08 PM

drewli815 commented 3 years ago

Step 2: Text Classification URL: https://texera.ics.uci.edu/workflow/163

For my text classification model, I focused on the overview feature, which contains a brief description of each movie. I first performed some aggregation and combined the relevant dataframes into one. Using a Python UDF operator, I performed some basic text cleaning to make the overview column better suited to the machine learning model. Next, I wrote a script for the genre classification using a logistic regression model. For my output table, I displayed the F1 score and precision of my model. One limitation I faced was transferring the model into another workflow, which we worked around by saving the model to disk.
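
For illustration, the basic text cleaning and the reported metrics might look like the sketch below. The exact cleaning rules and the model code are not shown in this comment, so the rules in `clean_overview` and the sample labels are assumptions; the precision and F1 formulas are the standard definitions for a single genre treated as the positive class.

```python
import re


def clean_overview(text):
    """Lowercase, drop non-letter characters, and collapse whitespace (assumed rules)."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def precision_and_f1(y_true, y_pred, positive):
    """Compute precision and F1 for one genre treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, f1


# Hypothetical overview text and labels, only to exercise the helpers.
cleaned = clean_overview("A retired hit-man (2014) returns... for ONE last job!")
y_true = ["Drama", "Comedy", "Drama", "Comedy"]
y_pred = ["Drama", "Drama", "Drama", "Comedy"]
precision, f1 = precision_and_f1(y_true, y_pred, positive="Drama")
```
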

Yicong-Huang commented 3 years ago

Thanks for the awesome use case! We will archive this issue now. The workflows and data files are archived under the project in the texera account.