Course recommendation engine using Apache Spark

ccqi commented 8 years ago

I have been thinking of this idea for a while, but never had the time to implement it until now.

Proposal

Since UW Flow is approaching 12000 users and growing, I think this a good time to setup a course recommendation engine.

After some research, I believe Apache Spark is a great tool to use for this purpose since it offers fast, parallelized cluster computing as well as a lot of rich tools and libraries for machine learning algorithms.

I also want to use Collaborative Filtering as the algorithm for the recommendation engine. The advantages of this algorithm is that it takes account of similar user's preferences to make recommendations.

Prototype Explanation

The prototype I made sends a list of triples (user_id, course_id, rating) to spark. The rating, for now, is our Liked It button. Spark uses the data to train a model using ALS (Alternating least squares). The model is trained to predict the course rating (0.0-1.0) of all courses not yet rated by the user. Then given a user and the trained model, it returns the top 5 highest courses predicted by the model as recommendations.

Running this PR

If you want to test this PR, you need to install and build spark (v1.4.0 or higher) and then set $SPARK_HOME to your spark install directory (uncomment some code in engine.py to set this up). Also install these python modules:

pip install numpy
pip install py4j

Then call

make recommendations user_id=<user_id>

user_id is the m.User.id property of each user
Next Steps

If we decide to go with this approach, here are some next steps and problems I think we need to solve:

Optimization In order to optimize the engine, we should run metrics against UW flow's real user data. (I am currently running the algorithm with some test data that I generated)

The most common step to evaluate learning algorithms is to split the available data into 3 sets: training (60%), validation (20%) and test (20%). Then we choose the optimal parameters based on the lowest error on the cross validation set. Finally, we measure the final error of the algorithm on the test set.

To do this, I need access to the entire or at least a subset of our user data.

Another possible optimization is to train the algorithm using more data (eg. shortlisted courses, easiness, usefulness)

Integration If we go with the approach, I suggest we integrate the engine with the application like this:

The model is trained every day with the new user data - run it with aggregator.py?
Whenever /user is loaded, front-end queries the recommender for the top "n" recommendations
Display the top "n" results to the user in the front-end, maybe we can reuse this design

Deployment We need to set-up spark for deployment.

The biggest advantage to using machine learning is that since our data is infinitely growing, our recommendations should also get much better over time.

JGulbronson commented 8 years ago

Hey Charles! Thanks for the PR, I'm going to give it a shot locally, and I'll comment with my thoughts.

JGulbronson commented 8 years ago

So I'm trying to test this locally, and am running in to an issue. (Note, I have production data on my machine, so it's actually training with real data)

I got spark by running brew install apache-spark, and am running your script with PYTHONPATH=.. pyspark data/engine.py userId where 'userId' is the id for my local profile.

However, it's throwing an error, saying ERROR:__main__:User has not rated any courses yet after it says the model has been trained. I've definitely rated courses, so any idea what's happening here?

I really love the idea, and it's great you put together a prototype PR so we can play around with it before putting up a formal one :)

Let me know when you get the chance, I want to get this working!

ccqi commented 8 years ago

Hey Jeff, thanks for taking the time to try out my PR!

As for the error you're getting, checkout my latest code, it will give you a more informative error message.

If you're still an error, reply back to me with the error message or message me on fb.

Can you also check that you rated "Liked It" for any courses? The spark library only gives recommendations to users that "Liked" or "Disliked" at least one course, and throws an error otherwise. I suspect this is what's happening with you.

JGulbronson commented 8 years ago

Alright, new error (and more informative!)

ERROR:__main__:'MatrixFactorizationModel' object has no attribute 'recommendProducts'

I double checked and I've liked courses, but this seems like a different error. Any thoughts? Would love to get this running, it'd be a great feature to have!

JGulbronson commented 8 years ago

I'm wondering if maybe it's a versioning issue? What version of Spark are you using?

JGulbronson commented 8 years ago

So it turns out it was a versioning issue, brew installed an old version of spark (1.3.1) that didn't have that method.

Now that that's figured out, I was able to run it successfully, and get recommendations! However, they were all courses I'd already taken, so I think we should show the top 5 that the user hasn't taken yet. I'm not quite sure what you meant in your original description by " Then given a user and the trained model, it returns the top 5 highest courses rated by the user as recommendations."

Does that mean the user has to rate a course to have it as a recommendation? Once we clear that up, then we can look at improving the algorithm and moving this from a prototype to a full-blown feature!

jlfwong commented 8 years ago

Amazing. So cool to see this kind of stuff come in as PRs.

Remember that you'll need to update https://github.com/UWFlow/rmc/blob/master/aws_setup/setup.sh to ensure that Spark gets installed on the server if you have to rebuild from scratch, as well as installing it for dev usage too: https://github.com/UWFlow/rmc/blob/master/linux-setup.sh and https://github.com/UWFlow/rmc/blob/master/mac-setup.sh

ccqi commented 8 years ago

@jlfwong glad to contribute :smile: @JGulbronson glad to see its working :smile:

I just updated my PR so that it will only recommend courses not already taken by the user, give it a try!

What I meant by

Then given a user and the trained model, it returns the top 5 highest courses rated by the user as recommendations.

is that the trained model will have the predicted ratings of all courses not taken by the user, and my script will return the top n highest predictions by the model. So if the model predicts for user x:

Course | Rating A | 0.21 B | 0.89 C | 1 (actual user rating) D | 0.78 E | 0.99 F | 0 (actual user rating)

and we want to recommend 3 courses, it will recommend E, B, D to the user.

For the case where the user liked no courses, the algorithm can't make a prediction since it does not have any information about the user. In this case, I think we should just recommend the most liked courses in our database to the user. This is not implemented in this PR yet.

divad12 commented 8 years ago

From my understanding, collaborative filtering's base case would recommend the most popular courses if there's no user-specific input, no?

Awesome stuff, btw. :)

On Sat, Dec 12, 2015 at 9:24 PM, Charles Qi notifications@github.com wrote:

@JGulbronson https://github.com/JGulbronson glad to see its working [image: :smile:]

I just updated my PR so that it will only recommend courses not already taken by the user, give it a try!

What I meant by

Then given a user and the trained model, it returns the top 5 highest courses rated by the user as recommendations.

is that the trained model will have the predicted ratings of all courses not taken by the user, and my script will return the top n highest predictions by the model. So if the model predicts for user x:

Course | Rating A | 0.21 B | 0.89 C | 1 (actual user rating) D | 0.78 E | 0.99 F | 0 (actual user rating)

and we want to recommend 3 courses, it will recommend E, B, D to the user.

For the case where the user liked no courses, the algorithm can't make a prediction since it does not have any information about the user. In this case, I think we should just recommend the most liked courses in our database to the user. This is not implemented in this PR yet.

— Reply to this email directly or view it on GitHub https://github.com/UWFlow/rmc/pull/254#issuecomment-164228166.

ccqi commented 8 years ago

@divad12 Yes, that's true. Unfortunately, the spark method I'm calling to make the recommendations throws an error whenever I input a user that has no ratings:

IN > model.recommendProducts(<user_id_with_no_ratings>, 5) OUT> ERROR:__main__:An error occurred while calling o35.recommendProducts. : java.util.NoSuchElementException: next on empty iterator

so we have to implement the base case ourselves.

ccqi commented 8 years ago

@JGulbronson I implemented the base case, so users that doesn't rate any courses will get recommended the highest rated courses with the most ratings. Give it a try when you have time, and let me know when we can start implementing this as a full blown feature.

JGulbronson commented 8 years ago

Awesome! I'm just finishing up work this week, but I should be able to pull and try it locally either tomorrow night or Friday at the latest.

In terms of implementing this as a full blown feature, I think there are a couple things that we'd need to do (in no particular order)

Train spark with real data

It'd be great to train the algorithm using real user data from Flow. I'm looking in to providing you with user data, with Personally Identifiable Information (PII) removed, such as names, emails, password hashes, etc. The courses taken, program, as well as friend graph would thus remain intact, which I think should be sufficient as a dataset for training, while also respecting the privacy of our users. @divad12 @jlfwong @jswu @mduan I'd like your input on this one, to make sure I'm handling things appropriately.

Front-end work

There needs to be some front-end component to this, so that users can actually see what we're recommending! At this point, it could be as simple as a view saying "Flow recommends these courses!" and then has links to 3-5 courses. The idea is we build it, get it out there, and then iterate on it. I'd be more than happy to take this on, as it shouldn't take me very long to complete, and we don't have to hold up the deploy. Which brings me to

Installing Spark on our server

This is something I'll have to do, and set up cron jobs for running the script. Should be fairly straightforward, I'll just have to make sure it doesn't drastically affect the server's performance while the cron job is running. I can look in to installing it on the server over the break. Note, you'd also want to add spark as a dependency to the project, though perhaps an optional one as it's not strictly necessary for getting everything running locally.

Clean up the code

I'll go through and give your existing code a review in the next couple days. Seems the library is pretty powerful, so there's not a lot of code (which is great), so we'll just make sure we have it nice and clean, with some good error reporting incase we encounter an edge case in production.

Code without tests is broken code

I really want all new code to have associated tests. It'd be great to have some here, perhaps some unit tests, as well as system tests if we're feeling particularly adventurous. We should have some examples already in the code base, but we can always talk about what we want to test in the Hipchat room.

Deploy and promote

Once we finish all the above stuff, I think it'll be good to merge/deploy. It'd be great to have it out at the start of next term, as some people might still be looking to add a new course in that first week of classes. That said, we certainly don't want to rush it, and make sure our first iteration is a functional feature that is of use to our users. Once it is deployed, then one of us will write a blog post, and I'll put it on all our social media accounts.

Alright, that was a lot of stuff. If you have any questions, let me know! Like I said, I'll look at getting you anonymized data, and giving you a code review to start. From there, we'll kinda check things off as they get done. Really liking the potential on this one, and I think after being trained with real user data, it'll be a great feature!

ccqi commented 8 years ago

@JGulbronson any chance to take a look at my PR?

ccqi commented 8 years ago

@JGulbronson Alright, I addressed your comments and make a couple of changes to integrate with the flow backend.

make train_engine will train the model and write user.json, course.json and trained_model to data/recommendations folder.
When the server starts up, it will load the data that were saved to data/recommendations folder into memory. Unfortunately, this means that our server will not run if we did not already train the model, as the files wouldn't exist. In this state, we wouldn't be able to make recommendations. What are you opinions on how to error handle this case? This also means every time we retrain the model, we need to reload the server to reload the updated model.
I added an api endpoint: /api/v1/user/<user_id>/recommendations. This will return the top n recommendations for the user and could be used by the frontend.

You can use curl to test this, although you need to authenticate the user first - you could uncomment the authenticate user code, although I am not sure if that will be ethical.
I moved all the spark and pythonpath dependencies to shell scripts instead of python.

Tell me what you think of the changes.

For next steps, I could look into writing some tests. I wrote some python nosetests for my last co-op, so they shouldn't take too long for me to write.

Also, are you able to provide me with real user data (without PII) anytime soon?

JGulbronson commented 8 years ago

So, I think we should run this as a cron job around midnight, and update/save the courses to each person's profile. This way, the API call won't cause issues, and we choose when Spark is taking up the processor. Does that make sense? Code wise, it's looking good, I'll get you that info without PII tomorrow. Sorry for the delay, it's been a busy week.

I think if you can make it so that it will (efficiently) run recommend_course for every User, that'd be great, and hopefully it'd only have to generate the matrix once. Does that make sense?

Again, sorry for the delay, you'll have the data tomorrow!

ccqi commented 8 years ago

@JGulbronson I agree, that's a much less riskier way of implementing recommendations.

I also made a prototype design for the front-end. I copied the general "courses" style (slightly smaller font) with the "add to shortlist" button on the right. @JGulbronson @jlfwong @divad12 @jswu @mduan Any feedback, comments, suggestions for improvements before I implement this?

divad12 commented 8 years ago

Cool!

What do you think about putting that beneath the course schedule?

On Sat, Jan 9, 2016 at 11:23 AM, Charles Qi notifications@github.com wrote:

I made a prototype design for the front-end. I copied the general "courses" style (slightly smaller font) with the "add to shortlist" button on the right. [image: image] https://cloud.githubusercontent.com/assets/6456601/12217854/ecda5b2c-b6db-11e5-863f-8c3afa2af253.png @JGulbronson https://github.com/JGulbronson @jlfwong https://github.com/jlfwong @divad12 https://github.com/divad12 @jswu https://github.com/jswu @mduan https://github.com/mduan Any feedback, comments, suggestions for improvements before I implement this?

— Reply to this email directly or view it on GitHub https://github.com/UWFlow/rmc/pull/254#issuecomment-170272794.

ccqi commented 8 years ago

@JGulbronson @divad12 I made a separate pull request for the front end, check it out!

ccqi commented 8 years ago

@JGulbronson Here are some updates I made recently and I believe this pull request is complete and ready to be deployed provided it passes final code review and testing.

I added a new field to the user model: recommended_courses and saved the recommendations to all users in the database with my script make train_engine. The script only takes 5-10 minutes on my local machine, so I expect the speed will be even better on the server.
I optimized some machine learning parameters with the real data.
I added some metric reporting for the engine, each time my script runs, it will write to a timestamped json file in "data/recommendations/metrics/." The file will contain the number of ratings and the training and test errors. If my model is effective, the error is expected to decrease with more ratings coming in. This is also useful information for anyone who wants to optimize the learning model in the future.
Finally, I added spark setup for linux development. My script should install spark (v1.6.0) on the root directory and setup some environment variables

As @jlfwong said, for deployment, we still need to set up spark for mac and aws.

For production, spark recommends we run the script through bin/spark_submit. Here are some documentation. For my script, this will be: PYTHONPATH=..:${SPARK_HOME}/python ${SPARK_HOME}/bin/spark-submit data/engine.py. I think we should we should run this as a cronjob every day/week. Also see the above documentation for additional parameters we might want to consider.

ccqi commented 8 years ago

@JGulbronson any time to take a look at this? Would like to get this merged before the term gets too busy (midterms and interviews)

JGulbronson commented 8 years ago

Going to take a look this weekend. Sorry for the wait!

JGulbronson commented 8 years ago

Would you be able to squash some commits together, and give me an idea of what dependencies this has? Then I'll look at getting them on the server.

ccqi commented 8 years ago

Ok I squashed all my commits.

Dependencies are: numpy==1.10.1 py4j==0.9 spark 1.6.0 with hadoop v2.6

Look at my linux-setup.sh changes to see how I installed spark for local development.

JGulbronson commented 8 years ago

Tests broke... it looks like some dependency changed?

ccqi commented 8 years ago

@JGulbronson Sorry tests were outdated (based on the old implementation where the engine was part of the API), they should be fixed now :)

jlfwong commented 8 years ago

@ccqi @JGulbronson The tests are failing with this message:

No module named pyspark

Which is unsurprising, because it relies upon dependencies that are not installed in the Docker image https://hub.docker.com/r/jgulbronson/uwflow/.

It seems like the right fix is to update the Docker image to have the missing dependencies (both Spark and pyspark), and push a new version of the Docker image.

UWFlow / rmc