Amazon Reviews - Githubissues

alfaraday commented 6 years ago

Looks good @dtherrick! Nothing stands out as needing to be changed in the example, but we'll need to be thoughtful about the way we turn this walkthrough into a challenge.

We've got the main ask at the top, which is to use Spark to determine if we can predict whether a review is positive or negative based on the language in the review.

- How much information do you think we'll need to include to make that a reasonable ask?
- What specific information should they include in their work?
- How are they going about this work? Is it going to be done in a Jupyter notebook? Will they need to use Docker somehow? (And if so, I think they'll need more guidance on setting that up.)

Then at the bottom, there are some extension tasks:

Here's where you can go from here:

- Think about resampling the overall dataset to better balance positive and negative reviews.
- Use a different method to tokenize and convert the text to numeric (TF-IDF, etc.).
- Adjust the parameters of your classifier.

Do you want to incorporate those tasks into the challenge itself, or will they just live in the solution?
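To make the first extension task concrete, here is a minimal plain-Python sketch of resampling by downsampling the larger class. It is only an illustration, not the notebook's method: the function name `downsample_to_balance`, the `label` key, and the toy reviews are all hypothetical. In Spark the same idea would more likely be expressed with something like `DataFrame.sampleBy`.

```python
import random
from collections import defaultdict


def downsample_to_balance(rows, label_key="label", seed=42):
    """Return a shuffled list in which every class has as many rows as the rarest class."""
    # Group the rows by their class label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    # The rarest class decides how many rows we keep per class.
    smallest = min(len(group) for group in by_label.values())

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 8 positive reviews (label 1) and 2 negative (label 0).
reviews = [{"text": f"great product {i}", "label": 1} for i in range(8)]
reviews += [{"text": f"broke after a day {i}", "label": 0} for i in range(2)]

balanced = downsample_to_balance(reviews)
print(len(balanced), sum(r["label"] for r in balanced))
```

Downsampling throws data away, so on a small dataset oversampling the minority class or weighting classes in the classifier may be a better fit; which one works best here is an open question for the student.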

dtherrick commented 6 years ago

@alfaraday I think I understand where you're going - we need a second document that is the exercise itself. Essentially - this notebook is the solution to the problem.

Addressing your three bullet points above:

- Information to include to reasonably answer the question: the dataset has what we need, plus we can provide some tips on how to approach the problem.
- The solution should be in a Jupyter notebook so a mentor can review the steps the student took to get to a solution. They should explain why they chose a particular encoder for the reviews column, why they chose a particular classifier, and then evaluate the results of their model. If it doesn't perform well, they should try to answer why it didn't.
- I would expect them to use a setup similar to how I've created the notebooks so far: a Docker container hosting Spark and Jupyter, and a repository with the actual notebooks that they can push to GitHub:
   1. Assuming they've set up Docker and have a notebooks repository on their machine (we may need to spend some time walking them through setting up that repo), they should use the docker run command to get a container going on their local machine. I use the following command: docker run -d --rm -p 8888:8888 -v /Users/damian/Documents/Code/ds-notebooks:/home/ds/notebooks thinkfulstudent/pyspark:2.2.1 to pull and run the thinkfulstudent pyspark image locally. The ds-notebooks repo sits on my machine and contains all my Jupyter notebooks.
   2. The student would then solve the exercise in a Jupyter notebook and push it to GitHub (or whatever site they use). It's easy enough for a mentor to clone that repo, fire up the same container, and run the code to review it with the student.
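Since the host path in that command is specific to my machine, a student or mentor would need to swap in their own. A small sketch of how the same command could be assembled for any local notebooks folder is below. The helper name `docker_run_args` is hypothetical; the flags, port, mount point, and image tag are copied from the command above.

```python
import os
import shlex

IMAGE = "thinkfulstudent/pyspark:2.2.1"  # image tag from the command above
CONTAINER_DIR = "/home/ds/notebooks"     # mount point from the command above


def docker_run_args(notebooks_dir, port=8888):
    """Build the docker run argument list for a given local notebooks folder."""
    host_dir = os.path.expanduser(notebooks_dir)  # allow paths such as ~/ds-notebooks
    return [
        "docker", "run",
        "-d",                                # run detached, in the background
        "--rm",                              # remove the container when it stops
        "-p", f"{port}:{port}",              # expose Jupyter on the same host port
        "-v", f"{host_dir}:{CONTAINER_DIR}",  # mount the notebooks repo into the container
        IMAGE,
    ]


args = docker_run_args("/Users/damian/Documents/Code/ds-notebooks")
print(shlex.join(args))
# To actually start the container (requires Docker to be installed):
#   import subprocess; subprocess.run(args, check=True)
```

Building the command as a list rather than a single string avoids quoting problems if the notebooks path contains spaces.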

As far as the extension tasks, I included those to spur students to think more like a professional data scientist. In other words, this solution should be a first-pass result. The extension tasks are there to help them think about how they would move on to the next steps to generate a production model.
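For the TF-IDF extension task, a rough plain-Python sketch of the weighting is below, in case it helps when writing hints. It is not what the notebook does: the `tokenize` and `tf_idf` helpers and the sample reviews are hypothetical, the tokenizer is deliberately naive, and the smoothed IDF form log((N + 1) / (df + 1)) is the one Spark MLlib documents for its IDF transformer.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes (a naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document, using raw counts times smoothed IDF."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Smoothed inverse document frequency: log((N + 1) / (df + 1)).
    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # term frequency as a raw count
        vectors.append({term: count * idf[term] for term, count in counts.items()})
    return vectors


# Hypothetical toy reviews.
docs = ["great book, great story", "terrible book", "great value"]
vectors = tf_idf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, "->", {term: round(weight, 3) for term, weight in vec.items()})
```

Terms that show up in most reviews get a low weight, while rarer terms such as "terrible" get a higher one, which is the property that can help a classifier separate positive from negative reviews.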

This should be straightforward to pull from the Amazon Jupyter notebook - but I think we should discuss the best delivery approach for it. We can also talk through it on Slack if needed. Thanks!

alfaraday commented 6 years ago

@dtherrick Okay cool, this all sounds good! I'll take a stab at writing the challenge instructions and share it with you before we meet on Friday.

dtherrick / ds-notebooks

Amazon Reviews #1