Amazon_Fine_Food_Review

Amazon.com, Inc. is an American multinational technology company based in Seattle, Washington, which focuses on e-commerce, cloud computing, digital streaming, and artificial intelligence. The fine food data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plaintext review. The dataset belongs to the Stanford Network Analysis Project (SNAP).


1. What is the target of this project?

a. Business Acumen:

After understanding the Amazon Fine Food review dataset, as a data scientist, I have a few questions that set up an outline and help me dive into the project.

What is the connection between the review score, the reviews, and the products?

Is there any correlation between products and the top users who often write reviews?

Can I extract the top products based on users' recommendations?

What are the top words that help the business understand whether a review is good or not?

Can I predict positive and negative reviews?

To build a better predictor, should I choose machine learning algorithms or a deep learning model?

Those questions help me separate the dataset into two parts:

One is the correlation of userID, productID, and review score, which leads to a business solution: recommending food products

Another is the correlation of plaintext reviews with sentiment, handled through sentiment analysis

b. Target:

Analyzing the top review, top product, top user for fine food
Applying Sentiment Analysis to analyze the plaintext review

Words in Positive Reviews

Words in Negative Reviews

2. My solution

a. Create a Recommendation system based on Sparse Matrix for fine food

EDA based on UserId, ProductId, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time
Utilizing Descriptive Analysis
Applying Sparse Matrix to define the recommended selection
Evaluating my recommendation system with MSE
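
The sparse matrix step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the rows in `reviews` are made up, and only the column names UserId, ProductId, and Score come from the dataset.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Made-up sample rows; only the column names match the dataset.
reviews = pd.DataFrame({
    "UserId":    ["u1", "u1", "u2", "u3", "u3"],
    "ProductId": ["p1", "p2", "p1", "p2", "p3"],
    "Score":     [5, 3, 4, 2, 5],
})

# Map string ids to consecutive integer row/column positions.
user_codes = reviews["UserId"].astype("category").cat.codes.to_numpy()
item_codes = reviews["ProductId"].astype("category").cat.codes.to_numpy()

# Rows are users, columns are products, values are review scores.
ratings = csr_matrix(
    (reviews["Score"].to_numpy(dtype=float), (user_codes, item_codes)),
    shape=(int(user_codes.max()) + 1, int(item_codes.max()) + 1),
)

# Share of user-product pairs that actually have a rating.
density = ratings.nnz / (ratings.shape[0] * ratings.shape[1])
```

On the full dataset most users review only a handful of products, so the density is expected to be very low, which is why a sparse format is a natural fit.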

b. Sentiment Analysis

EDA / Text cleaning
Applying BoW, Word2Vec, and TF-IDF as feature engineering to vectorize the text
Applying machine learning models (Bernoulli Naive Bayes, Logistic Regression) to predict the sentiment
Evaluating model based on log loss and accuracy scores
Selecting top words by applying the K-means clustering method and plotting them in 2D and 3D with Plotly and t-SNE dimensionality reduction
Utilizing text preparation with Keras and applying deep learning (ANN and RNN-LSTM)
Evaluating model based on log loss and accuracy score
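
The vectorize, fit, and evaluate steps listed above can be sketched with scikit-learn as follows. This is a hedged illustration only: the review texts and labels are made up, the models use default settings rather than the project's tuned ones, and Word2Vec is left out because it needs an extra library.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Made-up reviews; 1 = positive, 0 = negative.
train_texts = [
    "great coffee love the bold flavor",
    "delicious tea great taste",
    "love this product tastes great",
    "terrible taste would not buy again",
    "stale and bland awful product",
    "bad flavor very disappointing",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["great flavor love it", "awful stale taste"]
test_labels = [1, 0]

# Try every vectorizer/classifier pair and record both metrics.
results = {}
for vec_name, make_vec in [("bow", CountVectorizer), ("tfidf", TfidfVectorizer)]:
    for clf_name, make_clf in [("nb", BernoulliNB), ("lr", LogisticRegression)]:
        model = make_pipeline(make_vec(), make_clf())
        model.fit(train_texts, train_labels)
        proba = model.predict_proba(test_texts)
        results[(vec_name, clf_name)] = {
            "accuracy": accuracy_score(test_labels, model.predict(test_texts)),
            "log_loss": log_loss(test_labels, proba, labels=[0, 1]),
        }
```

Accuracy only checks the predicted label, while log loss also penalizes over-confident wrong probabilities, so the two metrics can rank models differently.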

3. Outcome

a. Recommendation system with a sparse matrix (scipy.sparse.linalg.svds)

a.1. Building Popularity Recommender system


Since this is a popularity-based recommender model, recommendations remain the same for all users
We predict the products based on popularity. The result is not personalized to a particular user
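
A popularity-based recommender of this kind can be sketched in a few lines of pandas. The data below is made up, `top_popular` is a hypothetical helper name, and breaking ties by mean score is an assumption of this sketch rather than something the project states.

```python
import pandas as pd

# Made-up sample rows; only the column names match the dataset.
reviews = pd.DataFrame({
    "UserId":    ["u1", "u2", "u3", "u1", "u2", "u3", "u4"],
    "ProductId": ["p1", "p1", "p1", "p2", "p2", "p3", "p3"],
    "Score":     [5, 4, 5, 3, 4, 5, 3],
})

def top_popular(df, n=2):
    """Rank products by number of reviews, breaking ties by mean score."""
    stats = (
        df.groupby("ProductId")["Score"]
          .agg(review_count="count", mean_score="mean")
          .sort_values(["review_count", "mean_score"], ascending=False)
    )
    return stats.head(n).index.tolist()

# The same list is returned no matter which user asks.
recommendations = top_popular(reviews)
```

Because the function never looks at the requesting user, every user gets the identical list, which is the limitation described above.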

a.2. Building Collaborative Filtering

Model-based collaborative filtering is a personalised recommender system. The recommendations are based on the user's past behavior and do not depend on any additional information.


Comparing the actual values with the predicted values shows that the recommendation system predicts well
The popularity-based recommender system is non-personalised: its recommendations are based on frequency counts, which may not suit the user. Compare the results for user ids 70 and 100: the popularity-based model recommended the same set of 5 or 6 products to both, while the collaborative filtering model recommended an entirely different list for each, based on the user's past purchase history
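
A minimal sketch of model-based collaborative filtering with scipy.sparse.linalg.svds, followed by an MSE check, might look like this. The rating matrix is made up, treating 0 as "no review" and using k=2 latent factors are assumptions of the sketch, and `recommend` is a hypothetical helper.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Made-up 4-user x 5-product rating matrix; 0 means "no review".
ratings = np.array([
    [5, 4, 0, 0, 1],
    [4, 5, 0, 1, 0],
    [0, 1, 5, 4, 0],
    [1, 0, 4, 5, 0],
], dtype=float)

# Truncated SVD keeps the k strongest latent factors (k must be < min(shape)).
u, s, vt = svds(csr_matrix(ratings), k=2)
predicted = u @ np.diag(s) @ vt

# MSE between actual and predicted scores, over the observed entries only.
observed = ratings > 0
mse = float(np.mean((ratings[observed] - predicted[observed]) ** 2))

def recommend(user_index, n=2):
    """Return the n unrated product columns with the highest predicted score."""
    scores = np.where(observed[user_index], -np.inf, predicted[user_index])
    return np.argsort(scores)[::-1][:n].tolist()
```

Unlike the popularity model, `recommend(0)` and `recommend(2)` can return different products, because each user's row of predicted scores reflects that user's own past ratings.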

b. Sentiment Analysis

b.1 Machine Learning Algorithms

Accuracy
Log Loss
TF-IDF is the best feature representation, giving the highest accuracy score for both BernoulliNB and Logistic Regression


Logistic Regression is the best model for this dataset because it gives the highest accuracy score with the lowest log loss
After tuning the model, the best parameter set for Logistic Regression is lr__C: 100.0, lr__penalty: 'none', lr__solver: 'saga'

b.2 Clustering the top words with K-means

The best number of clusters is 3, with the highest Silhouette score of 0.002
Top words are ['strong', 'br', 'cup coffee', 'tea', 'taste', 'like', 'cups', 'flavor', 'coffee', 'love', 'one', 'food', 'good', 'cup', 'great', 'bold', 'br br', 'product']


b.3 Deep Learning

Of the two models developed, ANN and RNN-LSTM, the LSTM performs best, with an accuracy score of 96%
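
Before text reaches an ANN or LSTM, each review has to become a fixed-length sequence of integer word ids. The plain-Python sketch below illustrates that idea only; it is not Keras code, and `fit_word_index` and `texts_to_padded` are hypothetical helper names. Padding and truncating on the left is an assumption chosen to mirror a common default.

```python
from collections import Counter

def fit_word_index(texts, num_words=None):
    """Assign integer ids to words, most frequent first (id 0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if num_words is not None:
        ranked = ranked[:num_words]
    return {word: i + 1 for i, (word, _) in enumerate(ranked)}

def texts_to_padded(texts, word_index, maxlen):
    """Convert texts to fixed-length id lists: drop unknown words, left-pad with 0."""
    rows = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[-maxlen:]  # keep only the last maxlen tokens
        rows.append([0] * (maxlen - len(ids)) + ids)
    return rows

texts = ["great coffee great taste", "bad taste"]
word_index = fit_word_index(texts)
padded = texts_to_padded(texts, word_index, maxlen=5)
```

Each row of `padded` has the same length, so the rows can be stacked into one array and fed to an embedding layer followed by an LSTM.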


4. Conclusion

The Amazon Fine Food Review dataset is an incredible one. It allowed me to use all my skills: statistical analysis, supervised learning, unsupervised learning, machine learning algorithms, and deep learning.
From this dataset, I learned that deep learning makes many steps simpler. Vectorizing words with the Keras Tokenizer class is faster than the traditional TF-IDF method. Also, the RNN-LSTM uses the vectorized plaintext reviews to achieve a better accuracy score for future sentiment analysis
My work is useful for all types of e-commerce because both the strategy team and the customer service team can apply it to improve the business.

apham15 / Amazon_Fine_Food_Review
