UBC-MDS / data-analysis-review-2021


Submission: GROUP_10: Online Shoppers Purchasing Intention #5

Open ytz opened 2 years ago

ytz commented 2 years ago

Submitting authors: @nicovandenhooff, @arijc76, @ytz

Repository: https://github.com/UBC-MDS/online-shoppers-purchasing-intention Report link: https://ubc-mds.github.io/online-shoppers-purchasing-intention/intro.html Abstract/executive summary: The research question that we are attempting to answer with our analysis is a predictive question, and is stated as follows:

Given clickstream and session data of a user who visits an e-commerce website, can we predict whether or not that visitor will make a purchase?

Nowadays, it is common for companies to sell their products online, with little to no physical presence such as a traditional brick and mortar store. Answering this question is critical for these types of companies in order to ensure that they are able to remain profitable. This information can be used to nudge a potential customer in real-time to complete an online purchase, increasing overall purchase conversion rates. Examples of nudges include highlighting popular products through social proof, and exit intent overlay on webpages.

Our final model is a tuned random forest. On the test set it produces 268 false positives and 88 false negatives, with a macro-average recall of 0.827 and a macro-average precision of 0.748, both above the 0.60 budget we set at the beginning of our project.
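For readers less familiar with these metrics: the macro averages reported above are the unweighted means of the per-class scores, which is why they are a reasonable target under class imbalance. A minimal sketch of the computation with scikit-learn, using illustrative toy counts (not the project's actual data or confusion matrix):

```python
# Toy example: macro-average recall/precision from a binary confusion matrix.
# Counts here are illustrative placeholders, not the project's results.
import numpy as np
from sklearn.metrics import precision_score, recall_score

# 80 true negatives-class samples, 20 positives; predictions give
# TN=70, FP=10, TP=15, FN=5.
y_true = np.array([0] * 80 + [1] * 20)
y_pred = np.array([0] * 70 + [1] * 10 + [1] * 15 + [0] * 5)

# Macro averaging: compute the score per class, then take the plain mean,
# so the minority (purchase) class counts as much as the majority class.
macro_recall = recall_score(y_true, y_pred, average="macro")       # (70/80 + 15/20) / 2
macro_precision = precision_score(y_true, y_pred, average="macro") # (70/75 + 15/25) / 2
```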

Editor: @flor14 Reviewer: @Sanchit120496, @MacyChan, @shivajena

Sanchit120496 commented 2 years ago

Data analysis review checklist

Reviewer: @Sanchit120496

Conflict of interest

Code of Conduct

General checks

Code quality

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

MacyChan commented 2 years ago

Data analysis review checklist

Reviewer: @MacyChan

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

~90 mins

Review Comments:

This is a practical topic, and I can clearly see real-life use cases for it in the e-commerce industry.

  1. The classification question is clearly defined, along with the reasons for interest and the motivation.
  2. Clear project plan, tools, and procedures for how you reached your final result. Was there a particular reason you picked those models?
  3. It would be better to have some visualisation in Introduction - Data Cleaning, for example visualising the outliers. (I guess it also relates to the Data Analysis - Distribution part, but the connection is not clearly pointed out.)
  4. Even though the README describes the data structure and explains the fields, it is hard to picture the data you are studying. The EDA (the big correlation graph / bar chart) is a little overwhelming. It would be nice to pick some important features and explain them in detail as well.
  5. The Model selection part is easy to follow. A nitpicky comment: the best score could be highlighted among the models, or some indicators/visualisations could show the differences in scoring.
  6. I appreciate the Statement of Future Direction section; I know what to look forward to in the upcoming release. =)

Since @Sanchit120496 focused on the scripts, I spent more of my time on the written report.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

shivajena commented 2 years ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

1.5

Review Comments:

I enjoyed reading the report, which is very well structured and highlights the importance of the analysis along with its practicality. For example, in the model selection metrics, linking the focus on particular error types to the business context was excellent; as an ex-management consultant, I cannot stress enough how important this is for convincing decision-makers at senior management levels. After reviewing the work, here are my observations on some of the sections:

  1. Feature Engineering: While it is good to see the new features created, they need a bit more explanation of the rationale behind the process, or why they were needed in the first place. Further, in the analysis, they could be evaluated for whether they are statistically meaningful to add, for example through ANOVA. This is a bit of a stretch, but it could be tried to add much more credibility.
  2. Model Selection: Hyperparameter tuning wasn't done for the different models, so the models were left at their default hyperparameter values. In that scenario, an individual model, particularly SVC (not logistic regression), may not compare fairly with tree-based classifiers such as XGBoost and random forest, which build multiple sub-trees and optimise the fit. Although it might be computationally very intensive, the best hyperparameters for each model could be tried, because under the default-parameter approach random forest will in most cases automatically stand out as the best classifier!
  3. The storytelling in the EDA could be aligned a bit more with the next step. While this has been attempted at an atomic level, an EDA summary section could convey the whole message crisply and lead into the next section. In other words, rather than putting key observations in subsections, you could summarise them briefly in one section for better comprehension.
  4. In presenting the distributions of the features, the x-axis scale could be truncated to ignore extreme values and make important aspects of the distributions visible, such as the extent of class imbalance and the central tendencies.
  5. The model tuning and results section comes to a somewhat abrupt end, without explaining the future scope of the classifications, the limitations of the analysis, or what else could be done for better predictions.
  6. The authors' names are missing from the report and should be added.
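The cross-model tuning suggested in point 2 could be sketched roughly as follows. The models, grids, and data here are illustrative assumptions, not the project's actual configuration; the idea is only that each candidate gets its own hyperparameter search before the comparison, so no model is judged on its defaults:

```python
# Illustrative sketch: tune each candidate model before comparing them,
# scoring on macro recall as in the report. Toy data and small grids only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Each entry: (estimator, hyperparameter grid). Grids are placeholders.
candidates = {
    "svc": (make_pipeline(StandardScaler(), SVC()),
            {"svc__C": [0.1, 1, 10]}),
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [50, 100], "max_depth": [None, 10]}),
}

# Run a grid search per model, then compare best cross-validated scores.
best_scores = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, scoring="recall_macro", cv=5)
    search.fit(X, y)
    best_scores[name] = search.best_score_
```

With this structure the final comparison is between each model's best tuned version, which addresses the concern that tree ensembles win by default against an untuned SVC.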

Otherwise, I think this is one of the best reports I have read, and commendable effort has been put in here. I must say I learnt quite a lot from your analysis, such as the smart use of feature engineering. All the best.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

ytz commented 2 years ago

😊 Thanks for the feedback!

  1. Regarding point no.6 from this comment, we have added our names to the report, as seen in https://github.com/UBC-MDS/online-shoppers-purchasing-intention/commit/91b1c6780d6c90717a289893d667e91634875099
  2. Regarding point no.3 from this comment, we have summarized key observations under data analysis in the report, as seen in https://github.com/UBC-MDS/online-shoppers-purchasing-intention/commit/96be4d8f273d9258b5ab949fc8e1489c76c0d909
  3. Regarding point no.1 from this comment, we have added the missing folders, using .gitkeep instead of a dummy text file, as seen in https://github.com/UBC-MDS/online-shoppers-purchasing-intention/commit/6c0383e1cfe4846736afeb6887443c8720b29780
  4. Regarding point no.5 from this comment, we have added a conclusion in our report, as seen in https://github.com/UBC-MDS/online-shoppers-purchasing-intention/commit/81c701f1b4da36fa2914972b01f14ed0959ad5ee