UBC-MDS / data-analysis-review-2021


Submission: GROUP_10: Online Shoppers Purchasing Intention #5

Open ytz opened 2 years ago

ytz commented 2 years ago

Submitting authors: @nicovandenhooff, @arijc76, @ytz

Repository: https://github.com/UBC-MDS/online-shoppers-purchasing-intention Report link: https://ubc-mds.github.io/online-shoppers-purchasing-intention/intro.html Abstract/executive summary: The research question that we are attempting to answer with our analysis is a predictive question, and is stated as follows:

Given clickstream and session data of a user who visits an e-commerce website, can we predict whether or not that visitor will make a purchase?

Nowadays, it is common for companies to sell their products online, with little to no physical presence such as a traditional brick and mortar store. Answering this question is critical for these types of companies in order to ensure that they are able to remain profitable. This information can be used to nudge a potential customer in real-time to complete an online purchase, increasing overall purchase conversion rates. Examples of nudges include highlighting popular products through social proof, and exit intent overlay on webpages.

Our final model is a tuned random forest. On the test set it produces 268 false positives and 88 false negatives, with a macro-average recall of 0.827 and a macro-average precision of 0.748, both above the 0.60 budget we set at the beginning of our project.
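For readers less familiar with these metrics: the macro averages reported above are the unweighted means of the per-class scores, which is why they are a reasonable target under class imbalance. A minimal sketch of the computation with scikit-learn, using illustrative toy counts (not the project's actual data or confusion matrix):

```python
# Toy example: macro-average recall/precision from a binary confusion matrix.
# Counts here are illustrative placeholders, not the project's results.
import numpy as np
from sklearn.metrics import precision_score, recall_score

# 80 true negatives-class samples, 20 positives; predictions give
# TN=70, FP=10, TP=15, FN=5.
y_true = np.array([0] * 80 + [1] * 20)
y_pred = np.array([0] * 70 + [1] * 10 + [1] * 15 + [0] * 5)

# Macro averaging: compute the score per class, then take the plain mean,
# so the minority (purchase) class counts as much as the majority class.
macro_recall = recall_score(y_true, y_pred, average="macro")       # (70/80 + 15/20) / 2
macro_precision = precision_score(y_true, y_pred, average="macro") # (70/75 + 15/25) / 2
```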

Editor: @flor14 Reviewer: @Sanchit120496, @MacyChan, @shivajena

Sanchit120496 commented 2 years ago

Data analysis review checklist

Reviewer: @Sanchit120496

Conflict of interest

Code of Conduct

General checks

Code quality

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

MacyChan commented 2 years ago

Data analysis review checklist

Reviewer: @MacyChan

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

~90 mins

Review Comments:

This is a practical topic, and I can clearly see real-life use cases for it in the e-commerce industry.

  1. The classification question is clearly defined, along with the reasons for interest and the motivation.
  2. Clear project plan, tools, and procedures for how you reached your final result. Was there a particular reason you picked those models?
  3. It would be better to have some visualisation in Introduction - Data Cleaning, for example visualising the outliers. (I guess it also relates to the Data Analysis - Distribution part, but the connection is not clearly pointed out.)
  4. Even though the README describes the data structure and explains the fields, it is hard to picture the data you are studying. The EDA (the big correlation graph / bar chart) is a little overwhelming. It would be nice to pick some important features and explain them in detail as well.
  5. The Model selection part is easy to follow. A nitpicky comment: the best score could be highlighted among the models, or some indicators/visualisations could show the differences in scoring.
  6. I appreciate the Statement of Future Direction section; I know what to look forward to in the upcoming release. =)

Since @Sanchit120496 focused on the scripts, I spent more of my time on the written report.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

shivajena commented 2 years ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

1.5

Review Comments:

I enjoyed reading the report, which is very well structured and highlights the importance of the analysis along with its practicality. For example, in the model selection metrics, linking the focus on particular error types to the business context was excellent; as an ex-management consultant, I cannot stress enough how important this is for convincing decision-makers at senior management levels. After reviewing the work, here are my observations on some of the sections:

  1. Feature Engineering: While it is good to see the new features created, they need a bit more explanation of the rationale behind the process, or why they were needed in the first place. Further, in the analysis, they could be evaluated for whether they are statistically meaningful to add, for example through ANOVA. This is a bit of a stretch, but it could be tried to add much more credibility.
  2. Model Selection: Hyperparameter tuning wasn't done for the different models, so the models were left at their default hyperparameter values. In that scenario, an individual model, particularly SVC (not logistic regression), may not compare fairly with tree-based classifiers such as XGBoost and random forest, which build multiple sub-trees and optimise the fit. Although it might be computationally very intensive, the best hyperparameters for each model could be tried, because under the default-parameter approach random forest will in most cases automatically stand out as the best classifier!
  3. The storytelling in the EDA could be aligned a bit more with the next step. While this has been attempted at an atomic level, an EDA summary section could convey the whole message crisply and lead into the next section. In other words, rather than putting key observations in subsections, you could summarise them briefly in one section for better comprehension.
  4. In presenting the distributions of the features, the x-axis scale could be truncated to ignore extreme values and make important aspects of the distributions visible, such as the extent of class imbalance and the central tendencies.
  5. The model tuning and results section comes to a somewhat abrupt end, without explaining the future scope of the classifications, the limitations of the analysis, or what else could be done for better predictions.
  6. The authors' names are missing from the report and should be added.
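The cross-model tuning suggested in point 2 could be sketched roughly as follows. The models, grids, and data here are illustrative assumptions, not the project's actual configuration; the idea is only that each candidate gets its own hyperparameter search before the comparison, so no model is judged on its defaults:

```python
# Illustrative sketch: tune each candidate model before comparing them,
# scoring on macro recall as in the report. Toy data and small grids only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Each entry: (estimator, hyperparameter grid). Grids are placeholders.
candidates = {
    "svc": (make_pipeline(StandardScaler(), SVC()),
            {"svc__C": [0.1, 1, 10]}),
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [50, 100], "max_depth": [None, 10]}),
}

# Run a grid search per model, then compare best cross-validated scores.
best_scores = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, scoring="recall_macro", cv=5)
    search.fit(X, y)
    best_scores[name] = search.best_score_
```

With this structure the final comparison is between each model's best tuned version, which addresses the concern that tree ensembles win by default against an untuned SVC.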

Otherwise, I think this is one of the best reports I have read, and commendable effort has been put in here. I must say I learnt quite a lot from your analysis, such as the smart use of feature engineering. All the best.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

ytz commented 2 years ago

😊 Thanks for the feedback!

  1. Regarding point no.6 from this comment, we have added our names to the report, as seen in https://github.com/UBC-MDS/online-shoppers-purchasing-intention/commit/91b1c6780d6c90717a289893d667e91634875099
  2. Regarding point no.3 from this comment, we have summarized key observations under data analysis in the report, as seen in https://github.com/UBC-MDS/online-shoppers-purchasing-intention/commit/96be4d8f273d9258b5ab949fc8e1489c76c0d909
  3. Regarding point no.1 from this comment, we have added the missing folders, using .gitkeep instead of a dummy text file, as seen in https://github.com/UBC-MDS/online-shoppers-purchasing-intention/commit/6c0383e1cfe4846736afeb6887443c8720b29780
  4. Regarding point no.5 from this comment, we have added a conclusion in our report, as seen in https://github.com/UBC-MDS/online-shoppers-purchasing-intention/commit/81c701f1b4da36fa2914972b01f14ed0959ad5ee