A data analysis project aiming to predict purchasing intentions of online shoppers.
In this project, we compare 3 different algorithms with the aim of
building a classification model to predict purchasing intentions of
online shoppers given browser session data and pageview data. To see which
approach better addresses the class imbalance issue, we also ran each
algorithm in two ways: weighting the classes so that they are “equal”
(class_weight="balanced") and oversampling with SMOTE. Given our dataset, the random
forest classifier with SMOTE was identified as the best model compared
to support vector machine and logistic regression classifiers. Although
the model performed relatively well in terms of accuracy with a score of
0.89, its performance was weaker when scored using the F1 metric.
Specifically, the model had an F1 score of 0.67 and misclassified 336
observations, 146 of which were false negatives. These 146 false negatives
are significant as they represent potential sources of
missed revenue for e-commerce businesses. Therefore, we recommend
improving this model prior to real-world deployment.
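The gap between accuracy and F1 reported above is typical of imbalanced data. The following is a minimal sketch of how both metrics follow from confusion-matrix counts. The function names and the toy counts are illustrative assumptions and are not taken from the project's code or results; only the 336 and 146 figures come from the text above.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return the fraction of all observations that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Stated in the text: 336 misclassified observations, 146 of them false
# negatives, so the remaining misclassifications are false positives.
FALSE_POSITIVES = 336 - 146  # 190

# TOY counts for illustration only (NOT the project's confusion matrix):
# 100 sessions, of which only 14 are truly positive.
toy_tp, toy_tn, toy_fp, toy_fn = 8, 82, 4, 6

toy_accuracy = accuracy(toy_tp, toy_tn, toy_fp, toy_fn)
toy_precision, toy_recall, toy_f1 = precision_recall_f1(toy_tp, toy_fp, toy_fn)
print(f"accuracy={toy_accuracy:.2f}  f1={toy_f1:.2f}")
```

With these toy counts, accuracy is 0.90 while F1 is about 0.62: the many true negatives raise accuracy, but F1 ignores them and is pulled down by the false negatives and false positives.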
The data set used in this project is the “Online Shoppers Purchasing
Intention” dataset provided by the Gözalan Group and used by Sakar et al. in
their analysis published in Neural Computing and Applications
(Sakar et al. 2019). The data set was sourced from UCI’s Machine
Learning Repository (Dua and Graff 2017) at this link.
The specific file used for the analysis can be found here.
Each row in the data set contains pageview and session information
belonging to a different visitor who browsed Columbia, an online e-commerce
platform based in Turkey. Each row also contains the target class REVENUE, a
boolean flag indicating whether that user session contributed to revenue
or not. Pageview data includes the types of pages that the visitor
browsed and the time spent on each page. Session information
includes visitor-identifying data such as browser information, operating
system information, as well as visitor location and type.
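Because the target is a boolean flag with imbalanced classes, it can help to see what the class_weight="balanced" setting does with such labels. The sketch below assumes the models are scikit-learn estimators and reimplements the heuristic scikit-learn documents for that setting, n_samples / (n_classes * class_count), using only the standard library. The function name and the toy labels are hypothetical and are not the project's data.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def balanced_class_weights(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (NOT the project's data): 8 sessions without revenue, 2 with.
toy_revenue = [False] * 8 + [True] * 2
weights = balanced_class_weights(toy_revenue)
print(weights)
```

In this toy example the minority class (True) receives a weight of 2.5 and the majority class (False) a weight of 0.625, so errors on revenue sessions cost four times as much during training.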
The final report can be found here.
To replicate this analysis, please follow these steps:
Note: the instructions in this section depend on being run in a Unix shell (e.g., Terminal or Git Bash).
To replicate the analysis with Docker, install Docker, then run the following command at the command line/terminal from the root directory of this project:
docker run --rm -v /$(pwd):/home/online_shopping_predictor lephanthuymai/dsci_522_group_31:latest make -C /home/online_shopping_predictor all
To reset the repo to a clean state, with no intermediate or results files, run the following command at the command line/terminal from the root directory of this project:
docker run --rm -v /$(pwd):/home/online_shopping_predictor lephanthuymai/dsci_522_group_31:latest make -C /home/online_shopping_predictor clean
Alternatively, to run the analysis without Docker, create and activate a conda environment using the env.yaml file at the root of this project by running the following commands from the root directory of the project. (You can also install the dependencies listed in the env.yaml file manually.)
conda env create --file env.yaml
conda activate online_shopping_prediction_env
Then run the following command at the command line/terminal from the root directory of this project:
make all
To reset the project to a clean state before re-running the analysis, run the following command at the command line/terminal from the root directory of the project:
make clean
Please see the project dependencies listed in the env.yaml file at the root of this project.
If you choose not to use Docker to run the project, you also need to have Chromium 88 installed. This is required for generating and saving plots with altair.
All Online Purchasing Intention Predictor materials are made available under the Creative Commons Attribution 2.5 Canada License (CC BY 2.5 CA).