Authors (in alphabetical order): Katherine Chen, Hancheng Qin, Yili Tang, Bill Wan
This project aims to build a machine learning model to predict students' academic success.
The Student Success Predictor project addresses the critical issue of academic dropout and failure in higher education. We chose this topic because both dropout and educational failure in higher education are obstacles to economic growth, employment, competitiveness, and productivity, and they have a large impact on the lives of students and their families, higher education institutions, and society as a whole. The ultimate goal of this project is to implement targeted strategies and support systems that contribute to the reduction of academic dropout.
Throughout the project, we built machine learning models, including Support Vector Machines (SVM), Random Forest, and Logistic Regression (with L1 and L2 regularization), to predict whether a student might drop out.
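As a rough illustration of how these model families can be compared side by side, here is a minimal scikit-learn sketch. It is not the code from our notebooks. The synthetic data from make_classification, the binary target, the hyperparameter values, and the names models and cv_accuracy are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the student data (assumption: binary dropout target).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=6, random_state=0
)

# One candidate per model family; SVM and logistic regression are scaled first
# because both are sensitive to feature magnitudes.
models = {
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logreg_l1": make_pipeline(
        StandardScaler(), LogisticRegression(penalty="l1", solver="liblinear")
    ),
    "logreg_l2": make_pipeline(
        StandardScaler(), LogisticRegression(penalty="l2", solver="liblinear")
    ),
}

# Mean 5-fold cross-validated accuracy for each candidate.
cv_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in cv_accuracy.items():
    print(f"{name}: {score:.3f}")
```

The scores printed here describe the synthetic data only and say nothing about the results in our report.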
Because the dataset has a large number of inter-correlated features, our initial models showed signs of overfitting. We therefore applied dimensionality reduction and feature selection, namely Principal Component Analysis (PCA) and feature importance analysis, and fine-tuned the models' hyperparameters. The refined models performed better, as shown by a smaller gap between training and validation accuracy. Among the three models, SVM marginally outperformed the others, achieving an accuracy of 80% and an AUC score of 0.89. There is still room to improve model performance through additional feature engineering and more comprehensive hyperparameter tuning.
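The overfitting check described above can be sketched as follows: fit a model with and without PCA, then compare training accuracy, validation accuracy, and validation AUC. This is a hedged illustration on synthetic, deliberately redundant features, not our actual pipeline. The helper fit_and_score, the 95% explained-variance threshold, and the C values are hypothetical choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with many redundant (correlated) features (assumption).
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=5, n_redundant=15, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def fit_and_score(model):
    """Fit on the training split and report accuracy gap and validation AUC."""
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    # SVC exposes decision_function, which is enough to rank samples for AUC.
    val_auc = roc_auc_score(y_val, model.decision_function(X_val))
    return {
        "train_acc": train_acc,
        "val_acc": val_acc,
        "gap": train_acc - val_acc,
        "val_auc": val_auc,
    }


# Baseline: scaled features fed straight into a weakly regularized SVM.
baseline = fit_and_score(make_pipeline(StandardScaler(), SVC(C=10.0)))

# Reduced: keep the principal components explaining 95% of the variance.
reduced = fit_and_score(
    make_pipeline(StandardScaler(), PCA(n_components=0.95), SVC(C=1.0))
)

for label, result in (("baseline", baseline), ("pca", reduced)):
    print(label, {key: round(value, 3) for key, value in result.items()})
```

A smaller gap value for the second pipeline would suggest less overfitting, but whether that happens depends on the data and settings, so the numbers should be read from the output rather than assumed.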
Our dataset is sourced from the UCI Machine Learning Repository.
In the src directory, you will find four Jupyter notebooks: data_analysis_final_report.ipynb, data_analysis_model.ipynb, data_analysis_EDA.ipynb, and data_analysis_parameter_optimization.ipynb. For a comprehensive view of the analysis, execute data_analysis_final_report.ipynb, which integrates all individual parts. If you're interested in the specifics of each analytical segment, the other notebooks can be run separately to explore each in more detail.
We have compiled our analysis into a comprehensive report, which can be accessed through this link. Our report includes several charts and visualizations that effectively aid in understanding the data patterns and analytical results. We welcome any feedback and suggestions you may have.
Docker is a container solution used to manage the software dependencies for this project. The Docker image used for this project is based on the quay.io/jupyter/minimal-notebook:2023-11-19 image. Additional dependencies are specified in the Dockerfile.
Setup:
git clone git@github.com:UBC-MDS/Student_Success_Predict_Group15.git
docker compose up
Running the analysis:
make clean
make all
Clean up:
docker compose rm
The Student Success Predictor materials here are licensed under the MIT License. If re-using/re-mixing, please provide attribution and link to this webpage.
Please refer to the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success) for the dataset used in this project.