Inder Khera, Jenny Zhang, Jessica Kuo, Javier Martinez (alphabetically ordered)
In this study, we develop a classification model using the logistic regression (LR) algorithm to predict whether a patient has diabetes. Our final model performed reasonably well on an unseen test dataset, achieving an overall accuracy of 0.75: out of 218 test cases, it correctly classified 164. The remaining 54 predictions were incorrect. Of these, 19 were false positives (non-diabetic subjects classified as diabetic) and 35 were false negatives (diabetic patients the model failed to identify). Such errors could lead to either unnecessary treatment or delayed treatment, with the latter having more serious consequences, so we recommend further refinement of the model before it is deployed for clinical use.
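As a quick sanity check on the reported figures, the sketch below recomputes the error count and accuracy from the test-set counts. Only the four counts come from the report; the variable names are illustrative and are not taken from the project code.

```python
# Counts reported for the held-out test set.
n_test = 218          # total test cases
n_correct = 164       # correct predictions
false_positives = 19  # non-diabetic subjects predicted as diabetic
false_negatives = 35  # diabetic subjects predicted as non-diabetic

n_errors = false_positives + false_negatives
accuracy = n_correct / n_test
fn_share = false_negatives / n_errors

# Correct predictions and errors should account for every test case.
assert n_correct + n_errors == n_test

print(f"errors: {n_errors}")
print(f"accuracy: {accuracy:.2f}")
print(f"share of errors that are false negatives: {fn_share:.2f}")
```

False negatives make up roughly two thirds of the errors, which is the more costly kind of mistake in a screening setting.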
The dataset used in this project was created by Jack W Smith, JE Everhart, WC Dickson, WC Knowler, and RS Johannes. It was sourced from the National Library of Medicine database of the National Institutes of Health. Their analysis can be found here, and the dataset is available via Kaggle (Dua & Graff, 2017). Each row (observation) in the dataset is an individual who identifies as a member of the Pima (also known as the Akimel O'odham) Indigenous group, located mainly in central and southern Arizona. Each observation records features including age, BMI, blood pressure, number of pregnancies, and the Diabetes Pedigree Function (a score that summarizes how strongly a person's family history is associated with diabetes).
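To make the row structure concrete, here is a minimal sketch that parses rows in the column layout of the Kaggle CSV file. The two data rows are made-up placeholders, not records from the dataset, and the helper name `load_rows` is hypothetical; the project itself may load and validate the data differently.

```python
import csv
import io

# Column layout of the Kaggle CSV; Outcome is 1 for diabetic, 0 otherwise.
COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]

# Placeholder rows for illustration only (NOT real records from the dataset).
SAMPLE_CSV = """\
Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
2,120,70,25,80,28.5,0.45,33,0
5,160,82,30,0,35.1,0.80,47,1
"""


def load_rows(text):
    """Parse CSV text into a list of dicts with numeric values."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return [{name: float(value) for name, value in row.items()} for row in reader]


rows = load_rows(SAMPLE_CSV)
n_diabetic = sum(int(row["Outcome"]) for row in rows)
print(f"{len(rows)} rows, {n_diabetic} labelled diabetic")
```

Checking the header up front means a renamed or missing column fails loudly instead of silently producing a misaligned feature matrix.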
The final report can be found here or this webpage.
To replicate this analysis, follow the steps below. You can run the analysis using one of two methods: Docker or Conda.
Note: the instructions in this section must be run in a Unix-based shell.
First, clone this GitHub repository and navigate to its root directory:
git clone https://github.com/UBC-MDS/diabetes_predictor_py.git
cd diabetes_predictor_py
Prerequisites: Install Docker and ensure it is running on your system.
Build and run the Docker container using the provided script:
chmod +x ./builders/docker_magic_builder.sh
./builders/docker_magic_builder.sh
This will set up the Conda environment inside a Docker container and build the Docker image.
Once the container is running, open the JupyterLab link shown in the terminal (e.g., http://127.0.0.1:8888/lab?token={your_token}), then navigate to:
analysis/diabetes_analysis.ipynb
Under the Kernel menu, click:
Restart Kernel and Run All Cells...
Set up the Conda environment and run JupyterLab using the provided script:
chmod +x ./builders/conda_magic_builder.sh
./builders/conda_magic_builder.sh
Open:
analysis/diabetes_analysis.ipynb
Under Switch/Select Kernel, choose:
Python [conda env:diabetes_predictor]
Under the Kernel menu, click:
Restart Kernel and Run All Cells...
These steps ensure you can run the analysis seamlessly using either Docker or Conda.
Docker: Type Ctrl + C in the terminal where you launched the container, and then type docker compose rm to shut down the container and clean up the resources.
Conda: Type Ctrl + C in the terminal where JupyterLab was launched, type conda deactivate to exit the project environment, and then type conda env remove -n diabetes_predictor to delete the environment and clean up the resources.
environment.yml
Add the dependency to the environment.yml file on a new branch.
If the package is installed with pip, also add it to the Dockerfile with the command RUN pip install <package_name>==<version>
Run conda-lock -k explicit --file environment.yml -p linux-64 to update the conda-linux-64.lock file.
Re-run the scripts above using the Docker or Conda option.
The Diabetes Predictor report contained herein is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. See the license file for more information. If re-using or re-mixing, please provide attribution and a link to this webpage. The software code contained within this repository is licensed under the MIT license. See the license file for more information.
Dua, D., & Graff, C. (2017). Pima Indians Diabetes Database. UCI Machine Learning Repository. Retrieved from https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database/data.