UBC-MDS / data-analysis-review-2022

Submission: Group 6: Student Dropout Predictor #9

Open tiger12055 opened 1 year ago

tiger12055 commented 1 year ago

Submitting authors: @ranjitprakash1986, @caesarw0, @zchen156, @tiger12055

Repository: https://github.com/UBC-MDS/dropout-predictions

Report link: https://github.com/UBC-MDS/dropout-predictions/blob/main/doc/The_Report_of_Dropout_Prediction.html

Abstract/executive summary:

Academic performance/graduation in a population is an important factor in overall employability, which contributes towards economic development. This project predicts student dropout given factors on demography, socioeconomics, macroeconomics, and relevant academic data provided by the student on enrollment. This prediction is important for understanding a student's academic capacity. This knowledge can be used to identify key areas of development, such as the development of socially disadvantaged communities, improvement of academic programs, development of educational funding programs, etc. This project investigates the following research question: given a student's demographic, socioeconomic, macroeconomic, and relevant academic data, how accurately can we predict whether he/she will drop out of school?

Classification, performed through machine learning algorithms, deals with recognizing and grouping observations into categories. These algorithms detect patterns within existing datasets to help classify unseen and upcoming data. In this project, three classification algorithms (Naive Bayes, Logistic Regression, and Random Forest Classifier) were used on a real-life dataset to solve a two-class classification problem. The performance of these three algorithms was compared using the classification metric of recall. The Random Forest Classifier and Logistic Regression algorithms performed appreciably, with high recall scores of 0.8 and 0.83 respectively. The selection and further optimization of the best-performing algorithm is planned for the future milestones of this project.

Editor: @flor14
Reviewers: Rus Dimitrov, Jenit Jain, Morris Zhao, Ke Wang

mozhao0331 commented 1 year ago

Data analysis review checklist

Reviewer: mozhao0331

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

  1. I really like that in the final report you include many plots to give the audience a visual view of the results. For example, "Figure 8. Precision and Recall Curve." gives me a clear view of the three models' performance at different thresholds. However, the results folder contains too many things (e.g. png files, CSV files...). I think you could organize them into sub-folders, for example model, cv_results, and eda_results.
  2. You only have a "dropout_pred_env.yml" file but do not include a clearly stated list of dependencies in the README, so it would be better to have a dependencies section there. Additionally, the README uses conda deactivate before rendering the final report; it would be better to include all necessary packages in your environment file.
  3. In the EDA section, you drop one category (enrolled students). It would be better to explain why: what is the effect of dropping a category? You are losing 18% of your data by doing so. Maybe you could instead combine it with the Graduate category and name the new category others (see the first sketch after this list).
  4. In the Modeling section, there is the statement "The classification as a Dropout is considered a True Positive in the context of this project." I am not sure what you mean by that; I think you want to state that "The classification as a Dropout is considered Positive in the context of this project." Additionally, for results like the number of data points and the number of TP, it would be better to use inline R code.
  5. You state that your best interest is to minimize the number of False Negatives, so you are interested in high recall in this case. You may want to include the recall score in your "Table 2. Cross-Validation Results" as well, not only the f1-score (see the second sketch after this list).
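For comment 3, a minimal pandas sketch of the suggested merge (the column name "Target" and its levels are assumed from the report's description of the data; the toy frame stands in for the real dataset):

```python
import pandas as pd

# Toy stand-in for the raw data; "Target" and its levels are assumed.
df = pd.DataFrame({"Target": ["Dropout", "Graduate", "Enrolled", "Dropout"]})

# Fold Enrolled and Graduate into a single "Others" level instead of
# dropping the Enrolled rows outright.
df["Target"] = df["Target"].replace({"Enrolled": "Others", "Graduate": "Others"})
print(df["Target"].value_counts())
```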
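For comment 5, `cross_validate` accepts a list of metrics, so recall can be reported next to the f1-score; a sketch with synthetic stand-in data rather than the project's training split:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the project's training split.
X_train, y_train = make_classification(n_samples=500, random_state=522)

# Passing several metrics to `scoring` makes the CV table show each one.
cv_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_train,
    y_train,
    cv=5,
    scoring=["f1", "recall", "precision"],
)
print(cv_results["test_recall"].mean())
```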

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

RussDim commented 1 year ago

Data analysis review checklist

Reviewer: Ruslan Dimitrov @RussDim

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

Overall a very good project with a really interesting question. The dataset is well chosen, with sufficient data to accommodate a meaningful analysis. The team has put in a lot of effort to develop a usable model. I really like that they use multiple classifiers to verify and double-check the outcomes in order to ensure robustness.

The comments below are mostly about technical details which, if taken into account, should be easy to fix.

  1. EDA should be done on train_df only; nothing should be done with the test dataset. See this note from milestone 1 in the Project proposal section: Note - Remember, if you have a predictive research question, it is essential that you separate your dataset before you do any analysis. To be clear, you should NOT do any analysis - including preliminary EDA - on your test data. (See the first sketch after this list.)
  2. The EDA should include a list of the categorical feature levels. You can take it from the dataset home site or evaluate them as you've done for the target column.
  3. Also in the EDA, it would help to evaluate properties of the numeric features with .describe() or another function, to give an indication of what type of scaling, if any, is needed in the modeling stage (the first sketch after this list also covers this).
  4. Still on the EDA - I get a few errors when trying to run your notebook after installing your environment. One of the errors is from Fig 1. Target count Bar Plot. The error says: ValueError: count(Target) encoding field is specified without a type; the type cannot be inferred because it does not match any column in the data. (See the second sketch after this list.)
  5. Your usage scripts should test whether the destination folders exist and, if not, create them. This comment comes from my experience creating your environment from the yml and trying to run bash data_analysis_pipeline.sh from an empty folder containing only the .sh file, which resulted in errors informing me that the destination folders weren't created. Here is an example: $ bash data_analysis_pipeline.sh '<long path> \\testing\\download_data.py': [Errno 2] No such file or directory (see the third sketch after this list).
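For comments 1 and 3, a minimal sketch of splitting before any EDA and then summarising only the training portion (the file path and split parameters are illustrative, not the project's actual values):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative path; split the raw data before any analysis.
df = pd.read_csv("data/raw/data.csv")
train_df, test_df = train_test_split(df, test_size=0.2, random_state=522)

# All EDA from here on touches train_df only, e.g. summarising the
# numeric features to judge whether scaling is needed.
print(train_df.describe())
```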
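For comment 4, the ValueError comes from Altair being unable to infer a type for the count(Target) shorthand; spelling out the encoding types avoids it. A sketch with toy data:

```python
import altair as alt
import pandas as pd

df = pd.DataFrame({"Target": ["Dropout", "Graduate", "Dropout"]})

# Explicit types ("N" for nominal, "Q" for quantitative) let Altair
# build the bar chart without guessing the field type.
chart = alt.Chart(df).mark_bar().encode(
    x=alt.X("Target:N"),
    y=alt.Y("count():Q"),
)
```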
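For comment 5, one way to make the scripts robust to a fresh clone is to create each destination folder before writing to it; a sketch (the path is illustrative):

```python
import os

def ensure_dir(path):
    """Create the destination folder if it does not already exist."""
    os.makedirs(path, exist_ok=True)

# E.g. at the top of each script, before saving any output:
ensure_dir("results/")
```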

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

brabbit61 commented 1 year ago

Data analysis review checklist

Reviewer: brabbit61

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

  1. In the EDA, you have mentioned The Logistic Regression model provides the best combination of Precision and Recall for specific thresholds that can be leveraged to further improve the model performance, but this is not evident, since there is considerable overlap between the PR curves for RandomForestClassifier and LogisticRegression. Please specify the particular threshold value you have decided to use for the classification task.
  2. I cannot see the spread of values of the numerical features or the permissible categories for the categorical features.
  3. During the model training and hyperparameter optimisation, max_features and max_depth are randomly selected from a range. You must set a seed so that the code is reproducible (see the first sketch after this list).
  4. All scripts contain custom function definitions, but there is more room for modularization. E.g., instead of using a for loop to train the models, you can create a train_model() function that does the same job. This will improve the readability of the script by reducing the length of the functions and segregating the tasks performed on a single model (see the second sketch after this list).
  5. A figure for the feature correlation with respect to the target has been provided in the EDA, and you've highlighted that there are more negatively correlated features than positive ones, but no intuition or conclusion for this has been provided.
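For comment 3, a sketch of seeding a randomized search so the sampled max_depth / max_features values are reproducible (the parameter ranges and data here are stand-ins, not the project's):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=522)

# random_state on both the estimator and the search fixes the sampling.
param_dist = {"max_depth": randint(2, 20), "max_features": randint(1, 10)}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=522),
    param_dist,
    n_iter=5,
    cv=3,
    random_state=522,
)
search.fit(X, y)
print(search.best_params_)
```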
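For comment 4, a hypothetical train_model() helper that could replace the body of the training loop (the function name and signature are my suggestion, not the project's code):

```python
from sklearn.metrics import recall_score

def train_model(model, X_train, y_train, X_test, y_test):
    """Fit a single model and return its test recall."""
    model.fit(X_train, y_train)
    return recall_score(y_test, model.predict(X_test))

# One call per classifier replaces the inline for-loop body:
# results = {name: train_model(m, X_train, y_train, X_test, y_test)
#            for name, m in models.items()}
```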

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

kenuiuc commented 1 year ago

Data analysis review checklist

Reviewer: kenuiuc

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1h

Review Comments:

I'm not pointing out mistakes; instead I'd like to give some suggestions to potentially further improve the code quality. Running the pipeline from a fresh environment, the report-rendering step failed because the kableExtra R package was missing, while the rest of the pipeline ran through:

Quitting from lines 12-16 (The_Report_of_Dropout_Prediction.Rmd)
Error in library(kableExtra) : there is no package called 'kableExtra'
Calls: <Anonymous> ... withVisible -> eval_with_user_handlers -> eval -> eval -> library

Execution halted
make: *** [Makefile:36: doc/The_Report_of_Dropout_Prediction.html] Error 1
============================================================
        PR Curve Plotting
============================================================
============================================================
        PR Curve Plotting Completed
============================================================
============================================================
        ROC Curve Plotting
============================================================
============================================================
        ROC Curve Plotting Completed
============================================================
============================================================
        Confusion Matrix Plotting
============================================================
============================================================
        Confusion Matrix Plotting Completed
============================================================
           logisticRegression  NaiveBayes  RandomForestClassifier
Recall               0.833887    0.664452                0.803987
Precision            0.865517    0.888889                0.916667
F1                   0.849408    0.760456                0.856637
Accuracy             0.877410    0.826446                0.888430
============================================================
        Model Testing Completed - End of Testing
============================================================
python src/general_EDA.py --input_path="data/processed/train_eda.csv" --output_path="results/"
============================================================
        Begin General EDA
=
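A possible fix for the kableExtra error above is to ship the R report dependencies in the conda environment file; an illustrative excerpt, assuming the packages come from conda-forge (I have not verified this against the project's actual dropout_pred_env.yml):

```yaml
# dropout_pred_env.yml (illustrative excerpt)
channels:
  - conda-forge
dependencies:
  - r-base
  - r-rmarkdown
  - r-kableextra
```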

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

caesarw0 commented 1 year ago

Thanks, everyone, for the feedback. We're really grateful and appreciate you taking the time to share your thoughts with us. All the comments and suggestions are very useful and thoughtful and will help us take the project to the next level. We will discuss them within our team and hopefully address all the points mentioned, time permitting.

Have a wonderful weekend everyone!

caesarw0 commented 1 year ago

Thanks once again for the valuable comments on our project. After discussing with the team, we have made the following improvements in response to the feedback.

Project Clarification:

(already implemented before peer review, thus, no commit link)

  1. The project already uses the recall score as the scoring metric in the CV results

  2. EDA is done using the training data only

  3. The team agreed to keep this code documentation for consistency

Environment Setup:

  1. Added the package required to generate the report to the .yml file

  2. Removed the conda prefix

  3. Added the R dependencies

Code Changes:

  1. Added random state in the random forest model

Report Changes:

  1. Improvements were made to the framing of sentences in the report. The classification as Dropout is now termed 'positive' in the context of this analysis.

  2. The threshold used for Logistic Regression is the default of 0.5 as provided by sklearn. This is now explicitly conveyed in the report.

  3. Rephrased the positive/negative correlation sentence

  4. Moved the longer version of the Jupyter notebook to the archive

  5. Stated the EDA scope in the report