tiger12055 opened this issue 2 years ago
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
Overall a very good project with a really interesting question. The dataset is well chosen, with sufficient data to accommodate a meaningful analysis. The team has put in a lot of effort to develop a usable model. I really like that they use multiple classifiers to verify and double-check the outcomes in order to guarantee robustness.
The comments below are mostly about technical things which, if taken into account, should be easy to fix.
ValueError: count(Target) encoding field is specified without a type; the type cannot be inferred because it does not match any column in the data.
$ bash data_analysis_pipeline.sh '<long path> \\testing\\download_data.py': [Errno 2] No such file or directory
For RandomForestClassifier and LogisticRegression, please try to specify the particular threshold value you have decided to use for the classification task. max_features and max_depth are randomly selected from a range; you must set a seed for these so that the code is reproducible.
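The reproducibility point above can be sketched as follows. This is a minimal illustration with toy data (not the project's actual dataset or search space), assuming the hyperparameters are sampled with scikit-learn's RandomizedSearchCV: both the estimator and the search itself need a fixed `random_state`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy data standing in for the project's training set (assumption)
X, y = make_classification(n_samples=200, random_state=0)

# Candidate ranges for the randomly sampled hyperparameters
param_dist = {"max_features": ["sqrt", "log2", None],
              "max_depth": list(range(2, 11))}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=123),  # seed the model
    param_distributions=param_dist,
    n_iter=5,
    random_state=123,  # seed the parameter sampling
    cv=3,
)
search.fit(X, y)
print(search.best_params_)  # identical on every rerun
```

With both seeds fixed, rerunning the script selects the same `max_features` and `max_depth` every time, so downstream results can be reproduced exactly.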
I'm not pointing out mistakes. Instead I'd like to give some suggestions to potentially further improve the code quality:
this conda env prefix is not needed
It would be better to move the proposal part of Readme into doc
The html file doesn't render very well in the GitHub code preview mode. Perhaps it would be better to publish it to GitHub Pages. You could also change the report from html to md format so it renders easily on both GitHub Pages and the GitHub source code preview.
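Assuming the report is rendered with rmarkdown, one way to get an md report is to switch the Rmd document's output format in its YAML header (a sketch; the report's actual header is not shown here):

```yaml
output:
  github_document: default  # emits .md that GitHub previews natively
```

`rmarkdown::github_document` writes GitHub-flavored Markdown alongside a local HTML preview, so the same source renders in the repo browser and on GitHub Pages.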
I feel like this is a bit over-documented.
The readme doesn't have a comprehensive dependency list; I'm getting the following error when I run make. The Python dependencies have been well handled by the conda env, but it would be great to list the required R packages as well.
Quitting from lines 12-16 (The_Report_of_Dropout_Prediction.Rmd)
Error in library(kableExtra) : there is no package called 'kableExtra'
Calls: <Anonymous> ... withVisible -> eval_with_user_handlers -> eval -> eval -> library
Execution halted
make: *** [Makefile:36: doc/The_Report_of_Dropout_Prediction.html] Error 1
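The missing kableExtra package could also be pinned in the conda environment file so that `make` works out of the box. A hypothetical excerpt (the project's actual env file name, channels, and versions are not shown here):

```yaml
# environment.yml — illustrative excerpt only
name: dropout-predictions
channels:
  - conda-forge
dependencies:
  - python=3.10
  - r-base
  - r-rmarkdown
  - r-kableextra   # required by The_Report_of_Dropout_Prediction.Rmd
```

conda-forge ships most CRAN packages under the `r-` prefix, which keeps the R and Python dependencies in one reproducible spec.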
============================================================
PR Curve Plotting
============================================================
============================================================
PR Curve Plotting Completed
============================================================
============================================================
ROC Curve Plotting
============================================================
============================================================
ROC Curve Plotting Completed
============================================================
============================================================
Confusion Matrix Plotting
============================================================
============================================================
Confusion Matrix Plotting Completed
============================================================
logisticRegression NaiveBayes RandomForestClassifier
Recall 0.833887 0.664452 0.803987
Precision 0.865517 0.888889 0.916667
F1 0.849408 0.760456 0.856637
Accuracy 0.877410 0.826446 0.888430
============================================================
Model Testing Completed - End of Testing
============================================================
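As a quick sanity check on the metrics table above, F1 is the harmonic mean of precision and recall; a small script (values transcribed from the log) confirms the reported numbers are internally consistent:

```python
# Reported test metrics, copied from the pipeline output above
metrics = {
    "logisticRegression":     {"recall": 0.833887, "precision": 0.865517, "f1": 0.849408},
    "NaiveBayes":             {"recall": 0.664452, "precision": 0.888889, "f1": 0.760456},
    "RandomForestClassifier": {"recall": 0.803987, "precision": 0.916667, "f1": 0.856637},
}

for name, m in metrics.items():
    # F1 = 2PR / (P + R), the harmonic mean of precision and recall
    harmonic = 2 * m["precision"] * m["recall"] / (m["precision"] + m["recall"])
    assert abs(harmonic - m["f1"]) < 1e-4, name
print("all reported F1 scores are consistent with precision/recall")
```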
python src/general_EDA.py --input_path="data/processed/train_eda.csv" --output_path="results/"
============================================================
Begin General EDA
============================================================
Thanks, everyone, for the feedback. We're really grateful and appreciate you taking the time to share your thoughts with us. All the comments and suggestions are very useful and thoughtful, and they will help us take the project to the next level. We will discuss the comments within our team and hopefully address all the points mentioned, time permitting.
Have a wonderful weekend everyone!
Thanks once again for the valuable comments on our project. After discussing with the team, we have made the following improvements in response to the feedback.
(already implemented before peer review, thus, no commit link)
The project is using recall score as the scoring metric in the CV result
EDA is done using training data.
The team agreed to keep this code documentation for consistency
The package required to generate the report is added to the .yml file
Removing conda prefix
Added R dependencies
Improvements were made to the framing of sentences in the report. The classification as Dropout is termed as a ‘positive’ in the context of this analysis.
The threshold used for Logistic Regression is the default of 0.5 as provided by Sklearn. This is explicitly conveyed in the report.
Rephrasing positive/negative correlation sentence
Moving the longer version of the Jupyter notebook to archive
Stating the EDA scope in the report
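The default 0.5 threshold mentioned above can also be written out explicitly in code. A minimal sketch with toy data (not the project's dataset), showing that scikit-learn's `predict()` for LogisticRegression is equivalent to thresholding `predict_proba` at 0.5:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the dropout dataset (assumption)
X, y = make_classification(n_samples=200, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

# predict() implicitly uses a 0.5 probability cutoff; naming the
# threshold makes the choice explicit and easy to tune later.
threshold = 0.5
proba = model.predict_proba(X)[:, 1]
explicit = (proba >= threshold).astype(int)

assert np.array_equal(explicit, model.predict(X))
```

Making the threshold a named variable also leaves room to trade precision against recall later, e.g. lowering it to catch more potential dropouts.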
Submitting authors: @ranjitprakash1986, @caesarw0, @zchen156, @tiger12055
Repository: https://github.com/UBC-MDS/dropout-predictions
Report link: https://github.com/UBC-MDS/dropout-predictions/blob/main/doc/The_Report_of_Dropout_Prediction.html
Abstract/executive summary:
Academic performance/graduation in a population is an important factor in overall employability, which contributes towards economic development. This project predicts student dropout given factors on demography, socioeconomics, macroeconomics, and relevant academic data provided by the student on enrollment. This prediction is important to understand the student's academic capacity. This important knowledge can be used to identify key areas of development, such as the development of socially disadvantaged communities, improvement of academic programs, development of educational funding programs, etc. This project will investigate the following research question: given a student's demography, socioeconomics, macroeconomics, and relevant academic data, how accurately can we predict whether he/she will drop out of school?
A classification task performed through machine learning algorithms deals with recognizing and grouping ideas into categories. These algorithms are used to detect patterns within existing datasets to help classify unseen and upcoming data. In this project, 3 classification algorithms, Naive Bayes, Logistic Regression, and Random Forest Classifier, were used on a real-life dataset to solve a two-class classification problem. The performance of these 3 algorithms was compared through the classification metric of recall. The Random Forest Classifier and Logistic Regression algorithms performed appreciably, with high recall scores of 0.80 and 0.83 respectively. The selection and further optimization of the best-performing algorithm is planned for the future milestones of this project.
Editor: @flor14 Reviewer: Rus Dimitrov, Jenit Jain, Morris Zhao, Ke Wang