UBC-MDS / data-analysis-review-2022

Submission: Group 6: Student Dropout Predictor #9

Open tiger12055 opened 1 year ago

tiger12055 commented 1 year ago

Submitting authors: @ranjitprakash1986, @caesarw0, @zchen156, @tiger12055

Repository: https://github.com/UBC-MDS/dropout-predictions

Report link: https://github.com/UBC-MDS/dropout-predictions/blob/main/doc/The_Report_of_Dropout_Prediction.html

Abstract/executive summary:

Academic performance/graduation in a population is an important factor in overall employability, which contributes towards economic development. This project predicts student dropout given factors on demography, socioeconomics, macroeconomics, and relevant academic data provided by the student on enrollment. This prediction is important for understanding a student's academic capacity. This knowledge can be used to identify key areas of development, such as the development of socially disadvantaged communities, improvement of academic programs, development of educational funding programs, etc. This project investigates the following research question: given a student's demographic, socioeconomic, macroeconomic, and relevant academic data, how accurately can we predict whether he/she will drop out of school?

Classification, performed through machine learning algorithms, deals with recognizing and grouping observations into categories. These algorithms detect patterns within existing datasets to help classify unseen and upcoming data. In this project, three classification algorithms (Naive Bayes, Logistic Regression, and Random Forest Classifier) were used on a real-life dataset to solve a two-class classification problem. The performance of these three algorithms was compared using the classification metric of recall. The Random Forest Classifier and Logistic Regression algorithms performed appreciably, with high recall scores of 0.8 and 0.83 respectively. The selection and further optimization of the best-performing algorithm is planned for the future milestones of this project.

Editor: @flor14
Reviewers: Rus Dimitrov, Jenit Jain, Morris Zhao, Ke Wang

mozhao0331 commented 1 year ago

Data analysis review checklist

Reviewer: mozhao0331

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

  1. I really like that in the final report you include many plots to give the audience a visual view of the results. For example, "Figure 8. Precision and Recall Curve." gives me a clear view of the three models' performance at different thresholds. However, the results folder contains too many things (e.g. png files, CSV files...). I think you could organize them into sub-folders, for example model, cv_results, and eda_results.
  2. You only have a "dropout_pred_env.yml" file but do not include a clearly stated list of dependencies in the README, so it would be better to have a dependencies section there. Additionally, the README uses conda deactivate before rendering the final report; it would be better to include all necessary packages in your environment file.
  3. In the EDA section, you drop one category (enrolled students). It would be better to explain why: what is the effect of dropping a category? You are losing 18% of your data by doing so. Maybe you could instead combine it with the Graduate category and name the new category others (see the first sketch after this list).
  4. In the Modeling section, there is the statement "The classification as a Dropout is considered a True Positive in the context of this project." I am not sure what you mean by that; I think you want to state that "The classification as a Dropout is considered Positive in the context of this project." Additionally, for results like the number of data points and the number of TP, it would be better to use inline R code.
  5. You state that your best interest is to minimize the number of False Negatives, so you are interested in high recall in this case. You may want to include the recall score in your "Table 2. Cross-Validation Results" as well, not only the f1-score (see the second sketch after this list).
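For comment 3, a minimal pandas sketch of the suggested merge (the column name "Target" and its levels are assumed from the report's description of the data; the toy frame stands in for the real dataset):

```python
import pandas as pd

# Toy stand-in for the raw data; "Target" and its levels are assumed.
df = pd.DataFrame({"Target": ["Dropout", "Graduate", "Enrolled", "Dropout"]})

# Fold Enrolled and Graduate into a single "Others" level instead of
# dropping the Enrolled rows outright.
df["Target"] = df["Target"].replace({"Enrolled": "Others", "Graduate": "Others"})
print(df["Target"].value_counts())
```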
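For comment 5, `cross_validate` accepts a list of metrics, so recall can be reported next to the f1-score; a sketch with synthetic stand-in data rather than the project's training split:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the project's training split.
X_train, y_train = make_classification(n_samples=500, random_state=522)

# Passing several metrics to `scoring` makes the CV table show each one.
cv_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_train,
    y_train,
    cv=5,
    scoring=["f1", "recall", "precision"],
)
print(cv_results["test_recall"].mean())
```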

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

RussDim commented 1 year ago

Data analysis review checklist

Reviewer: Ruslan Dimitrov @RussDim

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

Overall a very good project with a really interesting question. The dataset is well chosen, with sufficient data to accommodate a meaningful analysis. The team has put in a lot of effort to develop a usable model. I really like that they use multiple classifiers to verify and double-check the outcomes in order to ensure robustness.

The comments below are mostly about technical details which, if taken into account, should be easy to fix.

  1. EDA should be done on train_df only; nothing should be done with the test dataset. See this note from milestone 1 in the Project proposal section: Note - Remember, if you have a predictive research question, it is essential that you separate your dataset before you do any analysis. To be clear, you should NOT do any analysis - including preliminary EDA - on your test data. (See the first sketch after this list.)
  2. The EDA should include a list of the categorical feature levels. You can take it from the dataset home site or evaluate them as you've done for the target column.
  3. Also in the EDA, it would help to evaluate properties of the numeric features with .describe() or another function, to give an indication of what type of scaling, if any, is needed in the modeling stage (the first sketch after this list also covers this).
  4. Still on the EDA - I get a few errors when trying to run your notebook after installing your environment. One of the errors is from Fig 1. Target count Bar Plot. The error says: ValueError: count(Target) encoding field is specified without a type; the type cannot be inferred because it does not match any column in the data. (See the second sketch after this list.)
  5. Your usage scripts should test whether the destination folders exist and, if not, create them. This comment comes from my experience creating your environment from the yml and trying to run bash data_analysis_pipeline.sh from an empty folder containing only the .sh file, which resulted in errors informing me that the destination folders weren't created. Here is an example: $ bash data_analysis_pipeline.sh '<long path> \\testing\\download_data.py': [Errno 2] No such file or directory (see the third sketch after this list).
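For comments 1 and 3, a minimal sketch of splitting before any EDA and then summarising only the training portion (the file path and split parameters are illustrative, not the project's actual values):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative path; split the raw data before any analysis.
df = pd.read_csv("data/raw/data.csv")
train_df, test_df = train_test_split(df, test_size=0.2, random_state=522)

# All EDA from here on touches train_df only, e.g. summarising the
# numeric features to judge whether scaling is needed.
print(train_df.describe())
```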
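For comment 4, the ValueError comes from Altair being unable to infer a type for the count(Target) shorthand; spelling out the encoding types avoids it. A sketch with toy data:

```python
import altair as alt
import pandas as pd

df = pd.DataFrame({"Target": ["Dropout", "Graduate", "Dropout"]})

# Explicit types ("N" for nominal, "Q" for quantitative) let Altair
# build the bar chart without guessing the field type.
chart = alt.Chart(df).mark_bar().encode(
    x=alt.X("Target:N"),
    y=alt.Y("count():Q"),
)
```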
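For comment 5, one way to make the scripts robust to a fresh clone is to create each destination folder before writing to it; a sketch (the path is illustrative):

```python
import os

def ensure_dir(path):
    """Create the destination folder if it does not already exist."""
    os.makedirs(path, exist_ok=True)

# E.g. at the top of each script, before saving any output:
ensure_dir("results/")
```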

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

brabbit61 commented 1 year ago

Data analysis review checklist

Reviewer: brabbit61

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

  1. In the EDA, you have mentioned The Logistic Regression model provides the best combination of Precision and Recall for specific thresholds that can be leveraged to further improve the model performance, but this is not evident, since there is considerable overlap between the PR curves for RandomForestClassifier and LogisticRegression. Please specify the particular threshold value you have decided to use for the classification task.
  2. I cannot see the spread of values of the numerical features or the permissible categories for the categorical features.
  3. During the model training and hyperparameter optimisation, max_features and max_depth are randomly selected from a range. You must set a seed so that the code is reproducible (see the first sketch after this list).
  4. All scripts contain custom function definitions, but there is more room for modularization. E.g., instead of using a for loop to train the models, you can create a train_model() function that does the same job. This will improve the readability of the script by reducing the length of the functions and segregating the tasks performed on a single model (see the second sketch after this list).
  5. A figure for the feature correlation with respect to the target has been provided in the EDA, and you've highlighted that there are more negatively correlated features than positive ones, but no intuition or conclusion for this has been provided.
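For comment 3, a sketch of seeding a randomized search so the sampled max_depth / max_features values are reproducible (the parameter ranges and data here are stand-ins, not the project's):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=522)

# random_state on both the estimator and the search fixes the sampling.
param_dist = {"max_depth": randint(2, 20), "max_features": randint(1, 10)}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=522),
    param_dist,
    n_iter=5,
    cv=3,
    random_state=522,
)
search.fit(X, y)
print(search.best_params_)
```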
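For comment 4, a hypothetical train_model() helper that could replace the body of the training loop (the function name and signature are my suggestion, not the project's code):

```python
from sklearn.metrics import recall_score

def train_model(model, X_train, y_train, X_test, y_test):
    """Fit a single model and return its test recall."""
    model.fit(X_train, y_train)
    return recall_score(y_test, model.predict(X_test))

# One call per classifier replaces the inline for-loop body:
# results = {name: train_model(m, X_train, y_train, X_test, y_test)
#            for name, m in models.items()}
```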

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

kenuiuc commented 1 year ago

Data analysis review checklist

Reviewer: kenuiuc

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1h

Review Comments:

I'm not pointing out mistakes; instead I'd like to give some suggestions to potentially further improve the code quality. Running the pipeline from a fresh environment, the report-rendering step failed because the kableExtra R package was missing, while the rest of the pipeline ran through:

Quitting from lines 12-16 (The_Report_of_Dropout_Prediction.Rmd)
Error in library(kableExtra) : there is no package called 'kableExtra'
Calls: <Anonymous> ... withVisible -> eval_with_user_handlers -> eval -> eval -> library

Execution halted
make: *** [Makefile:36: doc/The_Report_of_Dropout_Prediction.html] Error 1
============================================================
        PR Curve Plotting
============================================================
============================================================
        PR Curve Plotting Completed
============================================================
============================================================
        ROC Curve Plotting
============================================================
============================================================
        ROC Curve Plotting Completed
============================================================
============================================================
        Confusion Matrix Plotting
============================================================
============================================================
        Confusion Matrix Plotting Completed
============================================================
           logisticRegression  NaiveBayes  RandomForestClassifier
Recall               0.833887    0.664452                0.803987
Precision            0.865517    0.888889                0.916667
F1                   0.849408    0.760456                0.856637
Accuracy             0.877410    0.826446                0.888430
============================================================
        Model Testing Completed - End of Testing
============================================================
python src/general_EDA.py --input_path="data/processed/train_eda.csv" --output_path="results/"
============================================================
        Begin General EDA
=
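A possible fix for the kableExtra error above is to ship the R report dependencies in the conda environment file; an illustrative excerpt, assuming the packages come from conda-forge (I have not verified this against the project's actual dropout_pred_env.yml):

```yaml
# dropout_pred_env.yml (illustrative excerpt)
channels:
  - conda-forge
dependencies:
  - r-base
  - r-rmarkdown
  - r-kableextra
```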

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

caesarw0 commented 1 year ago

Thanks, everyone, for the feedback. We're really grateful and appreciate you taking the time to share your thoughts with us. All the comments and suggestions are very useful and thoughtful and will help us take the project to the next level. We will discuss them within our team and hopefully address all the points mentioned, time permitting.

Have a wonderful weekend everyone!

caesarw0 commented 1 year ago

Thanks once again for the valuable comments on our project. After discussing with the team, we have made the following improvements in response to the feedback.

Project Clarification:

(already implemented before peer review, thus, no commit link)

  1. The project already uses the recall score as the scoring metric in the CV results

  2. EDA is done using the training data only

  3. The team agreed to keep this code documentation for consistency

Environment Setup:

  1. Added the package required to generate the report to the .yml file

  2. Removed the conda prefix

  3. Added the R dependencies

Code Changes:

  1. Added random state in the random forest model

Report Changes:

  1. Improvements were made to the framing of sentences in the report. The classification as Dropout is now termed 'positive' in the context of this analysis.

  2. The threshold used for Logistic Regression is the default of 0.5 as provided by sklearn. This is now explicitly conveyed in the report.

  3. Rephrased the positive/negative correlation sentence

  4. Moved the longer version of the Jupyter notebook to the archive

  5. Stated the EDA scope in the report