UBC-MDS / data-analysis-review-2023

Submission: GROUP 15: Student Success Predictor #23

Open czwcandy opened 7 months ago

czwcandy commented 7 months ago

Submitting authors: @czwcandy @tangyl92 @hchqin @billwan96

Repository: https://github.com/UBC-MDS/Student_Success_Predict_Group15

Report link: https://ubc-mds.github.io/Student_Success_Predict_Group15/student_success.html

Abstract/executive summary: In our study, we developed machine learning models, including SVM, Random Forest, and Logistic Regression (with L1 and L2 regularization), to predict the likelihood of student academic dropout in higher education. Due to a high number of features and their inter-correlations, our models initially exhibited overfitting. To address this, we implemented feature selection techniques (PCA and feature importance analysis) along with optimization of model parameters. The refined models demonstrated improved performance, evidenced by a narrow gap between training and testing accuracy. Among the three, SVM marginally outperformed the others, achieving an accuracy of 80% and an AUC score of 0.89. Nonetheless, there is potential for further enhancement in model performance through additional feature engineering and more extensive parameter tuning.

Editor: @czwcandy Reviewer: <@Sampsonyu> <@hema2022ubc> <@sho-i98> <@lichunubc>

Sampsonyu commented 7 months ago

Data analysis review checklist

Reviewer: <@Sampsonyu>

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

Review Comments:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

hema2022ubc commented 7 months ago

Data analysis review checklist

Reviewer: <@hema2022ubc>

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

The project focuses on the problem of student academic dropout in higher education. It is well-organized, with a clear separation between scripts, data, and reports. The documentation provides essential guidance, and the use of Docker ensures good reproducibility. The writing is of good quality, concise, and informative, with the summary offering a clear overview. The methods are described directly, and the results are effectively communicated. Below are suggestions for improvement:

  1. Usage Via Docker: In the Usage Via Docker section, the commands for running the analysis in step 3 presume the user is within the scripts folder, while the command to build an HTML report in step 4 assumes the user is at the project root. The transition between directories might not be evident, potentially leading to confusion. Including explicit cd (change directory) commands within the instructions, or restructuring all commands to be executable from the same directory, would be advantageous.

  2. Analysis Report Enhancements: While the report thoroughly details the methodology and data analysis, expanding on the background of the problem and its significance would provide a more comprehensive understanding of the study. The font size in Fig.1 and Fig.2 is too small, making it difficult for readers to interpret the data. Increasing the font size for better legibility is recommended. Additionally, meaningful column names are essential for understanding data tables. The Unnamed: 0 columns in Fig.4 and Fig.5 should be given descriptive titles that accurately reflect the data they represent.

  3. Proofreading: Some errors, such as the presence of placeholder text in the first reference of the README and the environment.yml file not being updated, could be avoided with thorough proofreading.
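For the Unnamed: 0 columns mentioned in point 2, one possible fix is sketched below. It assumes the tables are pandas DataFrames saved to CSV and read back, which is the usual source of that header. The table, the index label `model`, and the scores are hypothetical placeholders, not taken from the project.

```python
import io

import pandas as pd

# Hypothetical results table; the scores are placeholders, not the project's results.
scores = pd.DataFrame(
    {"train_score": [0.5, 0.6], "test_score": [0.4, 0.5]},
    index=["model_a", "model_b"],
)

# Writing with an unlabeled index leaves the first header cell empty,
# so reading the file back yields a column called "Unnamed: 0".
raw = pd.read_csv(io.StringIO(scores.to_csv()))

# Fix at write time: label the index column explicitly.
labeled = pd.read_csv(io.StringIO(scores.to_csv(index_label="model")))

# Fix at read time: treat the first column as the index, then name it.
indexed = pd.read_csv(io.StringIO(scores.to_csv()), index_col=0).rename_axis("model")

print(list(raw.columns))
print(list(labeled.columns))
print(indexed.index.name)
```

Either fix works; labeling the index at write time keeps the saved CSV self-describing for anyone who opens it outside the report.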

lichunubc commented 7 months ago

Data analysis review checklist

Reviewer: <@lichunubc>

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Highlights - Things that impressed me:

  1. An extremely well-designed repository main page, with clearly structured content and intuitive emoji/icons that aid visual understanding.
  2. The use of a Random Forest model: your group has taken the initiative to do research and/or read ahead in the lecture notes in order to apply more advanced methods and reach a better solution.
  3. Figure 1 is visually stunning: the use of overlaid histograms, density charts, and other features produces content-rich graphs.

Suggestions:

  1. The fonts in some graphs are a bit too small to read (this could be an isolated problem that applies only to me), for example the legend of Figure 1 and the axis labels in Figure 2.
  2. I believe it would be helpful to briefly discuss the main findings of your model, such as which features are most significant in predicting dropout, in the Summary section (first paragraph) of the study, so that readers know what to expect as they read on.
  3. I think the report would benefit from explicitly displaying the magnitude of each feature's effect so that readers can gauge how much each feature matters.
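Suggestion 3 could be addressed with a table of feature magnitudes. The following is a minimal sketch, assuming scikit-learn and a logistic regression fitted on standardized features. The synthetic data, the binary target, and the feature names are placeholders, not the project's dataset or pipeline.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the student data (assumption: binary target).
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Standardizing first puts the coefficients on a comparable scale.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

# One coefficient per feature; rank them by absolute size.
coefs = pipe.named_steps["logisticregression"].coef_[0]
magnitudes = (
    pd.DataFrame({"feature": feature_names, "coefficient": coefs})
    .assign(abs_coefficient=lambda d: d["coefficient"].abs())
    .sort_values("abs_coefficient", ascending=False)
    .reset_index(drop=True)
)
print(magnitudes)
```

For tree-based models such as Random Forest, the fitted estimator's `feature_importances_` attribute plays a similar role, and `sklearn.inspection.permutation_importance` is an option for models without coefficients, such as an SVM with a non-linear kernel.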

Overall, great job Group 15. I have learned much from your project and I can't wait to read more!

sho-i98 commented 7 months ago

Data analysis review checklist

Reviewer: sho-i98

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Overall, a very thoughtful report with multiple modeling methods! I really enjoyed going through your analysis of student success. Your visualizations were very sophisticated and informative. Areas where you could improve your project:

It was a very good report, and I enjoyed reading it!
