DSCI-310 / data-analysis-review-2021


Submission: 1: Predicting students’ grades using multi-variable regression #1

Open ttimbers opened 2 years ago

ttimbers commented 2 years ago

Submitting authors: @TheAMIZZguy @danielhou13 @TimothyZG @gzzen

Repository: https://github.com/DSCI-310/DSCI-310-Group-1

Abstract/executive summary:

For this project, we focus on predicting students' final grades. Being able to predict the final grade efficiently allows a student to track their current progress and plan in advance. The dataset is hosted at the UCI ML Repo. We are particularly interested in how the features provided in the data contribute to the prediction of students' final grade, G3.

Since we are using a mixture of categorical and numeric variables to predict a quantitative result, the concept of least-squares regression analysis from DSCI 100 could be implemented and extended to fit our context.

We performed an 80-20 split on the dataset and trained a multi-variable least-squares regression model on the training data with the 9 features we selected for the model. Our method of choice is Ridge regression, a regularized variant of linear least squares that penalizes large coefficients and is therefore less prone to producing unexpected coefficient estimates.
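The setup described above can be sketched as follows. This is a minimal illustration, not the authors' actual code: the column names (`studytime`, `Mjob`, `G3`) and the synthetic data are placeholders standing in for the real UCI student dataset, and the preprocessing choices are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the student dataset (placeholder columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "studytime": rng.integers(1, 5, 200),                      # numeric feature
    "Mjob": rng.choice(["teacher", "health", "other"], 200),   # categorical feature
    "G3": rng.integers(0, 21, 200),                            # final grade, 0-20
})

X, y = df.drop(columns="G3"), df["G3"]

# The 80-20 train/test split described in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# One-hot encode categoricals, pass numerics through, then fit Ridge.
model = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["Mjob"])],
        remainder="passthrough")),
    ("ridge", Ridge(alpha=1.0)),
])
model.fit(X_train, y_train)
```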

We tested the model with cross-validation and obtained an average CV score of -4.61, i.e. an error of 4.61 (the scorer reports errors negated), and a final RMSE of 3.83.
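A negative CV score like -4.61 is the sign convention scikit-learn uses for error-based scorers: errors are negated so that higher is always better. A minimal sketch of that convention, on synthetic data (the 9-feature shape mirrors the report, but the numbers are invented):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data with 9 features, as in the report.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 9))
y = X @ rng.normal(size=9) + rng.normal(scale=3.0, size=200)

# Scores come back negated (e.g. an RMSE of 4.61 is reported as -4.61).
scores = cross_val_score(Ridge(), X, y, cv=5,
                         scoring="neg_root_mean_squared_error")

# Flip the sign back to read it as an ordinary error.
mean_error = -scores.mean()
```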

Editor: @ttimbers

Reviewer: @rlaze @asmdrk @ayashaa

ayashaa commented 2 years ago

Data analysis review checklist

Reviewer: ayashaa

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 3

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

This was a very well written report with an interesting topic! I enjoyed reading and learning about the factors that may predict grades, and the analysis ran smoothly on my end. I am not entirely sure if this was a bug on my end from running it locally, but after running `make all`, there was no single report output in the `_build` folder. Instead, there were several HTML files, one for each section (intro, methods, etc.), which made it a bit more difficult to read through the whole report.

I did not check off style guidelines because, while many of them were followed, there were some style inconsistencies throughout the code and functions. For example, some function file names do not match the function definition, e.g. `plotSquareData.py` vs. `plot_square_data`. Furthermore, some variable names follow camelCase while others follow snake_case.

Finally, I think it would be interesting to expand your results/discussion section, specifically the paragraphs discussing the features that impact truly high grades. I think this is a really interesting part of your analysis, and I would love to see a more in-depth treatment, such as a discussion of which maternal jobs have a positive vs. negative effect, or even some research/speculation as to why the mother's job has a higher impact than the father's, why being in a relationship has a negative impact, etc. Overall, great work!! 👍

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

rlaze commented 2 years ago

Data analysis review checklist

Reviewer: rlaze

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

The analysis was well done and presents several interesting questions. It would be very interesting to see how combinations of features affect the final grade prediction, as that could give more interpretable results than comparing them all individually.

The code itself is readable and well commented, but doesn't exactly fit this class's guidelines: some functions use different formats for documentation, and not all of them include examples of how to run them. These issues are to be expected when multiple people work on a project, and adopting a standard style guide could fix them. I checked the box for documentation since it provides enough information to understand the code, but this could be an issue for final grading. The tests are well done and cover edge cases; the only feedback I would give is to attach error messages to the assert statements themselves, rather than using separate print statements, to simplify the output.
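The suggestion about assert messages can be illustrated with a small hypothetical test (the `rmse` helper and the test function here are invented for illustration, not taken from the group's code):

```python
def rmse(predictions, targets):
    """Root-mean-squared error of two equal-length sequences."""
    n = len(predictions)
    return (sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n) ** 0.5

def test_rmse_zero_for_perfect_predictions():
    result = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
    # The message is part of the assert itself, so it only appears when the
    # assertion fails -- no stray print output on a passing run.
    assert result == 0.0, f"expected RMSE 0.0 for perfect predictions, got {result}"

test_rmse_zero_for_perfect_predictions()
```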

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

asmdrk commented 2 years ago

Data analysis review checklist

Reviewer: asmdrk

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1-2

Review Comments:

This was a great analysis. The introduction in particular set up the analysis really well, clearly outlining the question being explored and its importance and potential uses. In fact, the quality of writing throughout the report was great, clearly explaining the choices and decisions made for the analysis and why, and explaining their significance in an easy-to-understand manner.

There was an odd issue where the analysis seems to be split up into different documents instead of one single file, which is why I did not check readability. While I did not have any problem navigating it, someone less experienced with data science or statistics might not be sure what order to read the report in (the exploratory analysis, for example, might be hard to place for someone not familiar with data science), so it would be helpful (and convenient) for the reader to collate the analysis into a single file.

Finally, while the conclusion drawn from the results was good, I think the model could be assessed a little more thoroughly. RMSE is a good way to assess performance, but something like a residual plot could also be used to check for any bias in the model. Overall, there are only very minor issues in what is a great report! Well done!
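The residual-plot check suggested above could look roughly like the following. This is a sketch on synthetic data (the 9-feature shape mirrors the report, but everything else is invented); the idea is that an unbiased model shows a flat band of residuals around zero, while a trend or curve in the scatter signals bias.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this when running interactively
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the student dataset.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 9))
y = X @ rng.normal(size=9) + rng.normal(scale=3.0, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = Ridge().fit(X_train, y_train)
predicted = model.predict(X_test)
residuals = y_test - predicted

# Residuals vs. predicted values: look for structure around the zero line.
fig, ax = plt.subplots()
ax.scatter(predicted, residuals, alpha=0.6)
ax.axhline(0, linestyle="--", color="grey")
ax.set_xlabel("Predicted G3")
ax.set_ylabel("Residual (actual - predicted)")
fig.savefig("residuals.png")
```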

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.