DSCI-310 / data-analysis-review-2021


Submission: 1: Predicting students’ grades using multi-variable regression #1

Open ttimbers opened 2 years ago

ttimbers commented 2 years ago

Submitting authors: @TheAMIZZguy @danielhou13 @TimothyZG @gzzen

Repository: https://github.com/DSCI-310/DSCI-310-Group-1

Abstract/executive summary:

For this project, we focus on predicting students' final grades. Being able to predict the final grade efficiently allows a student to track their current progress and plan in advance. The dataset is hosted at the UCI ML Repo. We are particularly interested in how the features provided in the data contribute to the prediction of students' final grade, G3.

Since we are using a mixture of categorical and numeric variables to predict a quantitative result, the concept of least-squares regression analysis from DSCI 100 could be implemented and extended to fit our context.

We performed an 80-20 split on the dataset and trained a multi-variable least-squares regression model on the training data with the 9 features we selected for the model. Our method of choice is Ridge regression, a regularized variant of linear least squares that penalizes large coefficients and is therefore less prone to producing unexpected coefficient estimates.
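The setup described above can be sketched as follows. This is a minimal illustration, not the authors' actual code: the column names (`studytime`, `Mjob`, `G3`) and the synthetic data are placeholders standing in for the real UCI student dataset, and the preprocessing choices are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the student dataset (placeholder columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "studytime": rng.integers(1, 5, 200),                      # numeric feature
    "Mjob": rng.choice(["teacher", "health", "other"], 200),   # categorical feature
    "G3": rng.integers(0, 21, 200),                            # final grade, 0-20
})

X, y = df.drop(columns="G3"), df["G3"]

# The 80-20 train/test split described in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# One-hot encode categoricals, pass numerics through, then fit Ridge.
model = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["Mjob"])],
        remainder="passthrough")),
    ("ridge", Ridge(alpha=1.0)),
])
model.fit(X_train, y_train)
```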

We tested the model with cross-validation and obtained an average CV score of -4.61, i.e. an error of 4.61 (the scorer reports errors negated), and a final RMSE of 3.83.
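A negative CV score like -4.61 is the sign convention scikit-learn uses for error-based scorers: errors are negated so that higher is always better. A minimal sketch of that convention, on synthetic data (the 9-feature shape mirrors the report, but the numbers are invented):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data with 9 features, as in the report.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 9))
y = X @ rng.normal(size=9) + rng.normal(scale=3.0, size=200)

# Scores come back negated (e.g. an RMSE of 4.61 is reported as -4.61).
scores = cross_val_score(Ridge(), X, y, cv=5,
                         scoring="neg_root_mean_squared_error")

# Flip the sign back to read it as an ordinary error.
mean_error = -scores.mean()
```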

Editor: @ttimbers

Reviewer: @rlaze @asmdrk @ayashaa

ayashaa commented 2 years ago

Data analysis review checklist

Reviewer: ayashaa

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 3

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

This was a very well written report with an interesting topic! I enjoyed reading and learning about the factors that may predict grades, and the analysis ran smoothly on my end. I am not entirely sure if this was a bug on my end from running it locally, but after running `make all`, there was no single report output in the `_build` folder. Instead, there were several HTML files, one for each section (intro, methods, etc.), which made it a bit more difficult to read through the whole report.

I did not check off style guidelines because, while many of them were followed, there were some style inconsistencies throughout the code and functions. For example, some function file names do not match the function definition, e.g. `plotSquareData.py` vs. `plot_square_data`. Furthermore, some variable names follow camelCase while others follow snake_case.

Finally, I think it would be interesting to expand your results/discussion section, specifically the paragraphs discussing the features that impact truly high grades. I think this is a really interesting part of your analysis, and I would love to see a more in-depth treatment, such as a discussion of which maternal jobs have a positive vs. negative effect, or even some research/speculation as to why the mother's job has a higher impact than the father's, why being in a relationship has a negative impact, etc. Overall, great work!! 👍

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

rlaze commented 2 years ago

Data analysis review checklist

Reviewer: rlaze

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

The analysis was well done and presents several interesting questions. It would be very interesting to see how combinations of features affect the final grade prediction, as that could give more interpretable results than comparing them all individually.

The code itself is readable and well commented, but doesn't exactly fit this class's guidelines: some functions use different formats for documentation, and not all of them include examples of how to run them. These issues are to be expected when multiple people work on a project, and adopting a standard style guide could fix them. I checked the box for documentation since it provides enough information to understand the code, but this could be an issue for final grading. The tests are well done and cover edge cases; the only feedback I would give is to attach error messages to the assert statements themselves, rather than using separate print statements, to simplify the output.
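The suggestion about assert messages can be illustrated with a small hypothetical test (the `rmse` helper and the test function here are invented for illustration, not taken from the group's code):

```python
def rmse(predictions, targets):
    """Root-mean-squared error of two equal-length sequences."""
    n = len(predictions)
    return (sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n) ** 0.5

def test_rmse_zero_for_perfect_predictions():
    result = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
    # The message is part of the assert itself, so it only appears when the
    # assertion fails -- no stray print output on a passing run.
    assert result == 0.0, f"expected RMSE 0.0 for perfect predictions, got {result}"

test_rmse_zero_for_perfect_predictions()
```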

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

asmdrk commented 2 years ago

Data analysis review checklist

Reviewer: asmdrk

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1-2

Review Comments:

This was a great analysis. The introduction in particular set up the analysis really well, clearly outlining the question being explored and its importance and potential uses. In fact, the quality of writing throughout the report was great, clearly explaining the choices and decisions made for the analysis and why, and explaining their significance in an easy-to-understand manner.

There was an odd issue where the analysis seems to be split up into different documents instead of one single file, which is why I did not check readability. While I did not have any problem navigating it, someone less experienced with data science or statistics might not be sure what order to read the report in (the exploratory analysis, for example, might be hard to place for someone not familiar with data science), so it would be helpful (and convenient) for the reader to collate the analysis into a single file.

Finally, while the conclusion drawn from the results was good, I think the model could be assessed a little more thoroughly. RMSE is a good way to assess performance, but something like a residual plot could also be used to check for any bias in the model. Overall, there are only very minor issues in what is a great report! Well done!
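The residual-plot check suggested above could look roughly like the following. This is a sketch on synthetic data (the 9-feature shape mirrors the report, but everything else is invented); the idea is that an unbiased model shows a flat band of residuals around zero, while a trend or curve in the scatter signals bias.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this when running interactively
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the student dataset.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 9))
y = X @ rng.normal(size=9) + rng.normal(scale=3.0, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = Ridge().fit(X_train, y_train)
predicted = model.predict(X_test)
residuals = y_test - predicted

# Residuals vs. predicted values: look for structure around the zero line.
fig, ax = plt.subplots()
ax.scatter(predicted, residuals, alpha=0.6)
ax.axhline(0, linestyle="--", color="grey")
ax.set_xlabel("Predicted G3")
ax.set_ylabel("Residual (actual - predicted)")
fig.savefig("residuals.png")
```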

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.