Estimated hours spent reviewing: 1 hour
Documentation:

- analysis.ipynb should not be in the root directory. Maybe move it under /docs or /notebook? I've also noticed that there are 3 versions of the file present; it might be useful to only keep the necessary ones.

Code quality:

- check.py is quite ambiguous. There isn't any documentation, so it's hard to interpret what it's meant to do.
- plot-stacked-chart.py isn't named in the same way as the other scripts (minor detail).
- test_alpha_tuning.py and test_KNN_tuning.py need documentation in each test so it's easier to interpret what each test is doing (a brief sketch is included at the end of this review).
- README.md ?

Reproducibility (just a note):

Analysis report:
Overall, this project is very well written and covers all the essential bases. As per the comments above, most of the issues that I have spotted in your project are very minor and can be fixed relatively quickly. It is interesting that you have decided to use R Markdown to render your report; maybe it would've been better to do it in Jupyter Book, maybe it wouldn't. Well done!
This was derived from the JOSE review checklist and the ROpenSci review checklist.
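On the test-documentation point above, here is a minimal sketch of what a documented test could look like, assuming pytest-style tests (the test name, values, and assertion are made up and do not reflect the project's actual tuning logic):

```python
def test_alpha_grid_is_sorted():
    """Each test benefits from a short docstring like this one, stating
    the behaviour being checked: here, that the hyperparameter grid used
    for tuning is in ascending order."""
    alpha_grid = [0.01, 0.1, 1, 10]  # placeholder values, not the project's grid
    assert alpha_grid == sorted(alpha_grid)
```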
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
You could improve your project organization inside the src folder by placing the scripts in the order in which they should be run (for example, with numbered prefixes as sketched below). This helps someone who is checking your work, as they can simply look through the scripts and cross-check them without having to open your Makefile to find the order there.
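For illustration only (these file names are hypothetical, not the repository's actual scripts), a numbered prefix makes the intended run order visible at a glance:

```
src/
├── 01_download_data.py
├── 02_clean_data.py
├── 03_eda_plots.py
├── 04_tune_model.py
└── 05_evaluate_model.py
```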
Contributing file: This file is very simple, and while it gives basic instructions for how to contribute, it does not specify expectations for contributions. I would also recommend describing how contributions from direct contributors should differ from those of external contributors. Additionally, you mention contributing via a feature request and via a bug report; specify how the expectations for these contributions differ. In the case of creating a bug report, for example, you could include expectations for title clarity, the steps needed to reproduce the error, the behavior observed with the error, and details about configuration and environment (see the sketch below).
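As a rough illustration (the wording here is mine, not taken from the repository), these expectations could be spelled out in CONTRIBUTING.md along these lines:

```markdown
### Reporting a bug

Please open an issue that includes:

- a clear, one-line title summarizing the problem;
- the steps needed to reproduce the error, ideally as a minimal example;
- the behavior you observed, including any error message or traceback;
- details about your configuration and environment (OS, Python/R versions, package versions).
```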
check.py: I had some difficulty understanding what this file is used for. It would be helpful to add a comment or docstring at the beginning explaining the usage and purpose of this file (see the sketch below).
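For example, a module-level docstring along these lines would make the intent clear at a glance (the description and usage shown here are hypothetical, since check.py's actual purpose isn't documented):

```python
"""check.py

Hypothetical description: validate the downloaded data before the rest of
the analysis pipeline runs (column names, types, and missing values).

Hypothetical usage:
    python src/check.py <path-to-data>
"""
```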
Result folder: Some of the results are named in simple numerical order (plot1, plot2, plot3...) while others have descriptive file names (conf_mat, cv_plot...). This inconsistency is a little confusing to non-authors. I would recommend choosing one of the two styles and renaming the files accordingly.
Report: In the section "Variable Data types and Modifications", you show a histogram of the frequency distributions of the chosen variables and a matrix of correlations between features. While someone with a data analysis background will certainly and immediately understand the purpose of these two analyses, I would recommend giving a brief explanation of why they were included (before you display them) to make your report comprehensible to a wider audience.
Report: In "Frequency Distributions of the chosen variables", you add a note stating that "correlations between categorical features should be ignored as these are invalid". While the note is helpful, not displaying these correlations at all would make your report clearer and the interpretation easier to understand.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
While exploring the project, I came across a file called check.py. I had some problems understanding what the file is doing. I would suggest adding some documentation or comments to help an external reader understand the file better.
There were a few places where .ipynb_checkpoints folders were present. Although I don't think it matters much, I would suggest you either hide this folder using .gitignore or remove it, since it causes unnecessary repetition of the ipynb report (see the snippet below).
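A minimal way to do both (the commands assume the checkpoint folders sit at the paths git reports; adjust the paths as needed):

```
# Add this line to .gitignore so checkpoint copies are never committed again:
.ipynb_checkpoints/

# Then remove already-committed checkpoint folders from version control:
git rm -r --cached .ipynb_checkpoints
git commit -m "Stop tracking Jupyter checkpoint folders"
```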
I was quite impressed by the effort this team put into using branches. It showed that the team followed the course/project guidelines thoroughly. However, I would suggest deleting branches that are bug-free and have already been merged into main (see the commands below); it would be very time consuming for a reader who wanted to explore all the branches.
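For reference, merged branches can be cleaned up with standard git commands (the branch name below is hypothetical):

```
# Delete a merged branch locally
git branch -d old-feature-branch

# Delete the same branch on GitHub
git push origin --delete old-feature-branch
```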
There were a few plots that were described as plot1, plot2, etc., while some plots had descriptive titles/names. This is not a consistent approach, and I would suggest you follow one of the two conventions to help external readers understand your project better.
Overall, great job! You guys have adhered to the guidelines and created a very well structured project. I feel the suggestions are just minor changes to the repository and can be made quickly. I also liked that you used R Markdown to render your report, since it allows the results of R code to be inserted directly into formatted documents.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
I think this is a good project that stands well on the computational side. I found the functions and testing to be well developed and carefully thought out. The project is well structured, and the idea behind it was solid from the beginning. It seems like a project where everyone worked fluidly, which led to a result that does not read as a "glue" of separate parts. The project is easy to deploy thanks to the well-written README file, and the Makefile ran without errors. The conclusions are solid. Overall, I think the observations mentioned above would help improve this into a really good research paper (if that were the goal).
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Submitting authors: @nkoda @mahdiheydar @izk20 @harrysyz99
Repository: https://github.com/DSCI-310/DSCI-310-Group-10
Abstract/executive summary: A KNN classification model was applied to 2017 Canadian census data to predict whether an individual made money on their investments (true class) or broke even or lost money (false class), using their family size and whether they are the major income earner in their family as features.
All investments carry risk, so the rationale for this analysis was to gain insight into whether the pressures of being the main income earner in a family and having a larger family size have an influence on someone's investment outcomes. This could then be used to further analyze the risks associated with specific investments.
The KNN model was tuned on the number-of-neighbors hyperparameter; a value of k = 26 was used, yielding approximately 57% accuracy. Therefore, the model did not perform much better than a dummy classifier. The KNN classification model was not able to distinguish between individuals in the same family size group, unlike the pattern found in the actual data.
It is important to build other models, such as a support vector machine (SVM), or to carry out feature engineering or add other features that may serve as better predictors, in order to gain more solid results. This would enhance the investigation of the original research question of how family size, and whether an individual is the major income earner in their family, can be used to predict investment outcomes.
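For readers unfamiliar with the approach described above, here is a minimal sketch of the tune-then-compare-to-baseline workflow, assuming scikit-learn (the synthetic data and column meanings are placeholders, not the census data or exact code used in the project):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder features standing in for family size and
# major-income-earner status; the real project uses 2017 census data.
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(1, 8, 500), rng.integers(0, 2, 500)])
y = rng.integers(0, 2, 500)  # 1 = made money on investments, 0 = did not

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tune the number of neighbors by cross-validation
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": range(1, 51)},
    cv=5,
)
grid.fit(X_train, y_train)

# Compare the tuned model against a dummy baseline
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("best k:", grid.best_params_["n_neighbors"])
print("KNN accuracy:", grid.score(X_test, y_test))
print("baseline accuracy:", baseline.score(X_test, y_test))
```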
Editor: @ttimbers
Reviewer: @YellowPrawn @ClaudioETC @isabelalucas @Jaskaran1116