Estimated hours spent reviewing: 1 hour
Documentation:

- analysis.ipynb should not be in the root directory. Maybe move it under /docs or /notebook? I've also noticed that there are 3 versions of the file present; it might be useful to only keep the necessary ones.

Code quality:

- check.py is quite ambiguous. There isn't any documentation, so it's hard to interpret what it's meant to do.
- plot-stacked-chart.py isn't named in the same way as the other scripts (minor detail).
- test_alpha_tuning.py and test_KNN_tuning.py need documentation in each test so it's easier to interpret what each test is doing (a brief sketch is included at the end of this review).
- README.md ?

Reproducibility (just a note):

Analysis report:
Overall, this project is very well written and covers all the essential bases. As per the comments above, most of the issues that I have spotted in your project are very minor and can be fixed relatively quickly. It is interesting that you have decided to use R Markdown to render your report; maybe it would've been better to do it in Jupyter Book, maybe it wouldn't. Well done!
This was derived from the JOSE review checklist and the ROpenSci review checklist.
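On the test-documentation point above, here is a minimal sketch of what a documented test could look like, assuming pytest-style tests (the test name, values, and assertion are made up and do not reflect the project's actual tuning logic):

```python
def test_alpha_grid_is_sorted():
    """Each test benefits from a short docstring like this one, stating
    the behaviour being checked: here, that the hyperparameter grid used
    for tuning is in ascending order."""
    alpha_grid = [0.01, 0.1, 1, 10]  # placeholder values, not the project's grid
    assert alpha_grid == sorted(alpha_grid)
```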
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
You could improve your project organization inside the src folder by placing the scripts in the order in which they should be run (for example, with numbered prefixes as sketched below). This helps someone who is checking your work, as they can simply look through the scripts and cross-check them without having to open your Makefile to find the order there.
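For illustration only (these file names are hypothetical, not the repository's actual scripts), a numbered prefix makes the intended run order visible at a glance:

```
src/
├── 01_download_data.py
├── 02_clean_data.py
├── 03_eda_plots.py
├── 04_tune_model.py
└── 05_evaluate_model.py
```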
Contributing file: This file is very simple, and while it gives basic instructions for how to contribute, it does not specify expectations for contributions. I would also recommend describing how contributions from direct contributors should differ from those of external contributors. Additionally, you mention contributing via a feature request and via a bug report; specify how the expectations for these contributions differ. In the case of creating a bug report, for example, you could include expectations for title clarity, the steps needed to reproduce the error, the behavior observed with the error, and details about configuration and environment (see the sketch below).
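As a rough illustration (the wording here is mine, not taken from the repository), these expectations could be spelled out in CONTRIBUTING.md along these lines:

```markdown
### Reporting a bug

Please open an issue that includes:

- a clear, one-line title summarizing the problem;
- the steps needed to reproduce the error, ideally as a minimal example;
- the behavior you observed, including any error message or traceback;
- details about your configuration and environment (OS, Python/R versions, package versions).
```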
check.py: I had some difficulty understanding what this file is used for. It would be helpful to add a comment or docstring at the beginning explaining the usage and purpose of this file (see the sketch below).
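For example, a module-level docstring along these lines would make the intent clear at a glance (the description and usage shown here are hypothetical, since check.py's actual purpose isn't documented):

```python
"""check.py

Hypothetical description: validate the downloaded data before the rest of
the analysis pipeline runs (column names, types, and missing values).

Hypothetical usage:
    python src/check.py <path-to-data>
"""
```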
Result folder: Some of the results are named in simple numerical order (plot1, plot2, plot3...) while others have descriptive file names (conf_mat, cv_plot...). This inconsistency is a little confusing to non-authors. I would recommend choosing one of the two styles and renaming the files accordingly.
Report: In the section "Variable Data types and Modifications", you show a histogram of the frequency distributions of the chosen variables and a matrix of correlations between features. While someone with a data analysis background will certainly and immediately understand the purpose of these two analyses, I would recommend giving a brief explanation of why they were included (before you display them) to make your report comprehensible to a wider audience.
Report: In "Frequency Distributions of the chosen variables", you add a note stating that "correlations between categorical features should be ignored as these are invalid". While the note is helpful, not displaying these correlations at all would make your report clearer and the interpretation easier to understand.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
While exploring the project, I came across a file called check.py. I had some problems understanding what the file is doing. I would suggest adding some documentation or comments to help an external reader understand the file better.
There were a few places where .ipynb_checkpoints folders were present. Although I don't think it matters much, I would suggest you either hide this folder using .gitignore or remove it, since it causes unnecessary repetition of the ipynb report (see the snippet below).
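A minimal way to do both (the commands assume the checkpoint folders sit at the paths git reports; adjust the paths as needed):

```
# Add this line to .gitignore so checkpoint copies are never committed again:
.ipynb_checkpoints/

# Then remove already-committed checkpoint folders from version control:
git rm -r --cached .ipynb_checkpoints
git commit -m "Stop tracking Jupyter checkpoint folders"
```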
I was quite impressed by the effort this team put into using branches. It showed that the team followed the course/project guidelines thoroughly. However, I would suggest deleting branches that are bug-free and have already been merged into main (see the commands below); it would be very time consuming for a reader who wanted to explore all the branches.
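For reference, merged branches can be cleaned up with standard git commands (the branch name below is hypothetical):

```
# Delete a merged branch locally
git branch -d old-feature-branch

# Delete the same branch on GitHub
git push origin --delete old-feature-branch
```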
There were a few plots that were described as plot1, plot2, etc., while some plots had descriptive titles/names. This is not a consistent approach, and I would suggest you follow one of the two conventions to help external readers understand your project better.
Overall, great job! You guys have adhered to the guidelines and created a very well structured project. I feel the suggestions are just minor changes to the repository and can be made quickly. I also liked that you used R Markdown to render your report, since it allows the results of R code to be inserted directly into formatted documents.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
I think this is a good project that stands well on the computational side. I found the functions and testing to be well developed and carefully thought out. The project is well structured, and the idea behind it was solid from the beginning. It seems like a project where everyone worked fluidly, which led to a result that does not read as a "glue" of separate parts. The project is easy to deploy thanks to the well-written README file, and the Makefile ran without errors. The conclusions are solid. Overall, I think the observations mentioned above would help improve this into a really good research paper (if that were the goal).
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Submitting authors: @nkoda @mahdiheydar @izk20 @harrysyz99
Repository: https://github.com/DSCI-310/DSCI-310-Group-10
Abstract/executive summary: A KNN classification model was applied to 2017 Canadian census data to predict whether an individual made money on their investments (true class) or broke even or lost money (false class), using their family size and whether they are the major income earner in their family as features.
All investments carry risk, so the rationale for this analysis was to gain insight into whether the pressures of being the main income earner in a family and having a larger family size have an influence on someone's investment outcomes. This could then be used to further analyze the risks associated with specific investments.
The KNN model was tuned on the number-of-neighbors hyperparameter; a value of k = 26 was used, yielding approximately 57% accuracy. Therefore, the model did not perform much better than a dummy classifier. The KNN classification model was not able to distinguish between individuals in the same family size group, unlike the pattern found in the actual data.
It is important to build other models, such as a support vector machine (SVM), or to carry out feature engineering or add other features that may serve as better predictors, in order to gain more solid results. This would enhance the investigation of the original research question of how family size, and whether an individual is the major income earner in their family, can be used to predict investment outcomes.
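For readers unfamiliar with the approach described above, here is a minimal sketch of the tune-then-compare-to-baseline workflow, assuming scikit-learn (the synthetic data and column meanings are placeholders, not the census data or exact code used in the project):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder features standing in for family size and
# major-income-earner status; the real project uses 2017 census data.
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(1, 8, 500), rng.integers(0, 2, 500)])
y = rng.integers(0, 2, 500)  # 1 = made money on investments, 0 = did not

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tune the number of neighbors by cross-validation
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": range(1, 51)},
    cv=5,
)
grid.fit(X_train, y_train)

# Compare the tuned model against a dummy baseline
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("best k:", grid.best_params_["n_neighbors"])
print("KNN accuracy:", grid.score(X_test, y_test))
print("baseline accuracy:", baseline.score(X_test, y_test))
```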
Editor: @ttimbers
Reviewer: @YellowPrawn @ClaudioETC @isabelalucas @Jaskaran1116