Submission: 3: Analyzing Education's Effect on Capital Gains

Submitting authors: @rlaze @YellowPrawn @alexkhadr

Repository: https://github.com/DSCI-310/DSCI-310-Group-3

Abstract/executive summary: This project uses 1994 US census data to investigate the relationship between higher education and hours worked. We build a predictive model that estimates the number of hours worked per week from the number of years of education.

Editor: @ttimbers

Reviewer: @danielhou13 @overcast-day @ZiyueChloeZhang @lizhe918

[ ] I agree to abide by DSCI 310's Code of Conduct during the review process and in maintaining my package should it be accepted.

Data analysis review checklist

Reviewer: danielhou13

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[ ] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[ ] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[ ] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[ ] Conclusions: Are the conclusions presented by the authors correct?
[ ] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: ~60 minutes

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

Functionality documentation: Each script file is very straightforward, which lets the reader easily interpret what each script does. The same is true of the functions, which are all well documented. The only thing that would make the documentation stronger, in my opinion, is a reference to NumPy's polyfit function.
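For readers who have not used polyfit, here is a minimal sketch of what such a reference would describe. The data is synthetic and purely illustrative (it is not taken from the census dataset), and the degree-1 fit and the names years_of_education, hours_per_week and predict_hours are my assumptions, not necessarily what the authors' scripts use.

```python
import numpy as np

# Synthetic, illustrative data only (NOT from the census dataset).
years_of_education = np.array([8.0, 10.0, 12.0, 14.0, 16.0])
hours_per_week = np.array([36.0, 38.0, 40.0, 42.0, 44.0])

# Fit a degree-1 polynomial (a straight line). np.polyfit returns the
# coefficients ordered from the highest power to the lowest.
slope, intercept = np.polyfit(years_of_education, hours_per_week, deg=1)


def predict_hours(years):
    """Predict hours worked per week from years of education (hypothetical helper)."""
    return slope * years + intercept
```

A sentence in the docstring noting the coefficient order (highest power first) would probably be enough, since that is the detail most likely to trip up a reader.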
Automation: There are some issues with reproducing the analysis. When using the Docker image, running the Makefile fails, so I can't run the analysis using the procedure listed in README.md. I get the following error, so I have to run each script in the Makefile manually.

(base) jovyan@5ffe8a43bc2b:~$ cd work
(base) jovyan@5ffe8a43bc2b:~/work$ make clean
bash: make: command not found
(base) jovyan@5ffe8a43bc2b:~/work$

Also, the lack of a data folder in the repository means we have to create one manually (granted, that isn't too hard to do) before we can run the script.
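One possible fix is to have the script that writes the data create the folder itself. A minimal sketch, assuming the scripts are Python and the folder is named "data" relative to the working directory (ensure_data_dir is a hypothetical helper, not something in the repository):

```python
from pathlib import Path


def ensure_data_dir(path="data"):
    """Create the data folder if it does not exist yet and return it.

    The default folder name "data" is an assumption about the project layout.
    """
    data_dir = Path(path)
    # parents=True creates intermediate folders; exist_ok=True makes the
    # call safe to repeat when the folder is already there.
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir
```

Calling this at the top of the download script would remove the manual step without affecting anyone who already has the folder.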

There is also a missing dependency in the Docker image: Jupyter Book (jbook) is not installed in the Dockerfile. This can easily be fixed with an extra install step in the Dockerfile.
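To catch both the missing make and the missing Jupyter Book before the pipeline starts, a small preflight check could be run inside the container. This is only a sketch: the tool names "make" and "jupyter-book" are my assumptions about which executables the analysis needs, and missing_tools is a hypothetical helper.

```python
import shutil


def missing_tools(tools=("make", "jupyter-book")):
    """Return the names of command-line tools that are not found on PATH.

    The default tool names are assumptions about what the analysis requires.
    """
    # shutil.which returns None when no executable with that name is on PATH.
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

Any tool this reports as missing would then need an install step in the Dockerfile.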

Conclusions: One point in your discussion says that "Looking at the graphs for Canadians (Fig. 4) and Americans (Fig. 5) only, we see that there is actually a negative correlation between hours worked and education level." This is a really good point to discuss because it is unexpected compared to the rest of the discussion, but be careful about the wording, because it only holds for the Canadian figure.

Figure 5 shows a positive correlation between hours worked and level of education.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist