DSCI-310 / data-analysis-review-2021

0 stars 1 forks source link

Submission: 3: Analyzing Education's Effect on Capital Gains #3

Open ttimbers opened 2 years ago

ttimbers commented 2 years ago

Submitting authors: @rlaze @YellowPrawn @alexkhadr

Repository: https://github.com/DSCI-310/DSCI-310-Group-3

Abstract/executive summary: This project looks into US 1994 census data to investigate the effects of higher education, and hours worked. We come up with a predictive model that estimates the amount of hours worked per week based on the amount of years of education.

Editor: @ttimbers

Reviewer: @danielhou13 @overcast-day @ZiyueChloeZhang @lizhe918

danielhou13 commented 2 years ago

Data analysis review checklist

Reviewer: danielhou13

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: ~60 minutes

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. Functionality documentation: Each script file is very straightforward, which lets the reader easily interpret what each script does. This is also the case for each function as they are all well documented. The only thing that would make the documentation stronger, in my opinion, is to add a reference to the polyfit function for NumPy.

  2. Automation: There are some issues with reproducing the analysis. When using the docker image, there is an error with running the makefile. I can't properly run the analysis using the procedure listed readme.md. I get the following error so I have to manually run each script in the makefile.

(base) jovyan@5ffe8a43bc2b:~$ cd work
(base) jovyan@5ffe8a43bc2b:~/work$ make clean
bash: make: command not found
(base) jovyan@5ffe8a43bc2b:~/work$ 

Also, the lack of a data folder in the repository means we have to create one manually (granted that isn't too hard to do) to run the script. image

However, there is also a missing dependency for the docker image. Jbook is not in the dockerfile, but can be easily fixed with an extra import in the dockerfile image

  1. Conclusions:: One point in your discussion says that "Looking at the graphs for Canadians (Fig. 4) and Americans (Fig. 5) only, we see that there is actually a negative correlation between hours worked and education level. This is a really good point to talk about because it seems to be an unexpected point compared to the rest of the discussion, but be careful about the wording because it's only the case for the Canadian figure.

image

Figure 5, shows a positive correlation between hours worked and level of education.

image

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

overcast-day commented 2 years ago

Data analysis review checklist

Reviewer: overcast-day

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 90 minutes

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. File structure: missing folder for storing data and script results, I suggest calling them something like ‘data’ and ‘results’ to store all the raw and processed data, as well as the graphs and tables produced by your scripts. Due to the missing folders, I’m also unable to run the scripts which use ‘data/” as a filepath. In case I’m looking at the incorrect version of the project: I visited your project via this link: https://github.com/DSCI-310/DSCI-310-Group-3.
  2. The dependencies are clearly stated along with the versions used. However, there’s some discrepancies between your dockerfile and the listed dependencies on the README. For example, ‘pytest’ is listed in the dockerfile but not included in the readme documentation.
  3. I suggest adding some more details to your CONTRIBUTING.md by elaborating more on how to report bugs or contribute such as how the issue should be formatted and potentially other ways someone could contribute to the project.
  4. All the tests pass when I use ‘pytest tests’ via the command line. Nevertheless, to make the tests more robust, I suggest adding a test to see if the input is of the correct type. For example, for the ‘test_filter.py’ I think it would be helpful to test if the input is a dataframe or not.
  5. The report is brief, but insightful and aims to analyze a topic that, for many who have aimed to get a degree or certification, is something that is extremely important. It was a nice read! That’s why there’s all the more reason to include the names of the authors into the report!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

lizhe918 commented 2 years ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 60 minutes

Review Comments:

In general, the work is impressive with visible effort spent on this project. The function documentations are remarkable, and the script arrangement is well organized. However, there are still some aspects that need improvement:

  1. Please consider provide a method to contact you. Currently, there is no clear way of contacting you for collaboration and contribution. I know this was not clearly required in the project description of the previous milestones, but I think it is nice to have. Very unfortunately, GitHub does not have a chat feature which trivializes the communication between developers.

  2. You did not specify the development tools you used for this project. As your classmate, I personally know that you used Docker, JupyterLab, and maybe VS Code. However, they are not obvious to the people outside of this course. More importantly, you did not specify the python version for this project in obvious place, and that is fatal. I understand that your docker has the environment, but the detailed dependency list is still highly recommended.

  3. In your report, you did not discuss your approach to the problem you raised, and also did not explain why you choose the machine learning model used in the project. You may want to briefly justify your choice of model. Also, I want to comment on the readability of your report. Your report still contains code, and some images are not loaded when I view the report on GitHub. You may want to export a pdf file in addition to the ipynb file you currently have so that others can conveniently see your work without the need to go through the entire configuration process to run your proejct.

Overall, your work is great!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

ZiyueChloeZhang commented 2 years ago

Data analysis review checklist

Reviewer: @ZiyueChloeZhang

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

  1. The README.md provides clear and thorough instructions for running the project. However, it could be better if it is formatted more nicely. For example, using tables to present the dependencies, using in-line code block for the path of the final report, adding headers to differentiate the steps to run the project and the test, adding line breaks to make the license information stand out.

  2. The report is nicely written and rendered. I suggest the following minor changes:

    • fixing grammar mistakes. For example, the report is missing periods at the end of the second and third paragraph of the introduction.
    • adding labels for x and y axis of the plots with units.
    • fixing the header. Summary, Introduction, etc. should not be subheaders under Authours.
  3. The citation in the report is done in a clean manner. It is a good idea to reference the programming language and dependencies used for this analysis in the report.

  4. Team-Work-Contract.pdf should not be included.

  5. The /data directory contains raw data, processed data and plots generated by the scripts. It is better to make sub directories like /data/raw and /data/processed to organize the data, and a /results directory to store the plots.

  6. The following error occurred when I run the make file on the docker container:

    OSerror:"Cannot save file into a non-existent dicectory: data" 

    I solved this by manually add a \data directory. I would suggest debugging a little bit in the makefile.

  7. There is no analysis part of the results, a non-expert would be confused if there is no explanation of the result. I'd also recommend including further explanations in the final report to make it more understandable. If you could explain a little more in the result section, I think it would be easier for the audience that isn't an expert in Data Science and Statistics to grasp what you've done.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.