Submission: Group 3: Predicting Canada's Community Well-Being Index Scores

Submitting authors: Shawn Li, Selena Shew, Sri Chaitanya Bonthula, & Lesley Mai

Repository: https://github.com/DSCI-310-2024/DSCI_310_Milestone_1_Group_3/releases/tag/3.00

Abstract/executive summary:

In Canada, the Community Well-Being Index, or CWB for short, is a way for the Canadian government to assess and quantify the the socio-economic well-being in various communities. Each community is assessed in how well they are doing across four different fields: labour force activity, income, housing, and education. Scores are assigned to each of those categories, and are then used to calculate an overall index score to summarize how well the community is doing across all four areas. For more information, please go here: https://open.canada.ca/data/en/dataset/56578f58-a775-44ea-9cc5-9bf7c78410e6.

The CWB index scores are updated every year. While the data is publicly available on Canada's Open Governmental Data Portal, the actual analysis and model used to make these scores have not been released. Therefore, our group aims to create a linear regression model that can be used to predict the CWB scores, using values from the four categories as predictors.

Editor: @ttimbers

Reviewer: Marcela Flaherty Chuxuan Zhou Pragya Singhal Pahul Brar

[ ] I agree to abide by DSCI 310's Code of Conduct during the review process.

Data analysis review checklist

Reviewer:

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[ ] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[ ] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[ ] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelides: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robsutness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 3.5

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

From looking through your readme, I was not able to find any specific usage examples. While it was nice that you told the user to install Jupyter Lab, you should add a step-by-step example, so a user who is not that good at command line usage to be able to install everything that is needed and set up the analysis. For example, rather than just leaving your dependencies in the environment.yml file, you should mention each dependency along with the version. I noticed that in the yml file, the versions are missing, which would make it hard to reproduce your results sometime in the future.

Also, there are no clear guidelines set up pretraining to third parties that wish to add to your repository, which is crucial as its important to let the third parties know their bounds and how they should operate in your project, if at all.

In terms of the analysis, everything looks good, I was able to run Make All and able to see your report. The only thing missing is proper documentation, I have noticed that some of your files are well-documented, and some are very poorly documented. For example, the data_analysis.R file, has so much code that is not being explained well at all. This carries over in most of your files, where it could be better documented to make it easier for an outsider to understand what's happening.

Another recommendation I would make is more regarding the data analysis itself. I was looking at your plots, particularly the Index of Different Variables of Inuit Community Plot, and to me this plot doesn't seem to be the best fit for the job. You are looking at the counts of 4 variables, and I would recommend using a histogram instead. I also would recommend choosing an appropriate colour palate, as currently, it's really hard to see the Labour Force Variable. Also, you should resize your correlation plot so its easier to see the correlation scores.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: marcesf

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[ ] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelides: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robsutness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2

Review Comments:

Your project is good and it was an interesting topic. I've never heard of CWB before, so I thought your project was very informative. Beginning with documentation, something that stood out to me in the readme was that usage code such as "conda env create --file environment.yml" was not inputted as a code block. I think it would help with readability, and consistency as for example in your contributing file, there are separate code chunks. The readme was lacking some usage information, such as details on how to run tests.

Your qmd report includes authors but you should also have that in your ipynb. The report is well written and has a good structure that is easy to follow. But, I think that the results and conclusion could be expanded upon further. For example, instead of just listing 2 future exploration questions, as a reader I would have prefered a stronger explanation and discussion. With regards to results, there is a clear communication through tables and figures, and the expression of the model is clearly written but there is a lack of written discussion of the model performance.

Overall, I thought that the code quality was quite good and there are many code comments describing what is being done. That being said, it is notable that there is a lack of consistency with these comments and even comment style. Some are very vague. For example test_mean_values_function.R has a very strong explanation for each step, but test_lineplot_function.R does not. I thought that the tests were robust. fetch_data.r and its test should be changed to .R for consistency. Every function has an R file ending with _function except for fetch data as well.

With regard to the scripts & functions, there might be some reason for this, but I don't understand why only the function lineplot_function.R is implemented into the scripts.

Thank you, and I really enjoyed looking through your project!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: pragszz

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[ ] Installation instructions: Is there a clearly stated list of dependencies?
[ ] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x ] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[ ] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5

Review Comments:

The project has been well implemented, however, there can be improvements by adding some details required for a completely reproducible and trustworthy workflow. To begin with, I found the research question especially unique as it assigns an overall index score based on various factors for the Canadian community, which provides a more personalized score for users in this region carrying out the analysis. As for the GitHub repository, it is organized and clear to navigate for others who are not familiar with the format of a GitHub repository.

While they mention the list of dependancies being in the environment.yml file, it could be helpful to clearly state them in the README file so as to not have to open the file and in turn be more clear for others viewing the GitHub repository. The usage section includes instructions to run the analysis using docker, however, it would be a good idea to include instructions to do the same without using docker as well. This can be seen in this example project repository provided by the professor: https://github.com/DSCI-310-2024/DSCI_310_Milestone_1_Group_3/tree/main. The written tests also ensure reliability by including incorrect and correct cases, however, there could be more examples for the same to ensure that the function is also robust.

Moving on to the report, while it has been mentioned what the analysis intends to explore, it would be helpful to clearly state the research question in the python file as well, so that other users and collaborators are aware of what exactly is being addressed. Detailed background and importance for the analysis has been provided, potentially engaging readers, collaborators and stakeholders and ensuring a clear understanding of the project's context and objectives. The method being used in the analysis is clear after reading through the report, however, it is tricky to distinguish the methods and results section from the EDA and visualization section making it harder to follow along with the content in the report.

Overall, the project follows most of the requirements of a reproducible and trustworthy workflow, however, some key elements are missing, which if implemented, can elevate and improve the project.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer:

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[ ] Installation instructions: Is there a clearly stated list of dependencies?
[ ] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelides: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robsutness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[ ] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 3

Review Comments:

The project provides a comprehensive analysis of Canada's Community Well-Being Index Scores, offering valuable insights into the socio-economic well-being of Canadian communities.

When talking about areas of excellence, I want to say two things. The first is that the methodology is robust, employing linear regression to predict CWB scores effectively. The choice of predictors and the analytical approach are well justified. The second is that the analysis report is thoroughly documented, with clear communication of results through comprehensive tables and figures. The narrative is engaging and informative.

When talking about suggestions for improvement. The third thing is about usage examples. Enhance the README documentation by providing more detailed step-by-step usage examples, especially for users unfamiliar with command-line operations. Including version numbers for dependencies can further aid in ensuring reproducibility.The fourth thing is about the community guidelines. Elaborate on the community guidelines to offer clear instructions for external contributors on how they can participate in the project. This will foster a more collaborative and open project environment. The fifth thing is about documentation consistency. Strive for consistency in documentation across all scripts and functions. Some files are excellently documented, while others could benefit from more detailed explanations to improve understanding for external collaborators.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

DSCI-310-2024 / data-analysis-review-2024