artanzand opened 2 years ago
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
Good job, friends! I really like this project. This is the kind of data science project that makes the world better in a little way :) Your repo is well put together and easy to follow, and a pleasure to review. Here are my observations and suggestions:
Summary:
I would love to see it expanded a little. When you jump into discussing how the model underperformed at classifying certain funding sizes, I don't have much in the way of a frame of reference. It would be great to have more of an overview, including which categories we are looking at, to be able to process this information.
It is curious that there is no difference in funding year over year.
Love the pipeline analysis! I would be curious to see some of the other ensemble models we learned recently. Ensembles seem like the way to go since we have a small number of features.
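To illustrate the suggestion about trying other ensembles: a minimal sketch of comparing the existing random forest against another scikit-learn ensemble via cross-validation. The data here is a synthetic stand-in, since the project's real feature pipeline isn't shown in this thread.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy multi-class data standing in for the funding features (assumption).
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=123)

# Compare the current model against one more ensemble, on equal footing.
results = {}
for model in [RandomForestClassifier(random_state=123),
              GradientBoostingClassifier(random_state=123)]:
    scores = cross_val_score(model, X, y, cv=5)
    results[type(model).__name__] = scores.mean()
print(results)
```

The same loop could be extended with any other ensemble covered in class; the point is to keep the comparison inside cross-validation rather than on a single split.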
Intro:
Minor quibbles:
References could be formatted a little more nicely.
The paragraph explaining how you selected the categories in Methods -> Data could use a little polishing. I don't think you are using the labels you mention in it anymore.
Review highlighting: there are many places where the highlighting extends beyond the specific segments.
Modeling:
Have you considered tuning the alpha hyperparameter?
You got a higher test score than training score? Interesting! Maybe comment on that?
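On tuning alpha: assuming the reference is to the smoothing hyperparameter of the Naive Bayes model, a minimal sketch of folding it into a grid search over the text pipeline. The mini-corpus and labels below are invented for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus standing in for the project's text features.
docs = ["music festival grant", "museum heritage exhibit",
        "community dance program", "heritage museum funding",
        "dance festival community", "music program grant"] * 5
labels = [0, 1, 2, 1, 2, 0] * 5

pipe = make_pipeline(CountVectorizer(), MultinomialNB())
# `multinomialnb__alpha` is the Naive Bayes smoothing hyperparameter.
grid = GridSearchCV(pipe, {"multinomialnb__alpha": [0.01, 0.1, 1.0, 10.0]}, cv=3)
grid.fit(docs, labels)
print(grid.best_params_)
```

The same grid dictionary can be merged with the ranges already tuned for the random forest pipeline, so alpha is searched jointly rather than in isolation.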
Scripting:
I can't seem to activate the environment
Your processed data folder isn't empty in the repo
The model selection script outputs a lot of warnings to the console: are they useful, or should they be suppressed?
All scripts ran! Beautifully done :)
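If the warnings turn out to be noise, one common pattern is to silence only the known-noisy step so that genuinely new warnings still surface. A minimal sketch (the `noisy_step` function is a placeholder for whatever model-fitting call emits the warnings):

```python
import warnings

def noisy_step():
    # Placeholder for a fit/transform call that emits repetitive warnings.
    warnings.warn("solver did not converge", UserWarning)
    return "done"

# Suppress only within this block, and only the targeted category,
# so unrelated warnings elsewhere in the script still reach the console.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    result = noisy_step()
print(result)
```

Scoping the filter with `catch_warnings()` avoids globally muting warnings for the rest of the script, which would hide real problems.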
This was derived from the JOSE review checklist and the ROpenSci review checklist.
This is such an interesting dataset and a thoughtful question! Overall, the project is well structured and well executed. I really enjoyed learning about Canadian Heritage funding :) Also, I really appreciate the M-1 instructions; my group ran into this issue as well, and I can see it being frustrating for interested contributors.
A few suggestions:
1. Code of Conduct: "We can be reached by contacting UBC MDS program" is too vague. Someone who wants to reach out might be confused about whom to contact, especially since UBC and MDS are so large (and cohorts change every year!). Explicitly add an email address as a point of contact to eliminate any confusion.
2. Conda environment: the environment should be renamed to reflect the project name. "environment" is too vague and could be confusing in the long run.
3. Analysis: "Therefore, we divided the values into five categories: less than $8k, $8k stands for funding size in the range of 8k to 10k in CAD, $12 stands for funding size in the range of 12k to 23k in CAD"
4. Analysis: "Therefore, we selected random forest as the best performing model to conduct hyperparameter optimization and tuned max_features, max_depth" and "class_weight as well as the maxfeatures argument in CountVectorizer()."
5. Analysis: in general, I think another figure could be added to the "Results and Discussion" section. Perhaps a confusion matrix would be an interesting visualization to add.
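A confusion matrix is cheap to produce from the test predictions. A minimal sketch with invented labels and predictions (the real report would pass `y_test` and the tuned model's output, and the category names below are illustrative):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true vs. predicted funding-size categories (toy data).
y_true = ["<8K", "8-10K", "12-23K", "23-50K", "12-23K", "<8K"]
y_pred = ["<8K", "8-10K", "23-50K", "23-50K", "12-23K", "<8K"]

cats = ["<8K", "8-10K", "12-23K", "23-50K"]
cm = confusion_matrix(y_true, y_pred, labels=cats)
print(cm)
# With matplotlib installed, ConfusionMatrixDisplay(cm, display_labels=cats).plot()
# would render this as a figure for the report.
```

Rows are true categories and columns are predictions, so the off-diagonal cells would make the reported weakness on the $12-23K and $23-50K ranges immediately visible.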
Hi Jasmine,
Thank you for your constructive feedback! There are many points that we will definitely be addressing. One comment on your review:
It is typical for projects to call their environment file environment.yaml, as the name is self-explanatory. Inside that yaml file, however, you can find the actual name of the environment, which is the case for our file. To activate our environment, type conda activate Cdn_heritage_funding.
Reviewer: Tianwei Wang
Overall, it's a great project. Good job, Group 14! I like the idea of the heritage funding size prediction problem.
Some suggestions:
Thank you friends! It's a great opportunity to learn from your project.
We have fixed the typos involving CountVectorizer and RandomForestClassifier in the original report.
Submitting authors: @artanzand @aimee0317 @jo4356 @xiangwxt
Repository: https://github.com/UBC-MDS/canadian_heritage_funding
Report link: https://htmlpreview.github.io/?https://github.com/UBC-MDS/canadian_heritage_funding/blob/main/doc/canadian_heritage_funding_report.html
Abstract/executive summary:
We attempt to build a multi-class classification model that uses features not indicative of artistic merit, such as location, audience, and discipline, to predict the funding size granted by the Canadian Heritage Fund (The Fund). We initially tried four popular classification algorithms: logistic regression, Naive Bayes, C-Support Vector Classification (SVC), and Random Forest, and used DummyClassifier as a base case. We then selected Random Forest as the best algorithm for our question based on each model's cross-validation scores and conducted hyperparameter optimization on it. Our model performs reasonably well compared to the DummyClassifier base case, with a macro-average F1 score of 0.68 and a weighted-average F1 score of 0.68. However, we also observed that the model performs worse at classifying funding sizes in the ranges of $12-23K and $23-50K compared to other ranges. Thus, we suggest further study to improve this classification model.
Editor: @artanzand @aimee0317 @jo4356 @xiangwxt
Reviewers: Rudyak_Rada, WANG_Tianwei, ORTEGA_Jasmine