UBC-MDS / data-analysis-review-2021


Submission: GROUP 14 : Canadian Heritage Funding #22

Open artanzand opened 2 years ago

artanzand commented 2 years ago

Submitting authors: @artanzand @aimee0317 @jo4356 @xiangwxt

Repository: https://github.com/UBC-MDS/canadian_heritage_funding Report link: https://htmlpreview.github.io/?https://github.com/UBC-MDS/canadian_heritage_funding/blob/main/doc/canadian_heritage_funding_report.html

Abstract/executive summary:

We attempt to build a multi-class classification model that uses features not indicative of artistic merit, such as location, audience, and discipline, to predict the funding size granted by the Canadian Heritage Fund (The Fund). We initially used four popular classification algorithms: logistic regression, Naive Bayes, C-Support Vector Classification (SVC), and Random Forest, with a DummyClassifier as a baseline. We then selected Random Forest as the best algorithm for our question based on each model's cross-validation scores, and conducted hyperparameter optimization on the Random Forest model. Our model performs reasonably well compared to the DummyClassifier baseline, with a macro-average F1 score of 0.68 and a weighted-average F1 score of 0.68. However, we also observed that the model performs worse at classifying funding sizes in the $12-23K and $23-50K ranges than at classifying other ranges. Thus, we suggest further study to improve this classification model.
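The model-selection step described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the data is synthetic, GaussianNB stands in for whichever Naive Bayes variant was used, and all settings are placeholders.

```python
# Hypothetical sketch: compare four classifiers against a DummyClassifier
# baseline by mean cross-validated macro F1 (synthetic placeholder data).
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # placeholder features
y = rng.integers(0, 5, size=200)       # five funding-size classes

models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "SVC": SVC(),
    "random forest": RandomForestClassifier(random_state=0),
}

# The model with the best cross-validation score goes on to tuning.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1_macro").mean()
    for name, model in models.items()
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```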

Editor: @artanzand @aimee0317 @jo4356 @xiangwxt
Reviewer: Rudyak_Rada, WANG_Tianwei, ORTEGA_Jasmine

Radascript commented 2 years ago

Data analysis review checklist

Reviewer: Radascript

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

Good job, friends! I really like this project. This is the kind of data science project that makes the world better in a little way :) Your repo is well put together and easy to follow. A pleasure to review. Here are my observations and suggestions:

Summary:

Intro:

Minor quip:

Modeling:

Scripting:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

jasmineortega commented 2 years ago

Data analysis review checklist

Reviewer: jasmineortega

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

This is such an interesting dataset and a thoughtful question! Overall, the project is well structured and well executed. I really enjoyed learning about Canadian Heritage funding :) Also, I really appreciate the M-1 instructions ~ my group ran into this issue as well, and I can see it being frustrating for interested contributors.

A few suggestions:

  1. Code of Conduct: "We can be reached by contacting UBC MDS program" is too vague. Someone who wants to reach out might be confused about whom to contact, especially since the UBC MDS program is so large (and cohorts change every year!). Explicitly add an email address as a point of contact to eliminate any confusion.

  2. Conda environment: The environment should be renamed to reflect the project. The name "environment" is too vague and could be confusing in the long run.

    • Along with Rada's comment, I also could not get the environment to activate! I get the error "Could not find conda environment: environment". It's not listed in my environments even though I downloaded the yaml and created the environment.
  3. Analysis: "Therefore, we divided the values into five categories: less than $8k, $8k stands for funding size in the range of 8k to 10k in CAD, $12 stands for funding size in the range of 12k to 23k in CAD"

    • There is some inconsistency between the figure legend and the categories stated. In the legend, all the categories are a range, not a single value.
  4. Analysis: "Therefore, we selected random forest as the best performing model to conduct hyperparameter optimization and tuned max_features, max_depth" and "class_weight as well as the maxfeatures argument in CountVectorizer()."

    • In the "data" and "results" section of this document, it's mentioned that there is some class imbalance. As a reader, I'm curious what the class_weight hyperparameter was set to! I think it would be insightful to state the final hyperparameters used by random forest, and possibly even the range of values tested for these hyperparameters.

  5. Analysis: In general, I think another figure could be added to the "Results and Discussion" section. Perhaps a confusion matrix would be an interesting visualization to add.

  6. This is really, really minor, but I initially found the phrase "features non-indicative of artistic merit, such as location, audience, and discipline" a little hard to decipher. I feel it would be clearer to simply state which features you're using, and perhaps mention in the analysis that the features are interesting because they are not indicative of artistic merit. Again, super minor, but the phrase came up a couple of times, so I thought I'd mention it!
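To act on the class_weight and confusion-matrix suggestions above, something like the following sketch could back the report. Everything here is an illustrative placeholder: synthetic data, a made-up parameter grid, and generic search settings rather than the group's actual values.

```python
# Illustrative sketch: report the hyperparameters chosen by a randomized
# search over class_weight / max_depth / max_features, then compute a
# confusion matrix to expose per-class performance (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 6))
y = rng.integers(0, 5, size=300)       # five funding-size classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_dist = {                          # placeholder ranges, not the group's
    "max_depth": [5, 10, 20, None],
    "max_features": ["sqrt", "log2", None],
    "class_weight": ["balanced", "balanced_subsample", None],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=10,
    cv=3,
    scoring="f1_macro",
    random_state=42,
)
search.fit(X_train, y_train)

# Stating best_params_ in the report answers "what was class_weight set to?"
print(search.best_params_)
cm = confusion_matrix(y_test, search.predict(X_test), labels=[0, 1, 2, 3, 4])
print(cm)  # rows = true class, columns = predicted class
```

A heat-mapped version of `cm` would make a natural extra figure for the "Results and Discussion" section.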

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

artanzand commented 2 years ago

Hi Jasmine, thank you for your constructive feedback! There are many points that we will definitely be adding. One comment on your review: it is typical for projects to name their environment file environment.yaml, as that name is self-explanatory. Inside that yaml file, however, you can find the actual name of the environment, which is the case for our file. To activate our environment you should type conda activate Cdn_heritage_funding.

Davidwang11 commented 2 years ago

Reviewer: Tianwei Wang

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2

Review Comments:

Overall, it's a great project; good job, Group 14. I like the idea of the heritage funding size prediction problem.

Some suggestions:

  1. How did you determine the thresholds for the five funding-size categories? Maybe you could do some exploration and discuss it in the report in more detail.
  2. Maybe you could add a figure showing the distribution of funding size.
  3. It would be better to provide training scores for these models as well, because I wonder whether they have an overfitting or underfitting problem.
  4. I think it would be interesting if you could carry out feature selection to reduce the number of features instead of just dropping some. For example, Recursive Feature Elimination could be a good start.
  5. As a reader, I would appreciate it if you could add default values for the input variables in the scripts.
  6. If you can discuss feature importance, it will be more helpful for readers who want to apply to the Fund.
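The feature-selection and feature-importance suggestions above could be prototyped with scikit-learn's built-in tools. A minimal sketch, assuming synthetic data and made-up feature names (the real project's columns would differ):

```python
# Sketch: Recursive Feature Elimination to pick a feature subset, plus
# impurity-based feature importances from a fitted forest (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(1)
feature_names = [f"feature_{i}" for i in range(8)]   # placeholder names
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 3] > 0).astype(int)              # only two features matter

# RFE repeatedly drops the least important feature until 4 remain.
forest = RandomForestClassifier(random_state=1)
rfe = RFE(forest, n_features_to_select=4).fit(X, y)
kept = [name for name, keep in zip(feature_names, rfe.support_) if keep]
print("kept by RFE:", kept)

# Feature importances from a forest fit on all features; a ranked list like
# this could help readers who want to apply to the Fund.
forest.fit(X, y)
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: -pair[1],
)
print("top features:", ranked[:2])
```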

Thank you, friends! It was a great opportunity to learn from your project.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

xiangwxt commented 2 years ago

[Responses to peer review and teaching staff review comments]

Summary of Addressed Feedback

1. Updated variable names in the final report