artanzand opened 2 years ago
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
Good job, friends! I really like this project. This is the kind of data science project that makes the world better in a little way :) Your repo is well put together and easy to follow, and a pleasure to review. Here are my observations and suggestions:
Summary:
I would love to see it expanded a little. When you jump into discussing how the model underperformed at classifying certain funding sizes, I don't have much in the way of a frame of reference. It would be great to have more of an overview, including which categories we are looking at, to be able to process this information.
It is curious that there is no difference in funding year over year.
Love the pipeline analysis! I would be curious to see some of the other ensemble models we learned recently. Ensembles seem like the way to go since we have a small number of features.
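To illustrate the suggestion about trying other ensembles: a minimal sketch of comparing the existing random forest against another scikit-learn ensemble via cross-validation. The data here is a synthetic stand-in, since the project's real feature pipeline isn't shown in this thread.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy multi-class data standing in for the funding features (assumption).
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=123)

# Compare the current model against one more ensemble, on equal footing.
results = {}
for model in [RandomForestClassifier(random_state=123),
              GradientBoostingClassifier(random_state=123)]:
    scores = cross_val_score(model, X, y, cv=5)
    results[type(model).__name__] = scores.mean()
print(results)
```

The same loop could be extended with any other ensemble covered in class; the point is to keep the comparison inside cross-validation rather than on a single split.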
Intro:
Minor quibbles:
References could be formatted a little more nicely.
The paragraph explaining how you selected the categories in Methods -> Data could use a little polishing. I don't think you are using the labels you mention in it anymore.
Review highlighting: there are many places where the highlighting extends beyond the specific segments.
Modeling:
Have you considered tuning the alpha hyperparameter?
You got a higher test score than training score? Interesting! Maybe comment on that?
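On tuning alpha: assuming the reference is to the smoothing hyperparameter of the Naive Bayes model, a minimal sketch of folding it into a grid search over the text pipeline. The mini-corpus and labels below are invented for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus standing in for the project's text features.
docs = ["music festival grant", "museum heritage exhibit",
        "community dance program", "heritage museum funding",
        "dance festival community", "music program grant"] * 5
labels = [0, 1, 2, 1, 2, 0] * 5

pipe = make_pipeline(CountVectorizer(), MultinomialNB())
# `multinomialnb__alpha` is the Naive Bayes smoothing hyperparameter.
grid = GridSearchCV(pipe, {"multinomialnb__alpha": [0.01, 0.1, 1.0, 10.0]}, cv=3)
grid.fit(docs, labels)
print(grid.best_params_)
```

The same grid dictionary can be merged with the ranges already tuned for the random forest pipeline, so alpha is searched jointly rather than in isolation.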
Scripting:
I can't seem to activate the environment
Your processed data folder isn't empty in the repo
The model selection script outputs a lot of warnings to the console: are they useful, or should they be suppressed?
All scripts ran! Beautifully done :)
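If the warnings turn out to be noise, one common pattern is to silence only the known-noisy step so that genuinely new warnings still surface. A minimal sketch (the `noisy_step` function is a placeholder for whatever model-fitting call emits the warnings):

```python
import warnings

def noisy_step():
    # Placeholder for a fit/transform call that emits repetitive warnings.
    warnings.warn("solver did not converge", UserWarning)
    return "done"

# Suppress only within this block, and only the targeted category,
# so unrelated warnings elsewhere in the script still reach the console.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    result = noisy_step()
print(result)
```

Scoping the filter with `catch_warnings()` avoids globally muting warnings for the rest of the script, which would hide real problems.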
This was derived from the JOSE review checklist and the ROpenSci review checklist.
This is such an interesting dataset and a thoughtful question! Overall, the project is well structured and well executed. I really enjoyed learning about Canadian Heritage funding :) Also, I really appreciate the M-1 instructions; my group ran into this issue as well, and I can see it being frustrating for interested contributors.
A few suggestions:
1. Code of Conduct: "We can be reached by contacting UBC MDS program" is too vague. Someone who wants to reach out might be confused about whom to contact, especially since UBC and MDS are so large (and cohorts change every year!). Explicitly add an email address as a point of contact to eliminate any confusion.
2. Conda environment: the environment should be renamed to reflect the project name. "environment" is too vague and could be confusing in the long run.
3. Analysis: "Therefore, we divided the values into five categories: less than $8k, $8k stands for funding size in the range of 8k to 10k in CAD, $12 stands for funding size in the range of 12k to 23k in CAD"
4. Analysis: "Therefore, we selected random forest as the best performing model to conduct hyperparameter optimization and tuned max_features, max_depth" and "class_weight as well as the maxfeatures argument in CountVectorizer()."
5. Analysis: in general, I think another figure could be added to the "Results and Discussion" section. Perhaps a confusion matrix would be an interesting visualization to add.
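A confusion matrix is cheap to produce from the test predictions. A minimal sketch with invented labels and predictions (the real report would pass `y_test` and the tuned model's output, and the category names below are illustrative):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true vs. predicted funding-size categories (toy data).
y_true = ["<8K", "8-10K", "12-23K", "23-50K", "12-23K", "<8K"]
y_pred = ["<8K", "8-10K", "23-50K", "23-50K", "12-23K", "<8K"]

cats = ["<8K", "8-10K", "12-23K", "23-50K"]
cm = confusion_matrix(y_true, y_pred, labels=cats)
print(cm)
# With matplotlib installed, ConfusionMatrixDisplay(cm, display_labels=cats).plot()
# would render this as a figure for the report.
```

Rows are true categories and columns are predictions, so the off-diagonal cells would make the reported weakness on the $12-23K and $23-50K ranges immediately visible.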
Hi Jasmine,
Thank you for your constructive feedback! There are many points that we will definitely be addressing. One comment on your review:
It is typical for projects to call their environment file environment.yaml, as the name is self-explanatory. Inside that yaml file, however, you can find the actual name of the environment, which is the case for our file. To activate our environment, type conda activate Cdn_heritage_funding.
Reviewer: Tianwei Wang
Overall, it's a great project. Good job, Group 14! I like the idea of the heritage funding size prediction problem.
Some suggestions:
Thank you friends! It's a great opportunity to learn from your project.
We have fixed the typos involving CountVectorizer and RandomForestClassifier in the original report.
Submitting authors: @artanzand @aimee0317 @jo4356 @xiangwxt
Repository: https://github.com/UBC-MDS/canadian_heritage_funding
Report link: https://htmlpreview.github.io/?https://github.com/UBC-MDS/canadian_heritage_funding/blob/main/doc/canadian_heritage_funding_report.html
Abstract/executive summary:
We attempt to build a multi-class classification model that uses features not indicative of artistic merit, such as location, audience, and discipline, to predict the funding size granted by the Canadian Heritage Fund (The Fund). We initially tried four popular classification algorithms: logistic regression, Naive Bayes, C-Support Vector Classification (SVC), and Random Forest, and used DummyClassifier as a base case. We then selected Random Forest as the best algorithm for our question based on each model's cross-validation scores and conducted hyperparameter optimization on it. Our model performs reasonably well compared to the DummyClassifier base case, with a macro-average F1 score of 0.68 and a weighted-average F1 score of 0.68. However, we also observed that the model performs worse at classifying funding sizes in the ranges of $12-23K and $23-50K compared to other ranges. Thus, we suggest further study to improve this classification model.
Editor: @artanzand @aimee0317 @jo4356 @xiangwxt
Reviewers: Rudyak_Rada, WANG_Tianwei, ORTEGA_Jasmine