Milestone 1 Review

Ivyqiuhan commented 2 years ago

Nice job! Here are some comments and your grades for the first milestone. Please address these concerns in your third milestone submission.

Also, try to close the issues that you have already finished :)

Draft a Team work contract: Correctness
Project set-up: Mechanics
- Fixing typos
Project proposal: reasoning

"First, we split our data set into train and test splits by 20:80 ratio." Do you mean an 80:20 train/test ratio? Training on only 20% of the data would give a terrible model.
What about data visualization? What specifically are you going to do? Will you do a heat map with correlation to make sure that variables are not redundant?
What about class balance? There could be an imbalance in the classes, in which case you would have to undersample or oversample.
What about missing data? How will you handle it?
What about classification models?
For these algorithms, what packages will you use? Have you thought of using wrapper algorithms (e.g., the Boruta algorithm) for feature selection?
"Once we find our best model, we evaluate that on the test set and report the scores and confusion matrix." Which metrics will you report from your confusion matrix: accuracy, F1 score, recall, precision? And why? One suggested metric for model performance is the area under the ROC curve (AUC), which measures the ability of a classifier to distinguish between classes and summarizes the ROC curve. Also look into SHAP (SHapley Additive exPlanations), which shows the direction and size of each variable's contribution to the predicted outcome.
I would avoid creating a new derived variable without literature backing your rationale or an established standard for it. "In order to avoid overfitting on this unbalanced feature, we define a new variable season, and plot the burned area versus this new variable." In addition, what do you mean by month being an unbalanced feature? Variance in the data, including this kind of imbalance, helps the model distinguish strong predictors from variables that are weaker at predicting whether an area will have a fire. You are trying to use as many measurements as possible to predict whether an area will burn, and that could include month. "Surprisingly and contrary to our expectations, the summer fires are not significantly larger than other seasons." This result should itself be alarming, since it does not match reality.
"We will also have a line plot showing the predicted burned area from our best model versus the real burned area to highlight how well our model performs." Isn't this just an accuracy metric? You don't need to show this; you can just report an accuracy metric.

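To make the split question concrete, here is a minimal sketch of an 80:20 train/test split in plain Python. The helper name `split_train_test`, the seed, and the toy rows are assumptions for illustration only and are not taken from the project; in practice a library helper such as scikit-learn's `train_test_split` would usually do this job.

```python
import random


def split_train_test(rows, train_fraction=0.8, seed=123):
    """Shuffle rows and split them into (train, test) lists.

    Hypothetical helper for illustration; not the project's code.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                  # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]


rows = list(range(100))                    # stand-in for 100 observations
train, test = split_train_test(rows)       # 80% train, 20% test
print(len(train), len(test))
```

Passing `train_fraction=0.2` instead would reproduce the 20:80 split quoted from the proposal, leaving the model only a fifth of the data to learn from.
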
A script that downloads the data: Accuracy
A script that downloads the data: Quality
Exploratory data analysis in a literate code document: QUALITY

Titles need to make sense.

Exploratory data analysis in a literate code document: VIZ

If figure captions are not provided, the plot should be clearly explained in the text. I would recommend using figure captions.

Exploratory data analysis in a literate code document: REASONING

You need to draw conclusions from the heat maps and EDA: which variables are you going to include/exclude, and why?

Exploratory data analysis in a literate code document: ACCURACY
Expectations: Mechanics

voremargot commented 2 years ago

We will address this
We have changed this in the proposal to be the correct ratio. Thanks!
In terms of discussing plots in the proposal, paragraph 3 discusses several of the plots and the findings from them. If more needs to be added, please clarify what we are missing.
Because we are working with a regression problem and not a classification problem, we do not need to be concerned with class imbalance. We do discuss how we are addressing the skewness of the data, which is more relevant to our research question.
We will add a statement to the proposal, but this was thoroughly addressed in the EDA document.
While we thought about doing a classification model determining if there were forest fires or not, we have opted to do a regression model as stated in the proposal.
We will list the packages we are using in the proposal document. We will also mention our choice not to use feature selection in our case.
While we did mention a confusion matrix in the original proposal, this was an error, as we are not doing a classification problem. We did mention that we are using RMSE as our scoring metric, so we will be sure to make this clearer.
In the EDA, we observed that some months didn't have observations, so we wanted to create features to address this. I have expert knowledge in forest fires and know that seasons are a good feature to add, so we were confident in our feature engineering.
We plan to add this plot, as it will easily show the reader how the model performed. We plan to show the error metrics but feel that the results will be more impactful if we show them in a plot.
The document titles are from a workflow described in "Art of Data Science", which is used in this course. We wanted to follow these guidelines, which is why we chose the section titles. This is cited in the EDA document. (https://leanpub.com/artofdatascience)
All figures have captions. Please clarify where the confusion is.
We feel that the EDA document contains conclusions for many of our plots. Under each plot is an explanation of what we see as well as how we will address the findings. Please specify how we need to make more detailed conclusions.
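
For reference, the RMSE scoring metric mentioned above can be sketched as follows. This is a minimal illustration in plain Python; the function name `rmse` and the toy burned-area values are assumptions made up for the example, not the project's code or data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up burned areas and model predictions, for illustration only.
actual = [0.0, 2.0, 10.0, 4.0]
predicted = [1.0, 2.0, 7.0, 6.0]

score = rmse(actual, predicted)
print(f"RMSE: {score:.3f}")

# The per-observation residuals are what a predicted-vs-actual plot displays.
for a, p in zip(actual, predicted):
    print(f"actual={a:5.1f}  predicted={p:5.1f}  residual={a - p:+.1f}")
```

RMSE compresses all residuals into one number, while a predicted-versus-actual plot shows where the individual errors fall, so the two are complementary.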

Ivyqiuhan commented 2 years ago

Good work, here are my comments:

You still need to consider class imbalance even for a regression model.
Your figures do indeed have titles; that was an error on my end (I did not deduct points for that).
In your EDA, you need to state in the conclusion section which variables you are going to include/exclude and why, even though you explained this under each figure.
I can give you a B+ for the proposal section.
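
One way to read the imbalance point for a regression target is as a heavily skewed response, where a few very large values dominate many small ones. The sketch below measures skewness before and after a `log1p` transform, a common remedy for right-skewed targets. The helper `sample_skewness` and the burned-area values are assumptions made up for illustration; nothing in this thread confirms that the team uses this transform.

```python
import math


def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (0.0 for constant data)."""
    n = len(values)
    if n == 0:
        raise ValueError("at least one value is required")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0
    return m3 / m2 ** 1.5


# Made-up burned areas: many zeros and small fires, one very large fire.
areas = [0.0, 0.0, 0.0, 0.5, 1.2, 3.0, 8.5, 40.0, 250.0]

# log1p(x) = log(1 + x) is defined at zero, so zero-area rows are kept.
log_areas = [math.log1p(a) for a in areas]

raw_skew = sample_skewness(areas)
log_skew = sample_skewness(log_areas)
print(f"skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

If a model is trained on the transformed target, its predictions need to be mapped back with `math.expm1` before computing errors in the original units.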

UBC-MDS / forest-fire-area-prediction

Milestone 1 Review #48