Milestone 1 Feedback - Githubissues

andytai7 commented 2 years ago

3. Project proposal: reasoning

Comments What sort of EDA will you do? What types of plots? Why? Any hypotheses? What about class balance? There could be an imbalance in the classes, in which case you would have to under-sample or over-sample. Which one will you use?

What about missing data? How will you handle the missing data?

Why only use linear regression? Have you thought of using wrapper algorithms (boruta algorithm) for feature selection?

Will you do cross-validation?

What about metrics? That was not touched upon in the project proposal. A suggestion for metrics to determine the performance of your models is Area Under the Curve (AUC). The AUC measures the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve. Also, look into SHAP (SHapley Additive exPlanations), which explains the direction of each variable's effect on the outcome variable.
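As a rough illustration of the AUC described above (and not the project's code), the sketch below computes it as the fraction of positive/negative pairs in which the positive example receives the higher score. The function name `roc_auc` and the sample labels and scores are hypothetical. It applies only to classifiers with binary labels.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and classifier scores, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.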

5. Exploratory data analysis in a literate code document: VIZ

Comments If figure captions are not provided, the plot should be clearly explained in the text. I would recommend using figure captions. The plots are also missing legends and x- and y-axis labels.

This needs a lot more work to flesh out why you are doing certain EDA and data visualization. In addition, these reasons should be stated and should inform how you proceed in terms of methods (e.g., data transformation, data cleansing).

5. Exploratory data analysis in a literate code document: REASONING

Comments The rationale is acceptable, but I don't know what plots relate to what.

shivajena commented 2 years ago

Thank you @andytai7 for taking the time to write a detailed review of our milestone 1 submission; we really appreciate it. We as a team are learning a lot from this group project, along with the collaborative practices. However, we have difficulty understanding some of your observations, for which we need clarification, or rather a closer look from your side as well as ours, to address the gaps:

1. Project proposal {reasoning}: What sort of EDA will you do? What types of plots? Why? Any hypotheses?

Our project answers a prediction question: predicting giant pumpkin weights, which we have stated very explicitly in our project proposal in the README file. We have described our approach to data preparation for EDA, some preliminary but important EDA observations, and the link to the detailed EDA report. For example, we have mentioned the distributions of some of the attributes that could be our potential features, along with input and output forms, as per the rubric indicators of milestone 1. Therefore, we request that you take another look at this.

What about class balance? There could be an imbalance in the classes, in which case you would have to under-sample or over-sample. Which one will you use?

To reiterate, we are dealing with a prediction problem for a continuous variable: giant pumpkin weight. This is not a classification problem, and hence we are not sure what exactly you wanted to convey about class balance.

What about missing data? How will you handle the missing data?

Missing data handling is clearly described in our EDA report. The project proposal gives only an indication of data preparation, since milestone 1 was not a detailed model deployment that required us to explore a missing data strategy (which we have indeed done in milestone 2, as desired).

Why only use linear regression? Have you thought of using wrapper algorithms (boruta algorithm) for feature selection?

We appreciate this suggestion, and we plan to implement wrapper algorithms in the future once we learn them in 573, which we have not done yet. In our submissions so far, we have covered various types of regression models. However, our data set has only a limited number of features (4-5), so feature selection algorithms may not be very relevant in our context. Feature engineering could instead be the key, and we are trying our best to come up with additional features, such as polynomial features, or to apply any other interesting learnings from 573.

Will you do cross-validation?

Yes, and in fact we have touched upon this, as well as the method of using pipe operators along with hyperparameter optimisation, in the predictive modelling section of our proposal. We kindly request that you take another look.
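For readers unfamiliar with cross-validation, here is a minimal standard-library sketch of k-fold validation for a one-feature linear model. It is not the project's pipeline: the helper names (`fit_line`, `k_fold_indices`, `cross_validate`) and the data are hypothetical, and the folds are contiguous, so real data would normally be shuffled first.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(xs, ys, k=5):
    """Return the mean absolute validation error of each fold."""
    fold_errors = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        # Fit only on the rows outside the current validation fold.
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        errors = [abs(ys[i] - (slope * xs[i] + intercept)) for i in fold]
        fold_errors.append(sum(errors) / len(errors))
    return fold_errors


# Hypothetical, perfectly linear data, for illustration only.
xs = [float(i) for i in range(10)]
ys = [3.0 * x + 2.0 for x in xs]
fold_errors = cross_validate(xs, ys, k=5)
```

Each fold is held out once while the model is fit on the remaining folds, so every observation contributes to exactly one validation score.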

What about metrics? That was not touched upon in the project proposal. A suggestion for metrics to determine the performance of your models is Area Under the Curve (AUC). The AUC measures the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve. Also, look into SHAP (SHapley Additive exPlanations), which explains the direction of each variable's effect on the outcome variable.

Again, we are answering a prediction question, and ROC-AUC is a classification metric, not a regression metric. For the regression context, we have explicitly mentioned metrics such as R-squared score and accuracy as our initial scoring metrics. We request that you take another look.
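To make the regression metrics concrete, the sketch below computes the R-squared score named above, together with root mean squared error (RMSE), which is an added example and is not mentioned in the thread. The function names and sample values are hypothetical.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical true and predicted weights, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
score = r_squared(y_true, y_pred)
error = rmse(y_true, y_pred)
```

R-squared is unitless and compares the model against always predicting the mean, while RMSE reports the typical error in the units of the target.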

5. Exploratory data analysis in a literate code document: VIZ Comments If figure captions are not provided, the plot should be clearly explained in the text. I would recommend using figure captions. The plots are also missing legends and x- and y-axis labels.

We have provided figure captions and highlighted important observations in our EDA report. We can discuss the specifics further if needed.

This needs a lot more work to flesh out why you are doing certain EDA and data visualization. In addition, these reasons should be stated and should inform how you proceed in terms of methods (e.g., data transformation, data cleansing).

For prediction problems where the number of features is small, data transformation and cleaning are very important, which is why we have given detailed observations on these aspects in the data summary part of our EDA report. Had it been an inferential question or a prediction problem with a large number of features, it would have been a perfect case for incorporating specific EDA along with explanations. As we understand it, that is not applicable to our project given such a limited number of features.

5. Exploratory data analysis in a literate code document: REASONING Comments The rationale is acceptable, but I don't know what plots relate to what.

The one thing we focused on thoroughly was observing any association between pumpkin weight and the features, which were mostly categorical, with very few numerical. This is what we have explained in the EDA report.

Given all of the above comments, we request that you take another look at our proposal and EDA and, if possible, regrade these sections.

andytai7 commented 2 years ago

Hi @shivajena

Thank you for your comments!

For future reference, all the TAs will include a "suggestion" section, which is basically where we give suggestions that do not follow the basic rubric. As you may know, these suggestions help with students' brainstorming and further development of their projects. Please keep this in mind when reading the comments you disagree with, as your group may not have lost ANY marks for the suggestions I have given. Below, I try my best to clear up some of my comments.

"Our project answers a prediction question: predicting giant pumpkin weights, which we have stated very explicitly in our project proposal in the README file. We have described our approach to data preparation for EDA, some preliminary but important EDA observations, and the link to the detailed EDA report. For example, we have mentioned the distributions of some of the attributes that could be our potential features, along with input and output forms, as per the rubric indicators of milestone 1. Therefore, we request that you take another look at this."

For this section, I believed that the idea was not fleshed out enough. The specific EDA observations discussed were not linked to the figures produced, which was confusing.

"To reiterate, we are dealing with a prediction problem for a continuous variable: giant pumpkin weight. This is not a classification problem, and hence we are not sure what exactly you wanted to convey about class balance."

When dealing with a continuous variable and regression, there are instances when an imbalance in the dataset still occurs. For reference, please check out these links! https://towardsdatascience.com/regression-for-imbalanced-data-with-application-edf93517247c https://link.springer.com/article/10.1007/s10994-020-05900-9
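One simple way to look for this kind of imbalance in a continuous target, offered as a sketch and not taken from the linked articles, is to bin the target values and flag sparsely populated ranges. The helper names, the 10% threshold, and the sample weights below are all hypothetical.

```python
from collections import Counter


def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    low, high = min(values), max(values)
    if high == low:
        # All values identical: everything lands in the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    width = (high - low) / n_bins
    counts = Counter()
    for v in values:
        # The maximum value goes into the last bin, not a new one.
        idx = min(int((v - low) / width), n_bins - 1)
        counts[idx] += 1
    return [counts.get(i, 0) for i in range(n_bins)]


def rare_bins(counts, threshold=0.1):
    """Indices of bins holding less than `threshold` of all observations."""
    total = sum(counts)
    return [i for i, c in enumerate(counts) if c / total < threshold]


# Hypothetical pumpkin weights with one extreme value, for illustration only.
weights = [100, 120, 130, 150, 160, 170, 180, 190, 200, 210, 220, 1000]
counts = bin_counts(weights, 3)
sparse = rare_bins(counts)
```

Target ranges that fall into sparse bins are the ones a regression model sees least often, which is where resampling or reweighting might be considered.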

"Yes, and in fact we have touched upon this, as well as the method of using pipe operators along with hyperparameter optimisation, in the predictive modelling section of our proposal. We kindly request that you take another look."

I have looked into the group's milestone1 release. Unfortunately, I cannot find your proposal in this release. I can only see a pumkins_eda PDF, which does not touch upon hyperparameter optimization, and a pumpkin.rmd (which appears not to have been worked on and is looking into the cars package). The PDF does cover EDA in more detail, but again, the figures are missing captions and are not labelled in a way that ties them to the discussion section. To summarize, the structure of the documents for milestone 1 is a mess, and I can't find what I need quickly, which is a big problem. Please see attached screenshot.

"Again, we are answering a prediction question, and ROC-AUC is a classification metric, not a regression metric. For the regression context, we have explicitly mentioned metrics such as R-squared score and accuracy as our initial scoring metrics. We request that you take another look."

This was a suggestion, and the ROC-AUC method does not need to be utilized. I highly recommend looking into a logistic regression, a classifier, to diversify the utilization of different prediction models.

"We have provided figure captions and highlighted important observations in our EDA report. We can discuss the specifics further if needed."

When I look at your PDF file, there are no figure captions for your EDA. In addition, your EDA report does not reference any figures. Please see attached screenshot for reference.

"For prediction problems where the number of features is small, data transformation and cleaning are very important, which is why we have given detailed observations on these aspects in the data summary part of our EDA report. Had it been an inferential question or a prediction problem with a large number of features, it would have been a perfect case for incorporating specific EDA along with explanations. As we understand it, that is not applicable to our project given such a limited number of features."

I suggest utilizing a heat map.

I have noticed that you mentioned your proposal. I don't see this document, even in your milestone 2. Thank you for your hard work, and I hope I have shed light on some of the confusion this group might have.

shivajena commented 2 years ago

Thanks Andy, it's much clearer now. We will discuss and implement the ideas. Actually, we thought the proposal was to be written in the README, as per the instructions in milestone 1. But we do get your point, and we will try to be a bit more explicit in the README to indicate our proposal plan, maybe with a proposal section. As for the rest, we will definitely discuss them within our group and update you. We really appreciate the time you have taken.

imtvwy commented 2 years ago

@andytai7

I have noticed that you had mentioned your proposal. I don't see this document, even in your milestone 2.

Sorry Andy, we thought the proposal was to be written in the README.md instead of in another document, as suggested in the Milestone 1 instructions here.
https://pages.github.ubc.ca/MDS-2021-22/DSCI_522_dsci-workflows_students/materials/assignments/milestone1.html "Note 2: the proposal should be written in the README.md file in the root of your public GitHub.com repo."

The empty pumpkin.Rmd is just a placeholder for the final report in the first milestone. It is only there to show our proposed project structure. It should have been mentioned in the README.md as well.

Please advise if we have to create another proposal document.

andytai7 commented 2 years ago

No, this should be fine. Thank you.

UBC-MDS / Giant_Pumpkins_Weight_Prediction

Milestone 1 Feedback #39