Milestone 1 feedback

andytai7 commented 2 years ago

2. Project set-up: Mechanics

Comments: Where is the release?

3. Project proposal: reasoning

Comments: What packages will you use for EDA, and what types of methods? What about class balance? There could be an imbalance in the classes, in which case you would have to undersample or oversample. Which one will you use?

What about missing data, how will you handle the missing data?

For these algorithms, what packages will you use? Have you thought of using wrapper algorithms (such as the Boruta algorithm) for feature selection?

Will you do cross-validation?

A suggested metric for determining the performance of your models is Area Under the Curve (AUC). AUC measures the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve. Also look into SHAP (SHapley Additive exPlanations), which explains the direction of each variable's effect on the outcome variable.
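To make the metric concrete, here is a minimal sketch of AUC computed from its ranking interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not the project's data or code.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = churned) and predicted churn probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. Because it depends only on ranking, AUC is less sensitive to class imbalance than plain accuracy.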

5. Exploratory data analysis in a literate code document: VIZ

Comments: I don't understand the graphs you created. You need an explanation for each of them, as well as a definition of every variable. If this is not done, the EDA, which is meant to show which variables are key and to suggest some preliminary hypotheses, is useless. In addition, the EDA should inform how you do your data wrangling, and I don't see those conclusions.

5. Exploratory data analysis in a literate code document: REASONING

Comments: Why are you using some of those EDA techniques? What is the purpose of a heat map, etc.? Variable lists are useless unless you define what the variables represent. Needs higher resolution.

jcasoli commented 2 years ago

Thanks for your comments @andytai7! Regarding the project proposal, I wasn't sure whether to make changes in the proposal itself or just ensure that we address them in the final report. I decided to go with the latter option.

Here are some specifics:

What packages will you use for EDA, and what types of methods?

We mention in our final report that Altair was used for EDA.

What about class balance? There could be an imbalance in the classes, in which case you would have to undersample or oversample. Which one will you use?

We mention in the final report (in the EDA section) that we do have class imbalance, and therefore we decided to optimize the class_weight hyperparameter of LogisticRegression. It turns out that setting class_weight="balanced" does improve model performance.
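For context, scikit-learn documents the "balanced" mode as weighting each class by n_samples / (n_classes * count of that class), so the rarer class gets a proportionally larger weight. The sketch below reproduces that formula in plain Python. The helper name `balanced_class_weights` and the 8-to-2 label split are illustrative assumptions, not the project's actual class counts.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weights inversely proportional to class frequency.

    Mirrors the documented "balanced" heuristic:
    n_samples / (n_classes * count_of_class).
    """
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical imbalanced target: 8 customers stayed, 2 churned.
y = ["No"] * 8 + ["Yes"] * 2
weights = balanced_class_weights(y)
print(weights)  # {'No': 0.625, 'Yes': 2.5}
```

With these weights, each misclassified minority-class example contributes four times as much to the loss as a majority-class one, which is an alternative to resampling the data.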

What about missing data, how will you handle the missing data?

We used a SimpleImputer, though there wasn't much missing data.
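As a rough illustration of what an imputer does, here is a minimal sketch of mean imputation for a single numeric column. The comment above does not say which strategy was used, so the mean strategy, the helper name `impute_mean`, and the sample values are all assumptions made for illustration.

```python
import math


def _is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or math.isnan(value)


def impute_mean(column):
    """Replace missing entries with the mean of the observed entries."""
    observed = [v for v in column if not _is_missing(v)]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill_value = sum(observed) / len(observed)
    return [fill_value if _is_missing(v) else v for v in column]


# Made-up charges with two missing entries.
total_charges = [29.85, float("nan"), 108.15, None, 151.65]
imputed = impute_mean(total_charges)
print(imputed)
```

In a modelling workflow the fill value should be computed on the training split only and then reused on validation and test data, so that no information leaks across splits.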

Will you do cross-validation?

Yes, we performed cross-validation as part of our hyperparameter optimization.
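The general pattern of cross-validated hyperparameter search can be sketched as follows: for each candidate value, fit on k-1 folds, score on the held-out fold, average the k scores, and keep the candidate with the best mean. Everything here is a toy stand-in: the helpers `k_fold_indices` and `grid_search_cv`, the one-feature threshold "model", and the data are hypothetical and are not the project's pipeline.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


def grid_search_cv(candidates, fit_and_score, n_samples, k=5):
    """Return (best_candidate, mean_scores) using mean validation score."""
    mean_scores = {}
    for candidate in candidates:
        scores = [
            fit_and_score(candidate, train, valid)
            for train, valid in k_fold_indices(n_samples, k)
        ]
        mean_scores[candidate] = sum(scores) / len(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores


# Toy data: one feature, binary label (made up for illustration).
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def fit_and_score(offset, train_idx, valid_idx):
    """Fit a threshold on the training fold, return accuracy on the held-out fold.

    The threshold is the midpoint between the two class means, shifted by
    the hyperparameter `offset`.
    """
    neg = [X[i] for i in train_idx if Y[i] == 0]
    pos = [X[i] for i in train_idx if Y[i] == 1]
    threshold = (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2 + offset
    correct = sum((X[i] >= threshold) == (Y[i] == 1) for i in valid_idx)
    return correct / len(valid_idx)


best_offset, mean_scores = grid_search_cv([-3, 0, 3], fit_and_score, len(X), k=5)
print(best_offset, mean_scores)
```

The folds here are contiguous and unshuffled to keep the sketch short. With imbalanced classes, stratified folds are usually preferable so that each fold preserves the class ratio.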

andytai7 commented 2 years ago

Thank you for addressing these comments!

Best, Andy

Anupriya-Sri commented 2 years ago

Hi @andytai7,

In addition to the response from @jcasoli, please note the following:

Where is the release?

The release is in the GitHub repo under the Releases tab. It was already part of the Milestone 1 submission: https://github.com/UBC-MDS/Telco_Customer_Churn_Prediction_Group12/releases/tag/Milestone1

Why are you using some of those EDA techniques? What's the purpose of a heat map etc.? Variables lists are useless unless you define what the variables represent. Needs higher resolution.

Detailed explanations for each plot and additional analysis, including Pandas profiling, were covered in the EDA notebook. In the final report, we have captured the important analysis. We have tried to use easy-to-interpret techniques and self-explanatory variable names, such as Total Charges. However, we will look into these again to see if anything is unclear.

Hope that clarifies.

andytai7 commented 2 years ago

Thank you @Anupriya-Sri for more clarification.

Best, Andy

Anupriya-Sri commented 2 years ago

Discussed and closed.

UBC-MDS / Telco_Customer_Churn_Prediction_Group12

Milestone 1 feedback #51