UBC-MDS / data-analysis-review-2021


Submission: Group 12: Telco Customer Churn Prediction #23

Open adammorphy opened 2 years ago

adammorphy commented 2 years ago

Submitting authors: @adammorphy @jcasoli @zzhzoe @Anupriya-Sri

Repository: https://github.com/adammorphy/Telco_Customer_Churn_Prediction_Group12 Report link: http://htmlpreview.github.io/?https://github.com/adammorphy/Telco_Customer_Churn_Prediction_Group12/blob/main/docs/Telco_Customer_Churn_Prediction_Report.html Abstract/executive summary: In this project, we examine the following question: given certain telecommunications customer characteristics, can we predict the likelihood that a customer will churn, and further understand which customer characteristics are positively associated with high churn risk? The dataset we are using comes from the public IBM GitHub page and is made available as part of an effort by IBM to teach the public how to use some of their machine learning tools. We used a logistic regression algorithm to build a classification model to predict which customers are likely to churn from their telecommunications company. Additionally, we reported which features are most positively and negatively correlated with our target, as learned by our model. On test data our model had satisfactory performance. With respect to the "Churn" class, our model had an f1 score of ~0.63, a recall score of ~0.82, and a precision score of ~0.51. These metrics were chosen to compensate for the class imbalance in the target class. The features most positively correlated with churn include high monthly charges, month-to-month contracts, and fiber optic internet service. The features most negatively correlated with churn include tenure, two-year contracts, and DSL internet service.

The dataset we are using comes from the public IBM GitHub page and is made available as part of an effort by IBM to teach the public how to use some of their machine learning tools. Unfortunately, no mention is made of exactly how the data was collected, or who was responsible for the collection. Here is a link to the mini-course that references the dataset we are using. The raw data is here, and lives inside the data folder of the public repository for the mini-course. Each row in the dataset corresponds to a single customer. There are 19 feature columns, along with the target column, "churn."

Editor: @flor14 Reviewer: Mao Lisheng, Gordon Julien, @jamesktkim , @garhwalinauna

nickmao1994 commented 2 years ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

1 hour

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

I think the report is particularly good at:

My suggestion on improvements:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

BooleanJulien commented 2 years ago

Reviewer: Julien Gordon -

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.25

Review Comments:

Overall great work. It was a pleasure reading through it and I'm happy that I was assigned to review your group. I learned a lot from your analysis!

I hope my comments can help to improve the outcome.


The code quality and readability were very high, the scripts were well organized, and the functions were well named. One minor thing: more testing could be done in the scripts to avoid changes breaking things. Maybe add at least one more test after successfully reading in a CSV to make sure the CSV data is what you expect. I realize this is nitpicky, because overall everything seems very well done.
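A lightweight schema check of the kind suggested above might look like the following sketch with pandas. The column names are assumptions based on the Telco dataset and would need adjusting to the real raw-data schema:

```python
import io
import pandas as pd

def read_telco_data(path):
    """Read the raw Telco CSV and fail fast if it is not what we expect."""
    df = pd.read_csv(path)
    # Assumed column names; adjust to the actual raw data schema.
    expected_cols = {"tenure", "MonthlyCharges", "Contract", "Churn"}
    missing = expected_cols - set(df.columns)
    assert not missing, f"CSV is missing expected columns: {missing}"
    assert len(df) > 0, "CSV contains no rows"
    assert set(df["Churn"].unique()) <= {"Yes", "No"}, "Unexpected Churn labels"
    return df

# Tiny in-memory example standing in for the real file path.
sample = io.StringIO(
    "tenure,MonthlyCharges,Contract,Churn\n"
    "1,29.85,Month-to-month,No\n"
    "5,70.0,One year,Yes\n"
)
df = read_telco_data(sample)
print(df.shape)  # (2, 4)
```

A check like this turns a silent upstream change (a renamed column, an empty download) into an immediate, descriptive failure.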


The research question can be cleaned up:

Here's what you have:

Consider certain telecommunications customer characteristics, predict the likelihood that a given customer is likely to churn, and further understand what customer characteristics are positively associated with high churn risk.

Consider using something like:

Can we construct a machine learning model which predicts telecommunications customer churn likelihood given obtainable customer characteristics? Further, can we identify what characteristics are most important in predicting this churn risk?


Another writing tip for the summary and throughout the report: try not to switch tense in the middle of a paragraph. For example, in the summary you start in the present tense and then switch to past tense once you start writing about what you did. While this is an intuitive way to write, it is jarring for the reader.

Background is well motivated and to-the-point.

Minor thing: try to avoid unnecessary parentheses, e.g. ```(by IBM)```

EDA has very logical choices, good stuff. The flow between figure 1 and 2 where you point out the power of tenure was very satisfying.

Unnecessary legend in figure 1

Figure 4 is a bit dicey. I think you are trying to show which categorical classes provide the most differentiation w.r.t. churn. It's hard to tell with the stacked bar chart, as differences in the overall size of the bars make it hard to compare proportions, especially for partner. Perhaps find a different way of showing this.
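One lightweight alternative is to plot within-group churn proportions instead of raw counts. For instance, a normalized crosstab makes the comparison independent of group size (a sketch with pandas; the `partner`/`churn` values here are made up):

```python
import pandas as pd

# Toy stand-in for the real data.
df = pd.DataFrame({
    "partner": ["Yes", "Yes", "No", "No", "No", "Yes"],
    "churn":   ["No",  "Yes", "No", "Yes", "Yes", "No"],
})

# normalize="index" converts counts to proportions within each partner group,
# so differing group sizes no longer obscure the comparison.
props = pd.crosstab(df["partner"], df["churn"], normalize="index")
print(props)
```

Charting `props` as side-by-side bars (or using a normalized stack in Altair) would then show the differentiation directly.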


The results/conclusions/limitations are the sections that need the most work. You've communicated a very technical output of your model, but you need to take the next step and clearly answer your research question. As a reader it's like you took me out for a nice dinner for the majority of the report and then dumped me on the curb right at the end!

What do your f1 and other scoring methods mean in terms of your overall model's ability to predict, in plain language? When would it perform well? Where does it falter?

To address the second part of your research question, you especially need to report your most important model components in plain language, not leave them as variable names in a table, and then, as experts on the model, discuss the implications w.r.t. a telecom company's business perspective.

The limitations need to go beyond next steps. What are the limitations of your model, and especially what are the limitations of your data? Take a look at how and when it was collected, and how that may impact out-of-sample or "in the wild" predictions.


That's all from me. Best of luck moving forward!

-Julien

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

rrrohit1 commented 2 years ago

Reviewer: Rohit Rawat

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

Review Comments:

It was a great opportunity to go through your repository, which gave me plenty of points to ponder for my own work, especially the scripting portion. My review pertains to your second release (Milestone 2), which can be found here. So, I won't be reviewing the Makefile that you might currently be working on.

What I really liked about your work:

  1. The EDA notebook is exhaustive and provides a lot of meaningful insights. The flow of the notebook is ordered beautifully, and everything done makes complete sense.
  2. I was pleased to see that there were clear instructions to create, install and load the environment.
  3. The try-except for handling directory-related issues is excellent.
  4. The use of functions in the scripts helps in improving readability and understanding your work.

My suggestions:

Related to EDA document:
  1. Comments given in the notebook can be skipped as they are just there for saving the plot.
  2. In the class imbalance figure, for readability, it would be better to put the target classes on the y-axis.
Related to methods used:
  1. `max_iter` could also be tuned during hyperparameter optimization.
  2. During GridSearch, the ROC-AUC score would be a better metric than the F-1 score since it gives a better estimate over different thresholds for prediction.
Related to final report & README:
  1. In the final report, Fig 3 and Fig 4 need to be scaled down in size.
  2. What are the reasons for preferring Logistic Regression over DecisionTree, KNN, or RandomForest? I see that this is mentioned in the Limitations & Future section, but there should have been a rationale for choosing it over the others.
  3. The citation at the end of the Analysis section should not be there.
  4. Highlight the research question or add it as a separate section if possible.
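The two method-related points above (tuning `max_iter`, scoring with ROC-AUC) could be folded into a single grid search. The sketch below uses scikit-learn on synthetic imbalanced data; the grid values are purely illustrative, not the project's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced binary classification problem.
X, y = make_classification(n_samples=300, weights=[0.75, 0.25], random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
param_grid = {
    "logisticregression__C": [0.1, 1.0, 10.0],
    "logisticregression__max_iter": [100, 500, 1000],  # tune max_iter as well
}

# scoring="roc_auc" evaluates ranking quality across all decision thresholds,
# rather than at the single threshold implied by f1.
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```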

Some of the above points are minor changes, and overall I was satisfied with the report. Your project gave me a lot of useful tips for improving my own project's scripts, and I am grateful for it.

All the best for the remainder of the project.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

jamesktkim commented 2 years ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hr

Review Comments:

Overall, great work team. The research question pertained to a real-life issue so it was very interesting to review this project.

Some of the aspects that I liked about the project:

  1. Very interesting topic, relevant to real-life issues
  2. Made thorough references in the introduction of the report, not just for the numbers but also for general claims
  3. Clear explanations of the data preprocessing steps, which were very easy to follow
  4. Including various figures in the EDA to help the audience understand the features
  5. Laying out specific suggestions in the conclusion to improve the model

Some suggestions for improvement:

  1. The Results section in the report looks very messy, so perhaps add a space between each table and the next block of text.
  2. You can also round the coefficients for better readability. Table 4 especially is impossible to read with all the numbers crammed together in a single table without much white space.
  3. Explore other scoring options to further evaluate the model performance, such as ROC-AUC and average precision score.
  4. Lastly, I feel there needs to be a section in the conclusion where you link the test results to real-life contexts. At the moment there are only reports of numbers derived from testing the model.

Again, great work overall; I learned a few things myself from your project to apply to my current and future projects. I hope you guys are having fun with the project and wish you all the best for the rest of this course.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

adammorphy commented 2 years ago

Reviewer: Julien Gordon -

Conflict of interest

* [x]  As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

* [x]  I confirm that I read and will adhere to the [MDS code of conduct](https://ubc-mds.github.io/resources_pages/code_of_conduct/).

General checks

* [x]  **Repository:** Is the source code for this data analysis available? Is the repository well organized and easy to navigate?

* [x]  **License:** Does the repository contain a plain-text LICENSE file with the contents of an [OSI approved](https://opensource.org/licenses/alphabetical) software license?

Documentation

* [x]  **Installation instructions:** Is there a clearly stated list of dependencies?

* [x]  **Example usage:** Do the authors include examples of how to use the software to reproduce the data analysis?

* [x]  **Functionality documentation:** Is the core functionality of the data analysis software documented to a satisfactory level?

* [x]  **Community guidelines:** Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

* [x]  **Readability:** Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?

* [x]  **Style guidelines:** Does the code adhere to well known language style guides?

* [x]  **Modularity:** Is the code suitably abstracted into scripts and functions?

* [ ]  **Tests:** Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

* [x]  **Data:** Is the raw data archived somewhere? Is it accessible?

* [x]  **Computational methods:** Is all the source code required for the data analysis available?

* [x]  **Conditions:** Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?

* [x]  **Automation:** Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

* [x]  **Authors:** Does the report include a list of authors with their affiliations?

* [ ]  **What is the question:** Do the authors clearly state the research question being asked?

* [x]  **Importance:** Do the authors clearly state the importance for this research question?

* [x]  **Background**: Do the authors provide sufficient background information so that readers can understand the report?

* [ ]  **Methods:** Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?

* [ ]  **Results:** Do the authors clearly communicate their findings through writing, tables and figures?

* [ ]  **Conclusions:** Are the conclusions presented by the authors correct?

* [x]  **References:** Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?

* [x]  **Writing quality:** Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.25

Review Comments:

Overall great work. It was a pleasure reading through it and I'm happy that I was assigned to review your group. I learned a lot from your analysis!

I hope my comments can help to improve the outcome.

The code quality and readability were very high, the scripts were well organized, and the functions were well named. One minor thing: more testing could be done in the scripts to avoid changes breaking things. Maybe add at least one more test after successfully reading in a CSV to make sure the CSV data is what you expect. I realize this is nitpicky, because overall everything seems very well done.

The research question can be cleaned up:

Here's what you have:

Consider certain telecommunications customer characteristics, predict the likelihood that a given customer is likely to churn, and further understand what customer characteristics are positively associated with high churn risk.

Consider using something like:

Can we construct a machine learning model which predicts telecommunications customer churn likelihood given obtainable customer characteristics? Further, can we identify what characteristics are most important in predicting this churn risk?

Another writing tip for the summary and throughout the report: try not to switch tense in the middle of a paragraph. For example, in the summary you start in the present tense and then switch to past tense once you start writing about what you did. While this is an intuitive way to write, it is jarring for the reader.

Background is well motivated and to-the-point.

Minor thing: try to avoid unnecessary parentheses, e.g. ```(by IBM)```

EDA has very logical choices, good stuff. The flow between figure 1 and 2 where you point out the power of tenure was very satisfying.

Unnecessary legend in figure 1

Figure 4 is a bit dicey. I think you are trying to show which categorical classes provide the most differentiation w.r.t. churn. It's hard to tell with the stacked bar chart, as differences in the overall size of the bars make it hard to compare proportions, especially for partner. Perhaps find a different way of showing this.

The results/conclusions/limitations are the sections that need the most work. You've communicated a very technical output of your model, but you need to take the next step and clearly answer your research question. As a reader it's like you took me out for a nice dinner for the majority of the report and then dumped me on the curb right at the end!

What do your f1 and other scoring methods mean in terms of your overall model's ability to predict, in plain language? When would it perform well? Where does it falter?

To address the second part of your research question, you especially need to report your most important model components in plain language, not leave them as variable names in a table, and then, as experts on the model, discuss the implications w.r.t. a telecom company's business perspective.

The limitations need to go beyond next steps. What are the limitations of your model, and especially what are the limitations of your data? Take a look at how and when it was collected, and how that may impact out-of-sample or "in the wild" predictions.

That's all from me. Best of luck moving forward!

-Julien

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Hi @BooleanJulien, thanks for your feedback! We have made a few changes based on your comments.

  1. Testing functions have been added to all our scripts to ensure our functions are more robust.
  2. We have updated our research question based on your comments, as well as the recommendations from other comments.
  3. The summary and general writing were updated to standardize the use of present and past tense.
  4. Results/conclusions/limitations have all been significantly expanded upon, including better interpretation of coefficients, f1 scores, limitations, and a description of the practical business case. Here, we attempted to interpret the results in plain language and added business implications and recommendations on how the results could be leveraged in practice, along with the limitations of doing so.
  5. Fig 4 has been decreased in size.
Anupriya-Sri commented 2 years ago

Hi @nickmao1994 ,

We really appreciate your effort in reviewing this project. We are glad that you liked certain sections, and thank you for your suggestions. Please find below our comments on the feedback:

Add `npm install -g vega vega-cli vega-lite canvas` to the environment instructions because I encountered a JSON decoder error. Our group had this problem too.

We have included the installation of these packages in the Dockerfile. However, we were unable to install npm in the environment for Windows, so we have explicitly included the installation instructions in the README file.

For every figure and table, adding a caption and a meaningful title (to figures) will be useful when the audience wants a take home message.

Noted. We have updated the captions for the figures to make them more informative.

The OneHotEncoder and StandardScaler could have been implemented in the data preprocessing or analysis script. Maybe I am missing something, but in the current version the features are passed to the preprocessor without transformation.

We did not implement the transformations in the data preprocessing script in order to avoid violating the golden rule. The training data set is used in the analysis script for cross-validation, and we transform the data only after splitting between the training and validation sets.
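For reference, this "fit transformers only on the training folds" pattern is exactly what wrapping the preprocessing and the model in a single scikit-learn pipeline provides during cross-validation. A sketch on made-up data (the column names are illustrative, not the project's actual schema):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data.
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 10, 28],
    "contract": ["Month-to-month", "One year", "Month-to-month", "Two year",
                 "Month-to-month", "One year", "Two year", "Month-to-month"],
    "churn": [1, 0, 1, 0, 1, 0, 0, 1],
})

preprocessor = make_column_transformer(
    (StandardScaler(), ["tenure"]),
    (OneHotEncoder(handle_unknown="ignore"), ["contract"]),
)
pipe = make_pipeline(preprocessor, LogisticRegression())

# cross_val_score refits the whole pipeline on each training fold,
# so the scaler and encoder never see the validation fold.
scores = cross_val_score(pipe, df[["tenure", "contract"]], df["churn"], cv=2)
print(scores)
```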

Hope the above addresses your concerns.

Again, we are grateful for your feedback in helping us improve the project quality.

jcasoli commented 2 years ago

Reviewer: Rohit Rawat

Conflict of interest

  • [x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • [x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • [x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • [x] Installation instructions: Is there a clearly stated list of dependencies?
  • [x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • [x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • [x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • [x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • [x] Style guidelines: Does the code adhere to well known language style guides?
  • [x] Modularity: Is the code suitably abstracted into scripts and functions?
  • [x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • [x] Data: Is the raw data archived somewhere? Is it accessible?
  • [x] Computational methods: Is all the source code required for the data analysis available?
  • [x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • [x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • [x] Authors: Does the report include a list of authors with their affiliations?
  • [x] What is the question: Do the authors clearly state the research question being asked?
  • [x] Importance: Do the authors clearly state the importance for this research question?
  • [x] Background: Do the authors provide sufficient background information so that readers can understand the report?
  • [ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • [ ] Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • [ ] Conclusions: Are the conclusions presented by the authors correct?
  • [x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • [x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing:

Review Comments:

It was a great opportunity to go through your repository, which gave me plenty of points to ponder for my own work, especially the scripting portion. My review pertains to your second release (Milestone 2), which can be found here. So, I won't be reviewing the Makefile that you might currently be working on.

What I really liked about your work:

  1. The EDA notebook is exhaustive and provides a lot of meaningful insights. The flow of the notebook is ordered beautifully, and everything done makes complete sense.
  2. I was pleased to see that there were clear instructions to create, install and load the environment.
  3. The try-except for handling directory-related issues is excellent.
  4. The use of functions in the scripts helps in improving readability and understanding your work.

My suggestions:

Related to EDA document:
  1. Comments given in the notebook can be skipped as they are just there for saving the plot.
  2. In the class imbalance figure, for readability, it would be better to put the target classes on the y-axis.

This suggestion has been implemented

Related to methods used:
  1. Max_iter could also be tuned during hyperparameter optimization.

  2. During GridSearch, the ROC-AUC score would be a better metric than the F-1 score since it gives a better estimate over different thresholds for prediction.

It is our understanding that the f1 score is better suited to cases where there is class imbalance, so we have decided to stick with the f1 score as our primary scoring metric. Please see this link for more information on why we made this decision.
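A toy illustration of the trade-off: under heavy class imbalance, accuracy can look excellent while the f1 score exposes that the minority "churn" class is being missed entirely (the labels below are invented, not the project's data):

```python
from sklearn.metrics import accuracy_score, f1_score

# 90 non-churners, 10 churners; a degenerate model that predicts
# "no churn" for every customer.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))             # 0.9, looks great
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0, every churner missed
```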

Related to final report & README:
  1. In the final report, Fig 3 and Fig 4 need to be scaled down in size.

This has been done as per your suggestion

  2. What are the reasons for preferring Logistic Regression over DecisionTree, KNN, or RandomForest? I see that this is mentioned in the Limitations & Future section, but there should have been a rationale for choosing it over the others.

We added clarification as to why we made this decision. See excerpt from final report:

"We decided to use a logistic regression model over other models such as DecisionTree or RandomForest primarily because of our familiarity with the algorithm, and because it is convenient to pull feature importances from the fit model."

  3. The citation at the end of the Analysis section should not be there.
  4. Highlight the research question or add it as a separate section if possible.

Some of the above points are minor changes, and overall I was satisfied with the report. Your project gave me a lot of useful tips for improving my own project's scripts, and I am grateful for it.

All the best for the remainder of the project.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Thanks for your thorough review Rohit! I left comments above as to how we addressed your suggestions.

zzhzoe commented 2 years ago

Hi @jamesktkim, thank you for providing valuable feedback so that we can improve our project. We have made the following changes in response to your comments.

Result section in the report looks very messy, so perhaps add a space between the table and the next block of text.

We added a space between each table and the next text block, making our report easier to read.

Lastly I feel like there needs to be a section in the conclusion where you can link the result of the tests with the real life contexts. At the moment there are only reports about numbers that were derived from testing the model.

Thanks for pointing this out. To address your comment on the lack of reference to real-life applications, we will add the following text in the conclusion: "Despite limitations and room for future improvements, our results show a model that can be utilized by telecommunication companies to better predict their customers' churn if they have information on the respective customer characteristics. This model has significant downstream impact, as it can help management reduce acquisition costs and increase customer retention rates."

Thanks again for your comment. Please let us know if you have more questions or concerns.

Anupriya-Sri commented 2 years ago

Considering the feedback received on the project, we have made the following changes:

Why are there two repeated slides "telco_churnpipeline"?

Link to Commit: https://github.com/UBC-MDS/Telco_Customer_Churn_Prediction_Group12/commit/114c21e324f0443ea24ffa421f4e63d2d0a3515b One of the slides had the editable version for future changes, and the other was an image to be included in any report. However, we have taken your feedback, moved the editable version to /src/, and retained the final image in /docs/.

There are also two reports that are duplicated.

Link to Commit: https://github.com/UBC-MDS/Telco_Customer_Churn_Prediction_Group12/commit/06b746ac29611a52f09852ff61a8c5aabde7e1fd The second report was a pandas profiling report created as part of the EDA. However, we have already covered all of this information in the eda_notebook, so we deleted this file (06b746a).

Improve conclusion for better understanding:

Link to Commit: https://github.com/UBC-MDS/Telco_Customer_Churn_Prediction_Group12/commit/622dbd3ad68a625db40a3c21a88e5d5fbba870c9

Caption for the figures to reflect the key message

Link to Commit: https://github.com/UBC-MDS/Telco_Customer_Churn_Prediction_Group12/commit/6307c85d54e83ac0e2566bebb733ad644205b314

Add vega-lite-cli installation details as some groups had to separately install this

Link to Commit: https://github.com/UBC-MDS/Telco_Customer_Churn_Prediction_Group12/commit/2a59bca878cef92a56975581e08e80f9c5f1d987