UBC-MDS / data-analysis-review-2022


Submission: Group 3: customer_complaint_analyzer #7

Open lukeyf opened 1 year ago

lukeyf commented 1 year ago

Submitting authors:  @tieandrews @lukeyf @dhruvinishar

Repository: https://github.com/UBC-MDS/customer_complaint_analyzer Report link: https://ubc-mds.github.io/customer_complaint_analyzer/reports/final_report.html Abstract/executive summary: We aim to investigate, analyze, and report on the customer complaint dataset. This dataset is published on DATA.GOV and is intended for public access and use. It is a collection of customer complaints about purchased financial products, containing the summary and content of each complaint, the response from the company, and whether the customer disputed after the company's response. We aim to answer the following inferential/predictive question: can we predict whether a customer is going to dispute based on their complaint and the company's response?

We plan to analyze the data using a mix of tabular and natural language processing tools, such as the bag-of-words representation, and to apply proper categorical transformations to the company's responses. The data were transformed and analyzed with 5 different models using cross-validation, and recall was selected as the evaluation metric. The results are presented in the report file; the logistic regression model was selected as the best model due to its high performance and interpretability.

Editor: @flor14 Reviewers: Robin Dhillon, Morris Chan, Stepan Zaiatc

robindhillon1 commented 1 year ago

Data analysis review checklist

Reviewer: robindhillon1

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

Suggestions

[Suggestions were attached as a screenshot in the original comment.]

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

morrismanfung commented 1 year ago

Data analysis review checklist

Reviewer: @morrismanfung

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2

Review Comments:

  1. The rendered report is professional. All of the visualizations in the repo are excellent.

  2. Minor issues: the link to "flow" in https://github.com/UBC-MDS/customer_complaint_analyzer/blob/main/CONTRIBUTING.md#regarding-pull-requests-please-consult-this-flow-so-that-all-code-changes-are-made-through-pull-requests is broken. In Table 2, only one feature is listed as "dropped", but it seems that more features (e.g., state, zip_code) were actually dropped.

  3. I am not sure whether cross-validation per se is sufficient without also reporting results on the test set. If you are not doing hyperparameter tuning, cross-validation is essentially equivalent to doing 5 train-test splits. Granted, it's not breaking the golden rule, and since your analysis script does use the test data, the only limitation I can think of right now is that readers may not follow the flow perfectly, as we all expect a held-out test set for performance evaluation. If you want to stick with cross-validation only (which may not be the case), it would be great if you could show the variance across the different folds.
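Showing the fold-to-fold variance the comment asks for is a one-liner with scikit-learn; the data and model below are synthetic stand-ins for the project's actual pipeline:

```python
# Report per-fold scores and their spread, not just a single mean.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data with a real signal in the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="recall")
print(f"recall = {scores.mean():.3f} +/- {scores.std():.3f} "
      f"(folds: {np.round(scores, 3)})")
```

Reporting the mean together with the standard deviation (or the raw per-fold scores) makes it clear how stable the model's performance is.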

  4. Using a bar chart to represent class imbalance is effective. However, around 75% of the entries have an unknown status; I would like to see explicitly how your team dealt with those entries (dropping them, right?).

  5. I think setting a particular model as a baseline for future optimization is a nice idea. I want to suggest another, even simpler model: consider one that classifies all complaints as potentially disputed. It will have a recall of 100%, a precision of around 19.5%, and an F1 of about 32.6% (F1 = 2pr/(p + r)). Yes, the F1 is lower, but the recall is much better than the other models', with only a small sacrifice in precision. By this standard, maybe logistic regression is not that good...?
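The "predict every complaint as disputed" baseline suggested above can be checked with scikit-learn's `DummyClassifier`. The class balance here is synthetic (about 19.5% positives, mirroring the numbers in the comment), not the project's actual data:

```python
# An all-positive baseline: recall is 1.0 by construction, and precision
# equals the positive-class prevalence.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, precision_score, recall_score

rng = np.random.default_rng(42)
y = (rng.random(10_000) < 0.195).astype(int)  # ~19.5% "disputed"
X = np.zeros((len(y), 1))                     # features are irrelevant here

baseline = DummyClassifier(strategy="constant", constant=1).fit(X, y)
pred = baseline.predict(X)

print(recall_score(y, pred))     # 1.0: every true positive is flagged
print(precision_score(y, pred))  # equals y.mean(), the prevalence
print(f1_score(y, pred))         # 2pr/(p + r)
```

Such a trivial baseline is a useful sanity check: any candidate model should beat it on the metric that matters for the problem.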

  6. The reason I didn't check the box for the conclusion is that I think it is too early to conclude that logistic regression is suitable. Without hyperparameter and threshold tuning, it is likely that other models could outperform logistic regression, so concluding that a model is or isn't suitable could be too optimistic or too pessimistic.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

stepanz25 commented 1 year ago

Peer Review

Data analysis review checklist

Reviewer: Stepan Zaiatc @stepanz25

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2.5 hours

Review Comments:


  1. The use of different styles of graphs and charts is effective in communicating trends of the subject matter.
  2. I found the folder structure and organization of your repository clear; however, in the data folder I couldn't find any file containing your archived data sets. The makefile executes well, and it's very well documented with all the details necessary to make the code reproducible.
  3. The source code is generally easy to read. There are sufficient comments, and variable names are meaningful. Just don't forget to include the name and date of whoever wrote each script in its documentation (e.g., generate_eda.py).
  4. The model has been created using data from the CFPB database. As a suggestion, it would be interesting to see how the model behaves on data from a different source, or even on data from the same source but over different date ranges.
  5. I think it is important for the developers to specify the conditions under which the training data was collected (i.e., which states, the language the complaint was submitted in, etc.), as the algorithm may not work where these factors differ. It may therefore be worth including some comments on this in your final report.
  6. I suggest proofreading the report again before submitting the final version, as there are numerous typographic and grammatical errors, such as missing spaces between words and brackets, missing apostrophes, run-on sentences, subject-verb disagreement, unnecessary capitalization, and improper pluralization.
  7. Figure 1 can be improved with a larger text size similar to the body text, by avoiding words that read as commentary in the title, and by simplifying the x-axis label to just "Year".
  8. Figure 2 can be improved by flipping the axes, as it is customary to place the independent variable (yes/no) on the x-axis. As with Figure 1, a larger text size would assist the reader.
  9. Table 1's title is "unique and missing value counts…"; however, the table contains a "Unique Count" column but no "Missing Value Count" column. Numeric values should use comma separators for readability, and I suggest adding an extra column explaining what each field means, as the field names are often cryptic to the reader.
  10. For Table 3, I would suggest reordering the columns to match the order of the list in the preceding paragraph. The order of the graphs in Figure 3 should be adjusted similarly.
  11. In Figure 3, the legend is redundant, as each colour is already labelled on the x-axis of each graph. Moreover, the text size could be larger to improve readability.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

lukeyf commented 1 year ago

Thank you to @stepanz25, @morrismanfung, @robindhillon1, and the TAs for all your helpful feedback! All of your points were valuable, and we hope to address them all in the future. Due to time constraints, here are a few highlights of the feedback we have addressed:

  1. Write 3-5 tests and document what they're testing

Each of our team members has included and documented a few tests in the tests directory. The tests were written using pytest syntax and should be easy to adapt in various IDEs like VS Code and Jupyter.

The commits from the team members for all the tests are in: https://github.com/UBC-MDS/customer_complaint_analyzer/pull/70, https://github.com/UBC-MDS/customer_complaint_analyzer/pull/76 and https://github.com/UBC-MDS/customer_complaint_analyzer/pull/79.
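As an illustration of the pytest style used (the `clean_state` helper here is hypothetical, not one of the project's actual functions; the real tests live in the linked PRs):

```python
# A hypothetical helper and its pytest-style tests.
def clean_state(code):
    """Normalize a two-letter US state code; return None if invalid."""
    code = str(code).strip().upper()
    return code if len(code) == 2 and code.isalpha() else None

def test_clean_state_valid():
    assert clean_state(" ca ") == "CA"

def test_clean_state_invalid():
    assert clean_state("123") is None
    assert clean_state("California") is None
```

Running `pytest` in the repository root discovers and executes any `test_*` functions in the tests directory.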

  2. Report results of the model on the test set in the final report

We have included the test scores in addition to the validation scores in the final report. The change in the analysis file and report quarto file can be found at https://github.com/UBC-MDS/customer_complaint_analyzer/pull/73.

  3. Generate EDA report with figure captions & no-code inputs

For the EDA part of this project, we corrected the figure captions and formatting. The corresponding change can be found at https://github.com/UBC-MDS/customer_complaint_analyzer/pull/78.

  4. Use save_chart instead of the previous saving method to improve reproducibility

One piece of feedback from Milestone 2 was that the makefile did not run to the end automatically. We found this was because chart saving did not work correctly on all platforms. We updated the save_chart functionality in https://github.com/UBC-MDS/customer_complaint_analyzer/pull/56 to make the make process robust.

  5. Put author and date into generate_eda.py

Some of our scripts were not documented appropriately. We added the author and date in this commit: https://github.com/UBC-MDS/customer_complaint_analyzer/pull/78/commits/cd0427f83f0c45f2324233d79cbfd2a3cd3e618a.

  6. Fix up .gitignore, exclude unneeded files

We had unneeded files like .vscode and .Rproj files in our repository. We added them to .gitignore in this commit: https://github.com/UBC-MDS/customer_complaint_analyzer/pull/78/commits/4d944def7ffa69003d6422af6db62b282668e745.

Thanks again for the amazing comments!