ahofmann4 commented 3 years ago

Hi Group 12!

Your code was well written, and it seems like you did a very thorough EDA.

However, there are a lot of gaps in the reasoning in your report, which raise a lot of questions. Below is a summary of comments and suggestions from both Tiff and me:

Figures:
1) All figures are missing axis labels.
2) Figures take up more space than needed to communicate the information (plot area should be reduced, not font size); this makes it difficult to digest the plot on a laptop and while reading the surrounding text. Also, they should label plots with human-sensical class labels instead of 0 and 1. Also, what do 1, 2 and 3 mean for education levels?
3) I do not know what the 0,1 designations mean. Is 0 the person defaults on the loan or is that 1?
4) The distribution plots overlap heavily and my gut says they would not be statistically different. I would want to see some statistics done to confirm their claim that people with higher card limits are more likely to default.
5) For the next plots, I have no idea what the numbers across the top of the graph are. The claim is that there is a correlation between education and default payment. If that is shown in the diagram as plots 4, 5 and 6, that is such a small part of the data set that I don't think it is convincing or supports your claim.

Data:
6) Class imbalance? I'm not sure where they have shown me this. It might be in Figure 2. How bad is the class imbalance actually?
7) Make it clear in your report which scores are from training, validation, cross validation and test sets.
8) I do not think you have addressed Arjun's comments from last week very well. Some a little, but having descriptions of the data only in the EDA and not in the report does not help me interpret the graphs in the report.
9) Link to the report in the README.

Overall reasoning: 10) The conclusion is not that satisfactory. You might want to comment on whether other data might benefit the model. If so, what kinds of other data? You mention feature engineering generally; what ideas do you have? Or do you perhaps think that this is a really difficult prediction task that might almost never be predicted well?

larahabashy commented 3 years ago

Class imbalance? I’m not sure where they have shown me this. might be in figure 2. How bad is the class imbalance actually?

In EDA report: Before splitting the data set into training (75%) and testing (25%) sets, we inspect class balance to detect any imbalance in the target class. The proportion of defaulting clients in the data is 22.21%.

For Milestone 4, I added a proportions plot to the EDA report, as well as the eda script.
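For anyone who wants to re-check the class balance quickly, here is a minimal, dependency-free sketch of the calculation. The `class_proportions` helper and the toy target list are made up for illustration; this is not the code in our eda script, and the toy values are not the real data.

```python
from collections import Counter


def class_proportions(labels):
    """Return {class label: share of rows} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical toy target: 1 = default, 0 = no default (NOT the real data).
toy_target = [1] * 2 + [0] * 8
proportions = class_proportions(toy_target)

for cls, share in sorted(proportions.items()):
    print(f"class {cls}: {share:.2%}")
```

Running the same helper on the real target column before the 75/25 split should give the proportion quoted in the EDA report.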

I do not think you have addressed Arjun’s comments from last week very well.

Please refer to the issue.

larahabashy commented 3 years ago

Team, I will be addressing the figures feedback shortly. Please have a look at 7, 9, and 10.

larahabashy commented 3 years ago

All figures are missing axis labels

This comment was addressed in Milestone 3, and all figures were further polished for Milestone 4.

Figures take up more space than needed to communicate the information (plot area should be reduced, not font size), this makes it difficult to digest the plot on a laptop, and while reading the surrounding text.

I have adjusted the figure sizes.

Also, they should label plots with human-sensical class labels instead of 0 and 1.

I have labelled the plots with human-sensical class labels.

Also, what does 1, 2 and 3 mean for education levels?

In EDA report: The education feature takes one of 7 numeric values representing a client's education level. A value of 1 denotes a graduate degree, 2 a bachelor's degree, 3 high school, and 4 other education levels (up to high school), while 5 and 6 are undefined. There is no definition for education level 0. We see that only 14 out of 30,000 observations correspond to clients with education level 0.
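As a rough sketch of how the numeric codes could be turned into readable plot labels, the mapping below follows the definitions quoted from the EDA report. The names `EDUCATION_LABELS` and `education_label`, and the choice of "Unknown" as a fallback for codes without a definition (such as 0), are assumptions for illustration, not what the plotting script necessarily does.

```python
# Mapping taken from the EDA report's description of the education feature.
EDUCATION_LABELS = {
    1: "Graduate degree",
    2: "Bachelor's degree",
    3: "High school",
    4: "Other",
    5: "Undefined",
    6: "Undefined",
}


def education_label(code):
    """Translate a numeric education code into a readable label.

    Codes with no documented meaning (for example 0) fall back to
    "Unknown"; that fallback label is an assumption made for this sketch.
    """
    return EDUCATION_LABELS.get(code, "Unknown")


# Hypothetical sample of codes, not taken from the real data.
sample_codes = [1, 2, 3, 0, 5]
print([education_label(code) for code in sample_codes])
```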

I do not know what the 0,1 designations mean. Is 0 the person defaults on the loan or is that 1?

In EDA report: The target variable, default_payment_next_month, takes a value of 1 to indicate that the client's payment is likely to default next month and 0 to indicate no default.

@HazelJJJ Could you please update the README file and the final report to include this information?

The distribution plots are very overlapping and my gut says they would not be statistically different, and I would want to see some statistics done to confirm their claim that people with higher card limits are more likely to default

No such claim should have been made. The point I was trying to make is that a higher proportion of defaulting clients have low credit card limits. Time permitting, I will test this using a hypothesis test.
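One possible way to run such a check without extra dependencies is a two-sided permutation test on the difference in median credit limit between the two groups. This is only a sketch: the thread does not fix which test will be used, the function name `permutation_test_median_diff` is made up, and the toy limits below are invented values, not the real data.

```python
import random
from statistics import median


def permutation_test_median_diff(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in medians.

    Returns (observed difference, approximate p-value). Group labels are
    shuffled under the null hypothesis that both groups share one distribution.
    """
    rng = random.Random(seed)
    observed = median(group_a) - median(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = median(pooled[:n_a]) - median(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical credit limits in thousands (NOT the real data).
limits_default = [20, 30, 50, 50, 80, 100]
limits_no_default = [50, 120, 150, 200, 240, 300]

observed, p_value = permutation_test_median_diff(limits_default, limits_no_default)
print(f"median difference: {observed}, p-value: {p_value:.4f}")
```

A small p-value on the real data would support the observation that defaulting clients tend to have lower limits; a large one would back the reviewer's suspicion that the overlapping distributions are not meaningfully different.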

@d-sel Do you think we can make a force plot to see which features are impacting predictions the most?

The next plots, I have no idea what the numbers across the top of the graph are, and the claim is that there is a correlation between education and default payment. If that is shown in the diagram as plot 4,5 and 6. That is such a small part of the data set that I don’t think it is convincing or supporting your claim

I am not making a claim here, simply an observation: non-defaulting clients include a higher proportion of more highly educated people. The plot has been updated to include labels so that the observation is clearer.

d-sel commented 3 years ago

9: link to report has been updated in README

d-sel commented 3 years ago

@larahabashy

The distribution plots are very overlapping and my gut says they would not be statistically different, and I would want to see some statistics done to confirm their claim that people with higher card limits are more likely to default

No such claim should have been made. The point I was trying to make is that a higher proportion of defaulting clients have low credit card limits. Time permitting, I will test this using a hypothesis test.

@d-sel Do you think we can make a force plot to see which features are impacting predictions the most?

I have created a confusion matrix to help summarize our findings. However, due to complexity in implementation, I have noted the force plot as a future item to add. It will take additional time for the force plot/most important features because we need to refer back to the pre-processing script.
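For reference, here is a minimal sketch of what the confusion matrix counts and the derived recall and precision look like for a binary default/no-default target. The helper `binary_confusion_counts` and the toy labels are hypothetical and written for illustration; they are not the code or the results from our analysis scripts.

```python
def binary_confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN, treating `positive` as the default class."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


# Hypothetical labels: 1 = default, 0 = no default (NOT real model output).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 0]

counts = binary_confusion_counts(y_true, y_pred)
recall = counts["tp"] / (counts["tp"] + counts["fn"])
precision = counts["tp"] / (counts["tp"] + counts["fp"])
print(counts)
print(f"recall: {recall:.2f}, precision: {precision:.2f}")
```

With an imbalanced target, recall and precision on the default class say more than plain accuracy, which is why the counts are worth reporting alongside the plot.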

HazelJJJ commented 3 years ago

Data:

6) Class imbalance? I’m not sure where they have shown me this. might be in figure 2. How bad is the class imbalance actually?

I covered this in the Data part of the report, which gives the percentage of each class in the target as well as the metrics we chose for assessment.

7) Make it clear in your report which scores are from training, validation, cross validation and test sets.

8) I do not think you have addressed Arjun’s comments from last week very well. Some a little, but having only descriptions of the data in the eda and not in the Report does not help me interpret their graphs in the report.

Arjun's comment 1: Are there multiple rows with the same individual, or is the data for each individual in one row? How many time periods are there? Does each individual have the same number of time periods?

Addressed this in the Data part: "There are 30,000 observations of distinct credit card clients in this data set with each row representing a unique client." and in the Analysis part: "In terms of numerical features, each individual has the same 6 monthly time periods for bill statements and previous payments, measured in dollar amounts, as well as the history of past payments PAY_0, PAY_2...PAY_6, which represents the delay of the repayment in months."

Arjun's comment 2: Why choose a random forest model for feature importance? You should probably compare the model to other ones, like logistic regression. Also, what do the feature importance metrics mean in the context of random forest models or logistic regression? Which feature importance measure will you use for random forest? Random forest can also be used as a classifier, which is why I am a little confused. Will you compare this to the EDA in any way?

We decided to try both Random Forest and Logistic Regression on our training data and pick the better one for hyperparameter tuning. We will only briefly discuss feature importance in the EDA report, and we are not planning to analyze it in detail in our model and report.

10) The conclusion is not that satisfactory. You might want to comment on whether other data might benefit the model? If so, what kinds of other data? You mention feature engineering generally, what ideas do you have? Or perhaps do you think that this is a really difficult prediction task that might almost never be able to be well predicted?

We rewrote the results and conclusion to address this issue.

UBC-MDS / DSCI522_group_12

Milestone 2 General Feedback #47
