LeDataSciFi / FinTech-Capstone-2023

https://ledatascifi.github.io/FinTech-Capstone-2023/
3 stars 0 forks source link

Descriptive Stats change #19

Closed jum223 closed 1 year ago

jum223 commented 1 year ago

@donbowen Just finished with the changes to the tables in descriptive stats file. I think that they look much better now, let me know if you like the new format of the tables. I am doing the pull request now because there were some people interested in looking at the tables to analyze them in the paper.

I have not yet changed the regression file, but I will now get onto trying to replicate what we talked about after class.

jum223 commented 1 year ago

@donbowen just updated the bank_tract dataset to include all of the tracts for both which_bank values (BoW) and competitors, I think that the code works. But now I am unsure on how to proceed with the regressions. Could you again remind me of what rows would be necessary to run the regressions?

donbowen commented 1 year ago

I reviewed your code.

  1. It seems like the dataset had 3 tracts where ONLY BoW operated? No other banks! ... Unexpected. Ok.
  2. I'd rename bank_tract_new1 something like bank_tract_full and add comments/markdown so it's clear why we made this change.
  3. To run the regressions:
    1. Before the regression block of code, bank_tract_full['tract_denial_count'] = bank_tract_full.groupby('census_tract')['denial_count'].transform(count)
    2. Then in the regressions, model 1 and 2 is data=bank_tract_full.query('tract_denial_count == 2')
jum223 commented 1 year ago

@donbowen just did the updates you mentioned above, I also decided on more control variables to use for the regression models you can see it in the new push I just did. I didn't include them straight away because I wanted to ask if I should do any winsorizing/removing outliers or other data preprocessing steps before I include them in the regression models.

donbowen commented 1 year ago

Models 3 and 4 don't have enough observations. It means some of your variables in the regression are missing.

Check:

bank_tract_full[['log_num_apps','hisp_rate','hisp_over_med','which_bank','median_all_income','mean_LTV']].count()
jum223 commented 1 year ago

I get the following output:

log_num_apps         18680
hisp_rate            18676
hisp_over_med        18680
which_bank           18680
median_all_income    13435
mean_LTV             13833

Hispanic rate will have less than 18680 observations because there are probably tracts with no hispanic population, so no hispanic rate. Regarding median_all_income and mean_LTV, these have less observations because the new rows we created have np.NaN values for the bank-specific variables, am thinking that the solution of this would be to populate these variables with 0s if there is a np.NaN value.

donbowen commented 1 year ago

Hispanic rate will have less than 18680 observations because there are probably tracts with no hispanic population, so no hispanic rate.

Fill with 0.

Regarding median_all_income and mean_LTV, these have less observations because the new rows we created have np.NaN values for the bank-specific variables, am thinking that the solution of this would be to populate these variables with 0s if there is a np.NaN value.

This doens't work here. If you fill with 0, you need to also include a boolean variable = 1 in rows where you did that fill, except that boolean is HIGHLY correlated with y in this model.

So, we to slightly re conceive the models.

# which tracts do you get any apps? (extensive margin)
# sample: all tracts 
'(app>0)~ minority_var * which_bank + census_tract_vars'

# how many apps are made? (extensive margin)
# sample: all tracts 
'logapp~ minority_var * which_bank + census_tract_vars'

# denial rates 
# sample: on tracts with both banks
'denial ~ minority_var * which_bank + census_tract_vars  + bank_specific_vars'

My suggestions on controls:

Then, I'm going to suggest changing this to 2 tables

We still need to show info on the other minorities and overall. So the below should result in 2 nice tables after you make it work.

table 3: denial rate is always the y variable, only use the "over median" version of minority districts

Note:

census_tract_vars = ['Ratio of Income to Poverty Level','Med.HousehIncome']
bank_tract_vars = ['mean_LTV', .... # finish]

results = []
x_vars = ['hisp_over_med','black_over_med','asian_over_med',majority_minority']

for v in x_vars :
    formula = f'denial_rate ~ {v} * which_bank + {'+'.join(census_tract_vars)} '
    print(formula)
    results.extend(
        smf.ols(formula,data=bank_tract_full.query("tract_denial_count == 2")).fit()
    )

# now I'll format an output table
# I'd like to include extra info in the table (not just coefficients)
info_dict={'R-squared' : lambda x: f"{x.rsquared:.2f}",
           'Adj R-squared' : lambda x: f"{x.rsquared_adj:.2f}",
           'No. observations' : lambda x: f"{int(x.nobs):d}"}

# instead of having a row for each minority vairable, combine into one row, column labels will distinguish
# create custom variable labels for the variables you want to combine
varlabels = {v:'High Minority' for v in x_vars}

# This summary col function combines a bunch of regressions into one nice table
print(summary_col(results=results, 
                  float_format='%0.2f',
                  stars = True, 
                  model_names=x_vars,
                  varlabels = varlabels,
                  info_dict=info_dict,
                  regressor_order=[ 'Intercept', "High Minority", "which_bank[T.BankofWest]", "High Minority:which_bank[T.BankofWest]",
                                 "median_all_income", "mean_LTV"]
                  )
     )

table 2: log(1+numapps) is always the y variable, only use the "over median" version of minority districts

jum223 commented 1 year ago

Comments

Ratio of Income to Poverty Level But we need to know what's the scaling on that variable. The median is 4500??

We already looked into this in class with the people doing the code and did some research on the variable and could not determine the reason for such large numbers, maybe we could exclude it as a control.

bank_tract_vars: mean_LTV applicant income mean dti

We don't have a mean debt-to-income ratio variable.

I will get working on the updated regression model above, I will let you know if I have any questions.

jum223 commented 1 year ago

On another note, should I do it for all minorities? This is the list of possible groups: Black, American Indian, Asian, Native Hawaiian Pacific, Hispanic.

donbowen commented 1 year ago

On another note, should I do it for all minorities? This is the list of possible groups: Black, American Indian, Asian, Native Hawaiian Pacific, Hispanic.

I'd do the largest groups: Black, Asian, Hispanic. Is a census tract an "over median" for each. I think we have those variables already, or can make easily. We also have that minority-majority variable. So 4 variables.

jum223 commented 1 year ago

@donbowen Just pushed the regression changed file with what I understood you asked for. It has 4 different regression tables and md text to explain what each table is doing.

donbowen commented 1 year ago

Ok, looks much nicer! closing in...

  1. i see one single census tract control? add more...
  2. There shouldn't be two majority-minority variables. Either the population of a tract is over or under 50% white.
  3. The actual tables should look like the below. The varlabels thing should accomplish this if done right. If you can't do it in python, you;ll do it manually...
==============================================================================================================
                                             y = denial_rate
==============================================================================================================
High Minority:                                      Hispanic    Black           Asian           Maj-Minority 
--------------------------------------------------------------------------------------------------------------
Intercept                                           0.13***       0.15***        0.18***        0.11***       
                                                    (0.01)        (0.01)         (0.01)         (0.01)        
which_bank[T.BankofWest]                            0.09***       0.07***        0.00           0.12***       
                                                    (0.00)        (0.01)         (0.01)         (0.01)        
MedHousehIncome                                     0.00          0.00           -0.00          0.00          
                                                    (0.00)        (0.00)         (0.00)         (0.00)        
hisp_over_med                                       0.05***      0.02***         -0.04***    0.06***                                       
                                                    (0.01)       (0.01)          (0.01)      (0.02)                                        
hisp_over_med:      which_bank[T.BankofWest]        -0.09***     -0.04***        0.08***     -0.12***                                      
                                                    (0.01)       (0.01)          (0.01)      (0.03)                                        
major_minority[T.Hispanic]                                                                      0.06***       
                                                                                                (0.01)        
major_minority[T.Hispanic]:which_bank[T.BankofWest]                                             -0.10***      
                                                                                                (0.01)        
R-squared                                           0.03          0.02           0.03           0.04          
R-squared Adj.                                      0.03          0.02           0.03           0.04          
R-squared                                           0.03          0.02           0.03           0.04          
Adj R-squared                                       0.03          0.02           0.03           0.04          
No. observations                                    8992          8992           8992           8992          
==============================================================================================================
jum223 commented 1 year ago

i see one single census tract control? add more...

Got it, I'll add a couple more

There shouldn't be two majority-minority variables. Either the population of a tract is over or under 50% white.

I only have one majority-minority variable it is being calculated based on the minority rates, eg if the hisp_rate variable is greater than the black_rate and asian_rate variables, it is set to Hispanic. I think this is how we did it in the past. From your comment I am understanding that instead it should be a boolean variable set to True (1) if the minority population of the tract is greater than the white population?

Also, I am struggling to conceptualize how to make the tables look like that, just from having the variables included in the regression models they will appear on the table.

jum223 commented 1 year ago

No worries, I think I got it. But what I'm unsure about is why we would include the major_minority variable inside of the x_vars, at the end of the day the regression table should only include three columns (hispanic, black, and Asian).

I've also manually changed the denial rates table that uses minority rates and put it in a txt file which can be found in the code folder, is this what you are looking for?

donbowen commented 1 year ago

Let me be direct to save time and I’ll explain tomorrow:

  1. it’s best that column 4 just uses a variable that equals 1 if the white fraction is under 50%.
  2. The text file looks great! That’s exactly what we want at the end, but you’re wasting time making it until we have complete agreement.
  3. MedHousehIncome And the other ones should be logged.
  4. check the correlation between the controls (after logging), if they are above say, 60% then just pick household income.

Looking forward to seeing them tomorrow!

jum223 commented 1 year ago

it’s best that column 4 just uses a variable that equals 1 if the white fraction is under 50%.

Got it done

MedHousehIncome And the other ones should be logged.

I logged the census_tract vars, should the bank_tract vars be logged too?

check the correlation between the controls (after logging), if they are above say, 60% then just pick household income

Just checked and they are indeed very highly correlated. On another note I also discovered what the RatioOfIncomeToPovertyLevel variable is representing, this variable is actually just the total population for which this ratio is computed. Nevertheless, we could still use this variable and some others, namely codes C17002_002E (ratio of income to poverty in the past 12 months, < 0.50) and C17002_003E (ratio of income to poverty in the past 12 months, 0.50 - 0.99), to calculate the poverty rate which we could then use as a control in the regression model. These two vars are not part of the .csv file we used to load census variables, so the codes along with the var name would have to be added.

donbowen commented 1 year ago

I logged the census_tract vars, should the bank_tract vars be logged too?

If it's a variable that is distributed like "income", yeah. Which are you talking about?

Just checked and they are indeed very highly correlated. On another note I also discovered what the RatioOfIncomeToPovertyLevel variable is representing, this variable is actually just the total population for which this ratio is computed.

Lol yeah delete that variable.

Nevertheless, we could still use this variable and some others, namely codes C17002_002E (ratio of income to poverty in the past 12 months, < 0.50) and C17002_003E (ratio of income to poverty in the past 12 months, 0.50 - 0.99)

Please copy the description of those variables here. What are they exactly?

jum223 commented 1 year ago

If it's a variable that is distributed like "income", yeah. Which are you talking about?

The mean_LTV and median_all_income variables

Lol yeah delete that variable.

Got it

Please copy the description of those variables here. What are they exactly?

Code C17002_002E is number of people who's ratio of income to poverty in the past 12 months is < 0.50, Code C17002_003E is number of people who's ratio of income to poverty in the past 12 months is between 0.50 - 0.99. If you add these two vars together and divide by the RatioOfIncomeToPovertyLevel variable (total population for which this ratio is calculated) we can get the poverty rate.

donbowen commented 1 year ago

The mean_LTV and median_all_income variables

Mean LVT shouldn’t be logged. Maybe winsodized! There is a function for that in the community codebook.

Median all income should be. But check its correlation with the income vars in the census. If it’s high enough we can’t use it BUT we could define a different var:

bank_gets_highinc_subpop = median all income > median income from census

Code C17002_002E is number of people who's ratio of income to poverty in the past 12 months is < 0.50, Code C17002_003E is number of people who's ratio of income to poverty in the past 12 months is between 0.50 - 0.99. If you add these two vars together and divide by the RatioOfIncomeToPovertyLevel variable (total population for which this ratio is calculated) we can get the poverty rate.

Awesome - great explanation. Implement!

donbowen commented 1 year ago

I do have a concerned about the majority minority regressions, because the coefficients are always zero. This suggests that there is an issue. You should investigate. Matt and Kathleen will ask about it.

Also, you should clarify to your classmates how they should interpret the columns - the white variable is the same but the column defines what the high minority variable is.

On Tue, Apr 11, 2023 at 6:13 PM Juan Mozos @.***> wrote:

@donbowen https://github.com/donbowen Just pushed the regression changed file with what I understood you asked for. It has 4 different regression tables and md text to explain what each table is doing.

— Reply to this email directly, view it on GitHub https://github.com/LeDataSciFi/FinTech-Capstone-2023/pull/19#issuecomment-1504195526, or unsubscribe https://github.com/notifications/unsubscribe-auth/AMEHJ2YD7JHSNWNIXD32PCDXAXJQXANCNFSM6AAAAAAWZQGEQI . You are receiving this because you were mentioned.Message ID: @.***>

-- Sent from my phone. Please excuse brevity and typos.

jum223 commented 1 year ago

I do have a concerned about the majority minority regressions, because the coefficients are always zero. This suggests that there is an issue. You should investigate. Matt and Kathleen will ask about it.

Will investigate it.

Also, you should clarify to your classmates how they should interpret the columns - the white variable is the same but the column defines what the high minority variable is.

Yeah, they know about that. Either way I will be meeting with Matt Romano, who will help me with the regression interpretations, and I will discuss it with him.

jum223 commented 1 year ago

@donbowen Hello professor, I was looking at the Regression file and noticed that for y = log_num_apps and y = Receives any apps in the tract the control vars being used are just the census_tract_vars (doesn't include the bank_tract_vars). I was wondeing if there is a reason for this when we updated the regression tables last Wednesday? See formula:

formula = f"denial_rate ~ Q('High Minority') * which_bank + {'+'.join(census_tract_vars)} + {'+'.join(bank_tract_vars)}"

VS

formula = f"log_num_apps ~ Q('High Minority') * which_bank + {'+'.join(census_tract_vars)}"

I also noticed that bank_tract_full["bank_gets_highinc_subpop"] = bank_tract_full["median_all_income"] > bank_tract_full["MedHousehIncome"] will be False for all of the observations, should we still use this variable as a control then?

I also windsorized the mean_LTV variable to use in the models.

I do have a concerned about the majority minority regressions, because the coefficients are always zero.

Regarding this point, I couldn't find an explanation of why the coefficients are always 0 in these columns. Nevertheless I was wondering what the nature of this column is and how we would use it for our analysis. Is it that the purpose of this column is to look at denial_rate in majority minority tracts?

Also, for the regression interpretations should we focus our analysis, on the interactions terms?

jum223 commented 1 year ago

Also, for the third table we created, wouldn't it be more appropriate to use a logistic regression given that some_apps is a binary variable?