current12 / Stat-222-Project


Logistic Regression #31

Closed ijyliu closed 4 months ago

ijyliu commented 6 months ago

Use all_data_fixed_quarters.parquet.

Predict credit rating. You should be able to use the financial features (several of them, or just the Altman_Z variable), as well as the Sector variable, since these are in the data. We might also be able to add a few NLP features if those are done in time.

Evaluate accuracy of prediction. Create confusion matrix if time allows.
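The setup described above can be sketched as follows. This is a minimal illustration, not the project's actual notebook: the real code would read `all_data_fixed_quarters.parquet`, while here a synthetic stand-in DataFrame with the column names mentioned in this thread (`Altman_Z`, `Sector`, `Rating`) is used so the example is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# In the real notebook: df = pd.read_parquet('all_data_fixed_quarters.parquet')
# Synthetic stand-in with the columns discussed in this thread
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'Altman_Z': rng.normal(3, 1, 500),
    'Sector': rng.choice(['Tech', 'Energy', 'Retail'], 500),
    'Rating': rng.choice(['A', 'B', 'C'], 500),
})

# One-hot encode the categorical Sector variable
X = pd.get_dummies(df[['Altman_Z', 'Sector']], columns=['Sector'])
y = df['Rating']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Multinomial logistic regression on the training split
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
preds = clf.predict(X_test)

print('Accuracy:', accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds, labels=clf.classes_))
```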

ijyliu commented 6 months ago

@current12 you can start writing code for a multinomial logit for this. the variables will be the same as the existing file even if we swap out the earnings calls

current12 commented 6 months ago

> @current12 you can start writing code for a multinomial logit for this. the variables will be the same as the existing file even if we swap out the earnings calls

Sure!

ijyliu commented 6 months ago

Comments on https://github.com/current12/Stat-222-Project/blob/main/Code/simple_regression.ipynb

print out all variables in the dataset at the top of the code for reference

You can use Rating, because the rating is the rating on the fixed quarter date and the earnings call and financial data are from before that. (you can do next rating also, but that's more of an extra thing)

for the change prediction, i'd run with the upgrade v. downgrade v. constant variable rather than the number of notches of the change

for predictors, I'd run with just Altman_Z at first. Be careful about throwing too many variables in; a lot are collinear. if you do use a bunch of vars, also do a run setting penalty to 'l1' (LASSO penalty) and solver to 'liblinear', and print out the variables you end up including.

for each prediction, show the share of the majority class as a baseline

on average, are our predictions too positive (predicted rating too high) or negative?
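Two of the suggestions above (the majority-class baseline and the L1/liblinear run that prints the surviving variables) can be sketched like this; the feature names besides Altman_Z are illustrative placeholders, and the data is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real features (names other than Altman_Z
# are made up for illustration)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(400, 5)),
                 columns=['Altman_Z', 'leverage', 'roa', 'coverage', 'margin'])
y = pd.Series(rng.choice(['A', 'B', 'C'], 400, p=[0.5, 0.3, 0.2]))

# Baseline: share of the majority class
majority_share = y.value_counts(normalize=True).max()
print(f'Majority-class baseline: {majority_share:.3f}')

# L1 (LASSO) penalty with liblinear drives collinear coefficients to zero
lasso = LogisticRegression(penalty='l1', solver='liblinear', C=1.0).fit(X, y)

# Print the variables that survive the penalty (nonzero coefficient
# in any class)
kept = X.columns[(lasso.coef_ != 0).any(axis=0)]
print('Variables kept:', list(kept))
```

Any fitted model's accuracy should be compared against `majority_share` before claiming it learned anything.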

ijyliu commented 6 months ago

lr_bal = LogisticRegression(random_state=42, class_weight='balanced')

https://analyticsindiamag.com/handling-imbalanced-data-with-class-weights-in-logistic-regression/

ijyliu commented 6 months ago

you can add Sector to the regression too as a categorical
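One way to add Sector as a categorical is to one-hot encode it inside a pipeline, which keeps the encoding consistent between train and test. This is a sketch on synthetic data; the column names match this thread but the values are made up:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(1)
df = pd.DataFrame({
    'Altman_Z': rng.normal(3, 1, 300),
    'Sector': rng.choice(['Tech', 'Energy', 'Retail'], 300),
    'Rating': rng.choice(['A', 'B', 'C'], 300),
})

# One-hot encode Sector inside the pipeline; pass Altman_Z through unchanged
pre = ColumnTransformer(
    [('sector', OneHotEncoder(handle_unknown='ignore'), ['Sector'])],
    remainder='passthrough')

model = Pipeline([('pre', pre), ('lr', LogisticRegression(max_iter=1000))])
model.fit(df[['Altman_Z', 'Sector']], df['Rating'])
preds = model.predict(df[['Altman_Z', 'Sector']])
print(preds[:5])
```

`pd.get_dummies` works too; the pipeline approach avoids column mismatches when new sectors appear at prediction time.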

ijyliu commented 5 months ago

@current12 i'd suggest continuing to work on this as you have time. definitely add a one-hot encoding of Sector. also, it'd be nice if we had section headings (created using ## in markdown) describing each model/section in the notebook so it's easy to scroll through and find stuff. finally, i'd keep adding combinations of the settings (with/without l1, weight balance, others) for all of the groupings of variables (different X vars, different Y vars) as appropriate

ijyliu commented 5 months ago

consider writing a function to reduce code repetition and allow us to easily explore all relevant combinations of settings (l1 vs. l2, different X variable datasets, etc)
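Such a helper function might look like the sketch below (the signature, feature names, and data are all illustrative, not the project's actual function):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def run_logit(X, y, penalty='l2', class_weight=None, C=1.0):
    """Fit one logistic regression configuration and report accuracy."""
    # liblinear is needed for l1; lbfgs handles the default l2 case
    solver = 'liblinear' if penalty == 'l1' else 'lbfgs'
    clf = LogisticRegression(penalty=penalty, solver=solver, C=C,
                             class_weight=class_weight, max_iter=1000)
    clf.fit(X, y)
    acc = accuracy_score(y, clf.predict(X))
    print(f'penalty={penalty}, class_weight={class_weight}: accuracy={acc:.3f}')
    return clf, acc

# Loop over setting combinations instead of copy-pasting notebook cells
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=['Altman_Z', 'x1', 'x2'])
y = pd.Series(rng.choice(['A', 'B'], 200))

for penalty in ['l1', 'l2']:
    for cw in [None, 'balanced']:
        run_logit(X, y, penalty=penalty, class_weight=cw)
```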

current12 commented 5 months ago

done

ijyliu commented 5 months ago

function looks good

you should add more to it, including confusion matrices. then move it to right above "2. Model". after this function, there will be a minimal amount of code in the rest of the notebook, just headings like this

(image: screenshot of example markdown section headings in the notebook)

and then minimal code for function settings, printing the variable names if needed, setting arguments, then a function call

and then another section for the next model, etc.

ijyliu commented 5 months ago

use variable 'train_test_80_20' as train-test split (#54)

ijyliu commented 5 months ago

I suggest adding a calculation of the share of cases where predicted rating is 1 or fewer ratings away from the actual one. And also, the share of cases that have a predicted rating in the same grade (A, B, C, D) as their actual one.
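Both of these "near-miss" metrics only need an ordered rating scale. A sketch, using an illustrative subset of the rating scale and made-up predictions (the real scale and letter-grade mapping should come from the data):

```python
import pandas as pd

# Ordered rating scale (illustrative subset; the real scale comes
# from the dataset)
scale = ['AAA', 'AA', 'A', 'BBB', 'BB', 'B', 'CCC', 'D']
rank = {r: i for i, r in enumerate(scale)}

actual = pd.Series(['AA', 'A', 'BBB', 'BB', 'B', 'CCC'])
predicted = pd.Series(['AAA', 'A', 'BB', 'BB', 'CCC', 'D'])

# Share of predictions within one notch of the actual rating
notch_diff = (actual.map(rank) - predicted.map(rank)).abs()
within_one = (notch_diff <= 1).mean()

# Share of predictions in the same letter grade, taking the first
# character (A, B, C, D) as the grade
same_grade = (actual.str[0] == predicted.str[0]).mean()

print(f'Within one notch: {within_one:.2f}')
print(f'Same letter grade: {same_grade:.2f}')
```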

ijyliu commented 5 months ago

Reminder to update to using new dataset

```python
# Limit to items in the finalized dataset
import os
import pandas as pd

# list of parquet files in '../../../Data/All_Data/All_Data_with_NLP_Features'
data_dir = r'../../../Data/All_Data/All_Data_with_NLP_Features'
file_list = [f for f in os.listdir(data_dir) if f.endswith('.parquet')]
# read in all parquet files into one DataFrame
df = pd.concat([pd.read_parquet(os.path.join(data_dir, f)) for f in file_list])
```

ijyliu commented 5 months ago

I actually suggest using grid search for a variety of parameter settings instead of doing the functions.

Example code attached. Logistic Regression Grid Search Example Code.zip

ijyliu commented 5 months ago

We also need insight into variable importance. So please add a permutation test (look at drop in accuracy when you randomly permute a feature), coefficient significance, or something else specific to logistic regression.
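The permutation test described above is available in sklearn as `permutation_importance`. A sketch on synthetic data, where the label is built to depend on Altman_Z so that feature should show a larger accuracy drop than the noise feature:

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = pd.DataFrame({
    'Altman_Z': rng.normal(size=300),
    'noise': rng.normal(size=300),
})
# Label depends on Altman_Z, so it should rank as the important feature
y = (X['Altman_Z'] + 0.1 * rng.normal(size=300) > 0).map(
    {True: 'high', False: 'low'})

clf = LogisticRegression().fit(X, y)

# Mean drop in accuracy when each feature is randomly permuted
result = permutation_importance(clf, X, y, scoring='accuracy',
                                n_repeats=10, random_state=0)
for name, imp in zip(X.columns, result.importances_mean):
    print(f'{name}: mean accuracy drop = {imp:.3f}')
```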

current12 commented 5 months ago

I just uploaded the latest version. For the grid search part, since the current model is only a baseline and its performance is not bad, I think we can omit the grid search for the best parameters. I'll work on coefficient significance tomorrow.

ijyliu commented 5 months ago

fixed file paths and moved to https://github.com/current12/Stat-222-Project/tree/main/Code/Modelling/Logistic%20Regression

I do think grid search is important. I'm pretty sure they're expecting us to explain hyperparameter choices and the bias-variance tradeoff (they mentioned this in lecture several times), and for l1/l2/elasticnet logistic regression that means setting C (the inverse of lambda). you can do it on SCF if it gets too slow; you don't have to explicitly parallelize anything other than setting n_jobs=-1 (see below), and you can request as many CPUs as you want (you could also restrict to solver 'saga'). below is basically the only code you need

```python
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

hyperparameter_settings = [
    # Non-penalized
    {'solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
     'penalty': [None],
     'C': [1],  # C is irrelevant here but required as a placeholder
     'class_weight': [None, 'balanced'],
     'multi_class': ['ovr', 'multinomial']},
    # ElasticNet penalty
    {'solver': ['saga'],
     'penalty': ['elasticnet'],
     'C': [0.001, 0.01, 0.1, 1, 10, 100],
     'l1_ratio': [0.0, 0.25, 0.5, 0.75, 1.0],
     'class_weight': [None, 'balanced'],
     'multi_class': ['ovr', 'multinomial']}
]

# Fit model
# Perform grid search with 5-fold cross-validation
lr = LogisticRegression(max_iter=1000)  # higher to encourage convergence
gs = GridSearchCV(lr, hyperparameter_settings, scoring='accuracy',
                  cv=5, n_jobs=-1).fit(X, y)

print("tuned hyperparameters: ", gs.best_params_)
print("accuracy: ", gs.best_score_)
print("best model: ", gs.best_estimator_)

# Dump the best model to a file
joblib.dump(gs.best_estimator_, 'Best Logistic Regression Model.joblib')
```

let's not do any train-test splitting in this code and use the 'train_test_split_80_20' variable always. if we decide to fix the split to include every class (#54) we will do that upstream
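Using the pre-assigned split column instead of splitting in code might look like this (assuming the column holds 'train'/'test' labels; the exact values should be confirmed against the dataset):

```python
import pandas as pd

# Tiny illustrative frame; the real df comes from the parquet files
df = pd.DataFrame({
    'Altman_Z': [1.2, 3.4, 2.1, 0.8],
    'train_test_split_80_20': ['train', 'train', 'test', 'train'],
})

# Filter on the pre-assigned split column rather than calling
# train_test_split in this notebook
train = df[df['train_test_split_80_20'] == 'train']
test = df[df['train_test_split_80_20'] == 'test']
print(len(train), len(test))
```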

for each model, I suggest saving an Excel file with accuracy, precision, F1, etc. and also the plot of the confusion matrix for us to use in the writeups. for the table you can just use the classification_report built-in to sklearn

from sklearn.metrics import classification_report

you can output a table of the coefficient significance to Excel also

current12 commented 5 months ago

I've uploaded the latest one with grid search and results in the notebook

ijyliu commented 5 months ago

it looks like your code doesn't have class D. did you git pull before running? be sure to do that so you have the latest version of the data.

it also looks like some of the model runs still use test_size. just delete the code that has the setting from the functions and fix any errors that arise

please do try to use the '../..' relative paths so that code is runnable on other people's machines (this might mean you have to run the notebook from the folder it is located in).

other than that i think we can go ahead and get set up to produce all the outputs: an excel file of the classification report for each run, an excel file of coefficient significance, and a png of the confusion matrix. you can put these in Output/Modelling and create a Logistic Regression folder there.

ijyliu commented 5 months ago

i produced this, which should be helpful to you in several ways.

https://github.com/current12/Stat-222-Project/blob/main/Code/Data%20Loading%20and%20Cleaning/All%20Data/Variable%20Index.xlsx

first, it has a match from the raw variable names to a nicely formatted version suitable for tables and figures.

second, it has information on how we should use each variable. you can use this to refine your variable groupings and decide what to include/exclude. don't include things that are disallowed and don't include Predicted - Change when you are predicting ratings and vice versa.

you can load it in as dataframe and/or create a dictionary mapping or whatever to make it easy to rename variables and pick variables for models.

```python
import pandas as pd

# Load '../../Data Loading and Cleaning/All Data/Variable Index.xlsx'
var_index = pd.read_excel('../../Data Loading and Cleaning/All Data/Variable Index.xlsx')
# Keep 'column_name' and 'Clean Column Name'
var_index = var_index[['column_name', 'Clean Column Name']]
# Create dictionary mapping column_name to Clean Column Name
var_index_dict = dict(zip(var_index['column_name'], var_index['Clean Column Name']))
var_index_dict
```

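The resulting dictionary can then be used to rename raw columns to the clean names for tables and figures. A sketch with a hypothetical two-entry mapping in the shape of `var_index_dict` (the actual keys and clean names live in Variable Index.xlsx):

```python
import pandas as pd

# Hypothetical mapping in the shape of var_index_dict
var_index_dict = {'altman_z': 'Altman Z-Score', 'rating': 'Credit Rating'}

df = pd.DataFrame({'altman_z': [1.2, 3.4], 'rating': ['A', 'B']})

# Rename raw columns to the clean names for tables and figures
df_clean = df.rename(columns=var_index_dict)
print(list(df_clean.columns))
```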
current12 commented 5 months ago

> it looks like your code doesn't have class D. did you git pull before running? be sure to do that so you have the latest version of the data.
>
> it also looks like some of the model runs still use test_size. just delete the code that has the setting from the functions and fix any errors that arise
>
> please do try to use the '../..' relative paths so that code is runnable on other people's machines (this might mean you have to run the notebook from the folder it is located in).
>
> other than that i think we can go ahead and get set up to produce all the outputs: excel file of the classification report for each run, excel file of coefficient significance, png of confusion matrix. you can put these in Output/Modelling and create a Logistic Regression folder there.

what is class D?

ijyliu commented 5 months ago

rating D I meant

ijyliu commented 5 months ago

@current12

table output for reports (Excel + LaTeX) will be constructed in separate individual files. @OwenLin2001 and @ijyliu can assist after they finish the readme/cleaning step work and share other helpful report materials

ijyliu commented 5 months ago

mockups attached and also in the folder

Table Mockups.xlsx

ijyliu commented 5 months ago

@current12 let us know once you have run all the models on the most recent data and saved everything and we can start working on output

ijyliu commented 5 months ago

see https://github.com/current12/Stat-222-Project/issues/31#issuecomment-2033191620

current12 commented 5 months ago

> @current12 let us know once you have run all the models on the most recent data and saved everything and we can start working on output

Owen is helping me run the output; he will upload the results.

ijyliu commented 5 months ago

just pushed a fix to the feature columns. (using rating_on_previous_fixed_quarter_date, not previous rating, which is the last rating in the raw credit data). this will probably improve accuracy.

please see the attached Variable Index.xlsx

we may also want to exclude the items that are Constructed for Tone, because those are PCA'd into TONE1 through a linear combination (PN, SW, AP, OU are at least; the other ones enter into these as ratios, so maybe there's value in keeping them)

the items marked as metadata are allowed but not super-well explored yet, probably don't want to add those right now.

ijyliu commented 4 months ago

code completed, pending edits to underlying variables