eddatasci / unrollment_proj

The Unrollment Project: Exploring algorithmic bias in predicting bachelor's degree completion.

Create R function to read in and do basic data wrangling #15

Open wdoyle42 opened 4 years ago

wdoyle42 commented 4 years ago

R function that will read in the data and do basic wrangling.

Inputs: list of names of the dependent variable and independent variables from ELS, name of local data file

Outputs: .rds data file with all-lowercase names, missing data handled, filtered for four-year enrollees only.

Much of this code is now in predict_grad.Rmd, lines 83-130
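For discussion, here is a minimal base R sketch of what a function with these inputs and outputs might look like. It is not the code in predict_grad.Rmd. The function name `wrangle_els`, the `four_year` indicator column, the CSV input format, and the set of negative reserve codes treated as missing are all assumptions for illustration.

```r
# Hypothetical helper: names, the `four_year` column, the CSV format, and the
# missing-value codes are assumptions, not taken from the repository.
wrangle_els <- function(dv, ivs, infile, outfile,
                        enroll_var = "four_year",
                        missing_codes = -9:-1) {
  # Read the local data file (assumed here to be a CSV)
  df <- read.csv(infile, stringsAsFactors = FALSE)

  # Lowercase all column names and the requested variable names
  names(df) <- tolower(names(df))
  vars <- tolower(c(dv, ivs))
  enroll_var <- tolower(enroll_var)

  # Fail early if a requested column is not in the file
  needed <- unique(c(vars, enroll_var))
  absent <- setdiff(needed, names(df))
  if (length(absent) > 0) {
    stop("Columns not found: ", paste(absent, collapse = ", "))
  }

  # Keep four-year enrollees only (assumes the indicator is coded 1 = yes)
  is_four_year <- !is.na(df[[enroll_var]]) & df[[enroll_var]] == 1
  df <- df[is_four_year, vars, drop = FALSE]

  # Recode the assumed reserve codes to NA
  df[vars] <- lapply(df[vars], function(x) {
    x[x %in% missing_codes] <- NA
    x
  })

  # Save as .rds and return the data invisibly
  saveRDS(df, outfile)
  invisible(df)
}
```

A call might look like `wrangle_els("grad", c("byses", "bypared"), "els.csv", "els_clean.rds")`, where the variable and file names are placeholders.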

wdoyle42 commented 3 years ago

OK, did some work on this one; it's in branch issue_15. The dataset includes all base year, f1, and f2 information, drops weights and flags, and adds the outcome (BA completion by the third follow-up), with factors and age formatted appropriately. I think this could be ready for a model that uses regularization.

btskinner commented 3 years ago

Thanks, @wdoyle42. Are you ready for someone else to take a look or are you still working on it?

wdoyle42 commented 3 years ago

@btskinner still working. I'm looking for a way to programmatically drop both non-informative and perfectly collinear variables before handing it off. I'll take one more pass and submit for review.
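One possible way to do both drops programmatically, sketched in base R. The function names are hypothetical, and the collinearity check assumes it is acceptable to consider only numeric columns and complete cases. It relies on the column pivoting in `qr()`, which moves linearly dependent columns to the end, so the first `rank` pivots identify a linearly independent set.

```r
# Hypothetical helpers; names and the complete-case assumption are illustrative.

# Drop columns with at most one distinct non-missing value
drop_uninformative <- function(df) {
  n_unique <- vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1))
  df[, n_unique > 1, drop = FALSE]
}

# Drop numeric columns that are exact linear combinations of earlier ones
drop_collinear <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  x <- as.matrix(na.omit(df[, is_num, drop = FALSE]))
  qr_x <- qr(x)
  # The first `rank` pivot entries index a linearly independent column set
  keep_num <- colnames(x)[qr_x$pivot[seq_len(qr_x$rank)]]
  dropped <- setdiff(colnames(x), keep_num)
  df[, setdiff(names(df), dropped), drop = FALSE]
}
```

Running `drop_collinear(drop_uninformative(df))` applies the two steps in order. Because earlier columns win ties, column order determines which member of a collinear set survives.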

wdoyle42 commented 3 years ago

@btskinner can you take a look at the latest commit? I want to drop highly correlated variables, but keep frequently used composite variables rather than the alternatives (e.g., bypared over its source variables). I'm going in circles on this. Any ideas?
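One idea, offered as a rough sketch rather than a settled approach: pass a vector of protected variable names and resolve each highly correlated pair in favor of the protected member. The function name `drop_correlated`, the 0.9 cutoff, and the greedy ordering are all assumptions. In the usage note, `bymothed` stands in for one of the source variables of `bypared` and is only an illustrative name.

```r
# Hypothetical helper; the cutoff and the greedy rule are assumptions.
drop_correlated <- function(df, protect = character(0), cutoff = 0.9) {
  is_num <- vapply(df, is.numeric, logical(1))
  r <- abs(cor(df[, is_num, drop = FALSE], use = "pairwise.complete.obs"))
  r[is.na(r)] <- 0  # undefined correlations (e.g., constant columns) are ignored
  vars <- colnames(r)

  # Protected variables are visited first, so they win any conflict
  ord <- c(intersect(protect, vars), setdiff(vars, protect))
  dropped <- character(0)

  for (i in seq_along(ord)) {
    if (ord[i] %in% dropped) next
    for (j in seq_along(ord)) {
      if (j <= i) next
      # Never drop a protected variable or one that is already dropped
      if (ord[j] %in% dropped || ord[j] %in% protect) next
      if (r[ord[i], ord[j]] > cutoff) dropped <- c(dropped, ord[j])
    }
  }
  df[, setdiff(names(df), dropped), drop = FALSE]
}
```

With `protect = "bypared"`, a column such as `bymothed` that correlates with `bypared` above the cutoff would be dropped and `bypared` kept. Two protected variables that are highly correlated with each other are both retained, so the protected list needs to be chosen with that in mind.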

btskinner commented 3 years ago

@wdoyle42, I'll take a look and get back.