Imputing wealth/parent income and removing parent savings from parent net worth in college enrollment count

Addressing Issue #1 by imputing wealth/parent income in a new script, nlsy_impute.

Main question for team discussion in Issue #1: The current imputation uses a limited set of predictor variables: student age, sex, and a combined race and ethnicity variable. I originally tried a larger set of predictors, but multiple imputation with the mice package is sensitive to predictors with a lot of missingness and to highly correlated predictors. It would also be good to review the literature to see which predictors are traditionally included when imputing missing wealth and income, and to make sure we have enough predictors for the imputation to be reasonably accurate.

Addressing Issue #3 by removing parent savings from college enrollment count. Adapted the nlsy_lib script and the nlsy_get_col_stat_annual_df function. Used the newly imputed net worth and subtracted net savings. If needed, this can be replicated in the function that only looks at fall semester enrollment.

Also, I created a new joint race and ethnicity variable in the nlsy_lib script.

Thank you, Sonia! I have a couple of thoughts regarding predictors used in imputation.

Predictors should be mostly parents' characteristics rather than the youth's, except perhaps race and ethnicity, because the youth's race captures the parents' race. Aside from race, parents' characteristics should include their age and education, measured as of the time of the interview (i.e., in 1997). NLSY has mother's age at youth's birth (although it's not in our current sample), but I couldn't find father's age. The survey also has parent's highest grade completed. Some other potential predictors are indicators for renting vs. owning a home and for having retirement savings.

We should also get a better understanding of what parent's income and wealth contain for youths who don't live with both parents, and include variables that indicate such living arrangements.

To verify that predictors are significant and to check for collinearity, you could run a regression of the variable being predicted (e.g. wealth) on the set of predictors being considered, while restricting the sample to those with non-missing wealth observations.
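A minimal sketch of that check in base R, for illustration only. The data frame is simulated and every column name (mom_age, parent_educ, owns_home, has_retsav, wealth) is a hypothetical stand-in, not the actual NLSY variable names. The VIF is computed by hand as 1 / (1 - R^2) from regressing each predictor on the others, so no extra package is assumed.

```r
# Simulated stand-in for the NLSY extract; all column names are hypothetical.
set.seed(1)
n <- 500
nlsy <- data.frame(
  mom_age     = round(rnorm(n, 40, 6)),
  parent_educ = sample(8:20, n, replace = TRUE),
  owns_home   = rbinom(n, 1, 0.6),
  has_retsav  = rbinom(n, 1, 0.4)
)
nlsy$wealth <- 5000 * nlsy$owns_home + 800 * nlsy$parent_educ +
  rnorm(n, sd = 4000)
nlsy$wealth[sample(n, 100)] <- NA  # mimic missing wealth reports

# Restrict the sample to respondents with non-missing wealth.
observed <- nlsy[!is.na(nlsy$wealth), ]

# Regress wealth on the candidate predictors and inspect significance.
predictors <- c("mom_age", "parent_educ", "owns_home", "has_retsav")
fit <- lm(reformulate(predictors, response = "wealth"), data = observed)
print(summary(fit)$coefficients)

# Variance inflation factor for each predictor:
# regress it on the remaining predictors and take 1 / (1 - R^2).
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = observed)
  1 / (1 - summary(aux)$r.squared)
})
print(round(vif, 2))
```

VIF values close to 1 suggest little collinearity; what counts as too high is a judgment call for the team.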

We should also think of steps that would help us validate imputation results. For example, plotting the distributions of non-missing and imputed values would be helpful. Also, because we are imputing income and wealth jointly, a scatterplot of these two variables would be useful.
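One possible shape for these checks, sketched in base R. The data here are simulated placeholders, and the names completed, was_missing, income, and wealth are assumptions: the sketch presumes a single completed dataset with a flag marking which rows had wealth imputed, which would need to be built from the actual imputation output.

```r
# Placeholder data: `was_missing` marks rows whose wealth was imputed.
# Values are simulated for illustration, not real imputation results.
set.seed(2)
n <- 300
completed <- data.frame(
  income      = rnorm(n, mean = 50000, sd = 15000),
  was_missing = rbinom(n, 1, 0.25) == 1
)
completed$wealth <- 1.5 * completed$income + rnorm(n, sd = 30000)

# Side-by-side quantiles of observed vs imputed wealth.
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
q_table <- sapply(split(completed$wealth, completed$was_missing),
                  quantile, probs = probs)
colnames(q_table) <- c("observed", "imputed")  # split() orders FALSE, TRUE
print(round(q_table))

# Overlaid density curves for the two groups.
d_obs <- density(completed$wealth[!completed$was_missing])
d_imp <- density(completed$wealth[completed$was_missing])
plot(d_obs, main = "Wealth: observed vs imputed", xlab = "wealth",
     xlim = range(d_obs$x, d_imp$x), ylim = range(d_obs$y, d_imp$y))
lines(d_imp, lty = 2)
legend("topright", legend = c("observed", "imputed"), lty = 1:2)

# Income-wealth scatterplot, with imputed rows highlighted.
plot(completed$income, completed$wealth, pch = 16,
     col = ifelse(completed$was_missing, "red", "grey40"),
     xlab = "income", ylab = "wealth")
legend("topleft", legend = c("observed", "imputed"),
       col = c("grey40", "red"), pch = 16)
```

With several imputations, the same comparison could be repeated for each completed dataset.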

I hope this is helpful. Aside from these thoughts on imputation, I have some questions regarding changes in nlsy_lib.R. I will leave them as inline comments.

Thanks for the comments and support, Damir! I have reverted all suggestions in the lib script, and addressed the comments on the imputation script. Updates on this script include:

Added new parent indicators (such as mom age, homeownership, retirement savings)
Created new dummy variables for relevant categorical variables (like parent education)
Checked for significance on subset of data with non-missing income and wealth. Excluded father education and dummy for deceased parents.
Added new multicollinearity checks (correlation plots, stepwise regression, and VIF). Did not identify major concerns, so proceeded with selected parent indicators.
Created new data frames which show original wealth and income, values for original indicators, and the 10 iterations of newly imputed income and wealth
Added plots comparing imputed and non-imputed values (for both income and wealth)
Added scatter plot comparing income and wealth for non-imputed values and each iteration of imputation

If this looks closer to what we would like, we can make a decision about how to call these imputed values in future modelling. Look forward to the team's thoughts!

Imputing wealth/parent income and Removing parent savings from parent net worth in college enrollment count #11