Closed storresrod closed 11 months ago
Thank you, Sonia! I have a couple of thoughts regarding the predictors used in imputation.
Predictors should mostly be parents' characteristics rather than the youth's, except perhaps for race and ethnicity, because the youth's race captures the parents' race. Aside from race, parents' characteristics should include their age and education, measured at the time of the interview (i.e., in 1997). NLSY has mother's age at youth's birth (although it's not in our current sample), but I couldn't find father's age. The survey also has parent's highest grade completed. Some other potential predictors are indicators for renting vs. owning a home, and for having retirement savings.
We should also get a better understanding of what parents' income and wealth contain for youths who don't live with both parents, and include variables that indicate such living arrangements.
To verify that predictors are significant and to check for collinearity, you could run a regression of the variable being predicted (e.g. wealth) on the set of predictors being considered, while restricting the sample to those with non-missing wealth observations.
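A minimal sketch of that check, using simulated data. All column names (`parent_educ`, `mother_age`, `owns_home`, `wealth`) are hypothetical stand-ins, not actual NLSY variable names, and the variance inflation factors are computed by hand in base R rather than with a dedicated package.

```r
# Simulated stand-in data; column names are hypothetical, not NLSY variables.
set.seed(1)
n <- 500
df <- data.frame(
  parent_educ = sample(8:20, n, replace = TRUE),  # highest grade completed
  mother_age  = round(rnorm(n, 40, 6)),           # age at the 1997 interview
  owns_home   = rbinom(n, 1, 0.6)                 # 1 = owns, 0 = rents
)
df$wealth <- 5000 * df$parent_educ + 800 * df$mother_age +
  30000 * df$owns_home + rnorm(n, 0, 40000)
df$wealth[sample(n, 100)] <- NA                   # mimic missing wealth

# Restrict to non-missing wealth and regress it on the candidate predictors.
obs <- df[!is.na(df$wealth), ]
fit <- lm(wealth ~ parent_educ + mother_age + owns_home, data = obs)
print(summary(fit)$coefficients)  # estimates, std. errors, t and p values

# Collinearity check: VIF = 1 / (1 - R^2), where R^2 comes from regressing
# each predictor on the remaining predictors.
predictors <- c("parent_educ", "mother_age", "owns_home")
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = obs))$r.squared
  1 / (1 - r2)
})
print(round(vif, 2))
```

A VIF well above the others would flag a predictor that is largely explained by the rest of the set and may be worth dropping before imputation.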
We should also think of steps that would help us validate imputation results. For example, plotting the distributions of non-missing and imputed values would be helpful. Also, because we are imputing income and wealth jointly, a scatterplot of these two variables would be useful.
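The two plots could be sketched as follows in base R. The data are simulated, and the columns `log_income`, `log_wealth`, and the `imputed` flag are hypothetical; in practice the flag would come from the missingness pattern of the original data.

```r
# Simulated stand-in for one completed dataset; column names are hypothetical.
set.seed(2)
n <- 400
d <- data.frame(
  log_income = rnorm(n, 10.5, 0.8),
  imputed    = rep(c(FALSE, TRUE), times = c(300, 100))  # TRUE = value was imputed
)
d$log_wealth <- 1.2 * d$log_income - 2 + rnorm(n, 0, 1)

# Overlaid densities of non-missing and imputed wealth.
dens_obs <- density(d$log_wealth[!d$imputed])
dens_imp <- density(d$log_wealth[d$imputed])
plot(dens_obs, main = "Log wealth: observed vs. imputed", xlab = "log wealth",
     ylim = range(0, dens_obs$y, dens_imp$y))
lines(dens_imp, lty = 2)
legend("topright", legend = c("observed", "imputed"), lty = c(1, 2))

# Joint check: income against wealth, with imputed points marked separately.
plot(d$log_income, d$log_wealth,
     pch = ifelse(d$imputed, 17, 1),
     col = ifelse(d$imputed, "red", "grey40"),
     xlab = "log income", ylab = "log wealth")
legend("topleft", legend = c("observed", "imputed"),
       pch = c(1, 17), col = c("grey40", "red"))
```

If the imputation is run with mice, its `densityplot()` and `xyplot()` methods for imputed-data objects produce similar diagnostics directly.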
I hope this is helpful. Aside from these thoughts on imputation, I have some questions regarding changes in nlsy_lig.R. I will leave them as inline comments.
Thanks for the comments and support, Damir! I have reverted all suggestions in the lib script and addressed the comments on the imputation script. Updates on this script include:
Addressing Issue #1 by imputing wealth/parent income in new script `nlsy_impute`. Main question for team discussion in Issue #1: The current imputation uses a limited set of predictor variables, including student age, sex, and a combined race and ethnicity variable. I originally attempted including a larger set of predictor variables, but multiple imputation with the mice package is sensitive to predictor variables with a lot of missingness and to highly correlated predictors. At the same time, it would be good to review literature to see what types of predictors are traditionally included when imputing missing wealth and income, and to ensure we have sufficient predictor variables for imputation to be more accurate.
Addressing Issue #3 by removing parent savings from college enrollment count. Adapted the `nlsy_lib` script and the `nlsy_get_col_stat_annual_df` function. Utilized newly imputed net worth and subtracted net savings. If needed, this can be replicated in the function which only looks at fall semester enrollment.
Also, I created a new joint race and ethnicity variable in the `nlsy_lib` script.
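For the Issue #1 discussion, here is a rough sketch of a joint income and wealth imputation with mice using only a small predictor set. It is not the contents of the imputation script: the data are simulated, every column name is hypothetical, and the settings (`m = 5`, default predictive mean matching for numeric columns) are assumptions chosen for illustration.

```r
library(mice)

# Simulated stand-in data; column names are hypothetical, not NLSY variables.
set.seed(3)
n <- 300
dat <- data.frame(
  age      = sample(12:17, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  race_eth = factor(sample(c("group_a", "group_b", "group_c"), n, replace = TRUE))
)
dat$log_income <- 10 + 0.05 * dat$age + rnorm(n, 0, 0.7)
dat$log_wealth <- 1.1 * dat$log_income + rnorm(n, 0, 1)

# Introduce missingness in both targets, with partially overlapping rows.
dat$log_income[sample(n, 60)] <- NA
dat$log_wealth[sample(n, 90)] <- NA

# Multiple imputation: each incomplete column is imputed from all other columns.
imp <- mice(dat, m = 5, seed = 123, printFlag = FALSE)

print(imp$method)           # imputation method chosen per column
print(imp$predictorMatrix)  # rows = imputed variable, columns = predictors used

# Extract the first completed dataset for inspection.
completed <- complete(imp, 1)
summary(completed[, c("log_income", "log_wealth")])
```

Printing `imp$predictorMatrix` makes it easy to see which variables feed each imputation model. mice also provides `quickpred()`, which builds a predictor matrix from correlation and usable-case thresholds and may help when a larger candidate set includes predictors with heavy missingness.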