ijyliu closed 3 months ago
Potentially ask Libor
We won't be missing many items, so I suggest simply dropping every observation that is missing any of the covariates.
https://github.com/current12/Stat-222-Project/issues/20#issuecomment-2007769087
Proposed solution: drop items missing any variable in test set (so accuracy/performance is comparable), but still allow items missing variables in training set (make full use of the data we have).
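A minimal sketch of the test-only drop proposed above, in plain Python. It assumes each observation is a dict and that a missing value is stored as `None`; the covariate names and the helper names `is_complete` and `filter_split` are hypothetical, not taken from the project code.

```python
# Hypothetical covariate names, used only for illustration.
COVARIATES = ["leverage", "sentiment"]


def is_complete(row, covariates=COVARIATES):
    """Return True when the row has a value for every covariate of interest."""
    return all(row.get(c) is not None for c in covariates)


def filter_split(train_rows, test_rows, covariates=COVARIATES):
    """Keep every training row; keep only complete-case test rows."""
    kept_test = [r for r in test_rows if is_complete(r, covariates)]
    return list(train_rows), kept_test
```

With this shape, every model is scored on the same complete test rows, while models that tolerate missing values can still train on the incomplete rows.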
Going to go ahead and drop items missing any covariate of interest.
Going to make a new version of all data that contains everything in the original plus the NLP features, but drops any observation that is missing any covariate of interest.
Created https://github.com/current12/Stat-222-Project/tree/main/Data/All_Data/All_Data_with_NLP_Features
Now need to adjust all code to use it.
Completed the adjustment in my files and left reminders on the other issues.
Add the drop step to the code that creates all data.
This allows for comparability across all models: same train-test split, same number of observations.
Output a dataset listing each dropped observation and why it was dropped.
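One possible shape for the drop step and the dropped-observation log, sketched in plain Python. It assumes rows are dicts sharing the same keys and that missing values are `None`; the function names, the `drop_reason` column, and the output path are all hypothetical.

```python
import csv


def split_complete_cases(rows, covariates):
    """Split rows into kept rows and dropped rows annotated with a reason."""
    kept, dropped = [], []
    for row in rows:
        missing = [c for c in covariates if row.get(c) is None]
        if missing:
            # Record which covariates were missing for this observation.
            reason = "missing: " + ", ".join(missing)
            dropped.append({**row, "drop_reason": reason})
        else:
            kept.append(row)
    return kept, dropped


def write_dropped_log(dropped, path):
    """Write the dropped rows and their reasons to a CSV file."""
    fieldnames = list(dropped[0].keys()) if dropped else ["drop_reason"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(dropped)
```

Running the split once, upstream of every model, is what keeps the train-test split and the observation count identical across models.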