kopant closed this 1 month ago
That's a good idea, thanks! We'll incorporate that when doing the enhancement.
Hi @kopant, happy to note that you can now do this by setting numerical_handle_na to True and modifying numerical_how_handle_na to either "mean", "median" or "value". If you want to use a specific value, you can set numerical_na_value.
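For anyone reading along, here is a rough sketch of what the three strategies mean in plain pandas. The helper impute_numeric and its arguments are hypothetical names chosen for illustration; this is not the library's implementation, only the general idea behind "mean", "median" and "value".

```python
import pandas as pd


def impute_numeric(df: pd.DataFrame, how: str = "median", na_value: float = 0.0) -> pd.DataFrame:
    """Hypothetical helper: fill missing numeric values column by column."""
    if how == "mean":
        # Replace NaN with each column's mean
        return df.fillna(df.mean())
    if how == "median":
        # Replace NaN with each column's median
        return df.fillna(df.median())
    if how == "value":
        # Replace NaN with a user-supplied constant
        return df.fillna(na_value)
    raise ValueError(f"unknown strategy: {how!r}")


nums = pd.DataFrame({"age": [20.0, None, 40.0, 90.0]})

filled_mean = impute_numeric(nums, how="mean")
filled_median = impute_numeric(nums, how="median")
filled_value = impute_numeric(nums, how="value", na_value=-1.0)
```

On this toy column the mean and median differ (50 versus 40), which is one reason the choice of strategy can matter for skewed features.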
Thanks for making the change, @akashsaravanan-georgian!
Since you mentioned you're considering enhancing load_data(), I might also try exposing different methods for imputing missing numeric data to the user. Currently data_utils.load_num_feats() defaults to median imputation, but this can be a poor choice when the data is missing because of real differences in the data-generating process (i.e., NULL data followed a different process than non-NULL data and is meaningfully distinct from it). In that case, one might instead want to encode the missing data with a value distinct from the non-NULL distribution before modeling.
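To make the last point concrete, here is a minimal sketch of encoding missingness explicitly rather than hiding it behind the median. The function encode_missing, the sentinel of -1.0, and the "_was_missing" column suffix are all assumptions for illustration and are not part of data_utils; the sentinel only makes sense if it lies outside the feature's real range.

```python
import pandas as pd


def encode_missing(df: pd.DataFrame, sentinel: float = -1.0) -> pd.DataFrame:
    """Hypothetical helper: keep 'was missing' as a signal the model can use."""
    out = df.copy()
    for col in df.columns:
        # Indicator column: 1 where the original value was NULL, else 0
        out[col + "_was_missing"] = df[col].isna().astype(int)
        # Fill with a value chosen to sit outside the observed distribution
        out[col] = df[col].fillna(sentinel)
    return out


raw = pd.DataFrame({"income": [30000.0, None, 52000.0]})
encoded = encode_missing(raw, sentinel=-1.0)
```

With median imputation the second row would look like an ordinary observation; here it stays distinguishable both through the sentinel and through the indicator column.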