georgian-io / Multimodal-Toolkit

Multimodal model for text and tabular data with HuggingFace transformers as building block for text data
https://multimodal-toolkit.readthedocs.io
Apache License 2.0

Imputation of numerical data #69

Status: Closed (kopant closed this issue 1 month ago)

kopant commented 8 months ago

Since you mentioned you're considering enhancing load_data(), it may also be worth exposing different methods for imputing missing numeric data to the user. Currently data_utils.load_num_feats() defaults to median imputation, which can be a poor choice when the data is missing because of real differences in the data-generating process (i.e., the NULL rows followed a different process than the non-NULL rows and are meaningfully distinct from them). In that case, one might instead want to encode the missing data with a distinct value that lies outside the non-NULL distribution before modeling.
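To make the concern concrete, here is a minimal standard-library sketch, not the toolkit's code. The helper names (`is_missing`, `impute_median`, `impute_sentinel`), the sentinel `-1.0`, and the toy `incomes` column are all hypothetical, and the extra missingness flag is an optional addition beyond what is proposed above.

```python
import math
from statistics import median


def is_missing(x):
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def impute_median(values):
    """Replace missing entries with the median of the observed entries."""
    observed = [v for v in values if not is_missing(v)]
    fill = median(observed)
    return [fill if is_missing(v) else v for v in values]


def impute_sentinel(values, sentinel=-1.0):
    """Replace missing entries with a value outside the observed range.

    Also returns 0/1 flags marking which entries were missing, so a model
    can still tell imputed rows apart from observed ones.
    """
    filled = [sentinel if is_missing(v) else v for v in values]
    flags = [1 if is_missing(v) else 0 for v in values]
    return filled, flags


# Hypothetical numeric column with two missing entries.
incomes = [40.0, None, 55.0, float("nan"), 70.0]

median_filled = impute_median(incomes)
sentinel_filled, missing_flags = impute_sentinel(incomes)

print(median_filled)    # missing rows become indistinguishable from 55.0
print(sentinel_filled)  # missing rows keep a distinct value
print(missing_flags)
```

With median imputation the two missing rows collapse onto the observed value 55.0, whereas the sentinel keeps them separable, which is the behaviour wanted when missingness itself carries signal.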

akashsaravanan-georgian commented 8 months ago

That's a good idea, thanks! We'll incorporate that when doing the enhancement.

akashsaravanan-georgian commented 1 month ago

Hi @kopant, happy to note that you can now do this by setting numerical_handle_na to True and setting numerical_how_handle_na to "mean", "median", or "value". If you want to use a specific value, you can set it with numerical_na_value.
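As a rough illustration of how these three settings plausibly interact, here is a standard-library sketch. It is not the toolkit's implementation: `fill_numeric_na` is a hypothetical helper, its defaults are arbitrary, and only the parameter names and the three option strings come from the comment above. Where the real settings are passed (for example, as arguments to load_data()) is not stated in this thread.

```python
import math
from statistics import mean, median


def _is_na(v):
    # Treat None and NaN as missing (an assumption for this sketch).
    return v is None or math.isnan(v)


def fill_numeric_na(
    values,
    numerical_handle_na=False,          # arbitrary default for this sketch
    numerical_how_handle_na="median",   # arbitrary default for this sketch
    numerical_na_value=0.0,             # arbitrary default for this sketch
):
    """Hypothetical helper mirroring the option names from the thread."""
    if not numerical_handle_na:
        # Handling disabled: return the data unchanged.
        return list(values)

    observed = [v for v in values if not _is_na(v)]
    if numerical_how_handle_na == "mean":
        fill = mean(observed)
    elif numerical_how_handle_na == "median":
        fill = median(observed)
    elif numerical_how_handle_na == "value":
        fill = numerical_na_value
    else:
        raise ValueError(f"unknown method: {numerical_how_handle_na!r}")

    return [fill if _is_na(v) else v for v in values]


column = [1.0, None, 3.0, 8.0]

by_mean = fill_numeric_na(column, numerical_handle_na=True,
                          numerical_how_handle_na="mean")
by_median = fill_numeric_na(column, numerical_handle_na=True,
                            numerical_how_handle_na="median")
by_value = fill_numeric_na(column, numerical_handle_na=True,
                           numerical_how_handle_na="value",
                           numerical_na_value=-999.0)
```

In this sketch numerical_na_value only takes effect when the method is "value", which is one reading of the comment above rather than confirmed library behaviour.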

kopant commented 1 month ago

Thanks for making the change, @akashsaravanan-georgian!