georgian-io / Multimodal-Toolkit

Multimodal model for text and tabular data with HuggingFace transformers as building block for text data
https://multimodal-toolkit.readthedocs.io
Apache License 2.0

Pass through argument 'handle_unknown' of sklearn OneHotEncoder #66

Closed: kopant closed this issue 1 month ago

kopant commented 8 months ago

In data_utils.CategoricalFeatures._one_hot(), could you expose the handle_unknown argument of sklearn's OneHotEncoder to the user, so that they have the option to specify handle_unknown='ignore'? As-is, the code and example notebook become problematic in the common case where the train set and the validation or test sets contain different distinct levels of a categorical variable. In that case, we cannot score a model trained on the train set on the test or validation set, because the mismatched number of levels in the categorical variable causes an error. This happens whenever a categorical variable has rare levels.
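For reference, a minimal sklearn-only reproduction of the failure and of the requested workaround, using made-up toy data (this is not the toolkit's code):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["red"], ["blue"], ["red"]])
test = np.array([["green"]])  # level never seen during training

# Default handle_unknown="error" raises ValueError on the unseen level.
enc = OneHotEncoder(handle_unknown="error")
enc.fit(train)
try:
    enc.transform(test)
except ValueError as e:
    print("raised:", e)

# handle_unknown="ignore" encodes unseen levels as an all-zero row instead.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)
print(enc.transform(test).toarray())  # [[0. 0.]]
```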

Another option is to call load_data() on the entire modeling dataset and only split it into train/val/test afterwards, but this is not straightforward, at least when you have pre-existing indices for the train/val/test sets (since you would be trying to split a PyTorch Dataset).

akashsaravanan-georgian commented 8 months ago

Hi @kopant, thanks for raising this issue! We're looking into modifying how our load_data() function works to make it easier to work with so you can look forward to that in the future.

I'd like to understand the error you're facing now. Currently the code uses the entire dataset (train+val+test) to build the categorical encoders. So you shouldn't be running into any errors in this scenario, in theory at least. Could you go into a bit more detail or alternatively share the dataset you're trying this with?

kopant commented 8 months ago

Hi @akashsaravanan-georgian, I believe in the example Jupyter notebook the data is first split into train/val/test before the model is trained on the train set, and, as is usually the case, the model is then evaluated on the test set. It's true that if we ran the categorical encoders on the entire (train+val+test) dataset, there wouldn't be an issue with the encoders. However, in that case how would we split the encoded dataset back into train/val/test?

This may be something I'm just not aware of, but it seems difficult to split torch Datasets so that they align with predefined indices for train/val/test. One can fairly easily create new train/val/test splits using built-in torch functions, but I often find myself wanting to use a predefined set of splits (say, indexed by row index or the like). Because the output of the categorical encoding is a torch Dataset, this becomes difficult.
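As an aside, splitting a torch Dataset along predefined indices is possible with the generic torch.utils.data.Subset wrapper; this is a plain PyTorch pattern on a toy dataset, not this library's API:

```python
from torch.utils.data import Dataset, Subset

class ToyDataset(Dataset):
    """Stand-in for the encoded torch dataset the toolkit returns."""
    def __init__(self, rows):
        self.rows = rows
    def __len__(self):
        return len(self.rows)
    def __getitem__(self, idx):
        return self.rows[idx]

full = ToyDataset(["a", "b", "c", "d", "e"])

# Predefined row indices for each split.
train_idx, val_idx, test_idx = [0, 2, 4], [1], [3]

# Subset views the full dataset through the given index list.
train_split = Subset(full, train_idx)
val_split = Subset(full, val_idx)

print(list(train_split))  # ['a', 'c', 'e']
```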

akashsaravanan-georgian commented 8 months ago

Hi @kopant, thanks for taking the time to explain! The library itself does not do any data splitting; it expects data that has already been split. So the solution in this case is to split the data before doing the encoding. If you use load_data_from_folder, it will load your pre-split data, process it, and return the processed splits to you. Internally, we combine the datasets, process them as a whole, and split them back into their original segments before converting them into torch datasets.
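The combine-then-split pattern described above can be sketched with sklearn directly (toy data again; this is an illustration of the idea, not the toolkit's internal code):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Pre-split categorical columns, as load_data_from_folder would receive them.
train = np.array([["red"], ["blue"], ["red"]])
val = np.array([["green"]])   # level that only appears in val
test = np.array([["blue"]])

# Fit one encoder on the concatenated data so every split shares
# the same category-to-column mapping ...
enc = OneHotEncoder()
enc.fit(np.vstack([train, val, test]))

# ... then transform each split separately; no level is ever "unknown".
train_ohe = enc.transform(train).toarray()
val_ohe = enc.transform(val).toarray()
print(enc.categories_[0])  # ['blue' 'green' 'red']
```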

I hope that helps! Happy to answer any other questions/clarifications you may have.

akashsaravanan-georgian commented 1 month ago

Hi @kopant, happy to note that you can now do this by passing in "ohe_handle_unknown" as part of your training arguments. The supported values are "error" (default), "ignore" and "infrequent_if_exist".
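A minimal sketch of how that might look, assuming the training arguments are supplied as JSON (only the "ohe_handle_unknown" key is confirmed by this thread; any surrounding fields would follow the toolkit's usual training-arguments schema):

```json
{
  "ohe_handle_unknown": "ignore"
}
```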