georgian-io / Multimodal-Toolkit

Multimodal model for text and tabular data with HuggingFace transformers as building block for text data
https://multimodal-toolkit.readthedocs.io
Apache License 2.0
587 stars, 84 forks

Is there a way to save the preprocessing objects for inference? (OneHotEncoder, Scaler) #76

Closed: kkristacia closed this issue 1 month ago

kkristacia commented 5 months ago

Hi, thank you for developing this package! I want to load an already saved model and use it for inference in production. How can I make the inference dataset go through the same preprocessing steps, e.g. one-hot encoding of categorical variables and scaling?

akashsaravanan-georgian commented 5 months ago

Hi @kkristacia, to load the model, you just need to run the same steps you used to create it. The only difference is that when calling model = AutoModelWithTabular.from_pretrained(...), you should set the first argument, pretrained_model_name_or_path, to the path where you saved your model.

Similarly, to preprocess the inference dataset, I would recommend calling load_data_from_folder with the same parameters you used during training. Keep the same training data so the encoders are reconstructed identically, and replace the test data with your inference data. I know this isn't optimal, so we'll definitely change this in a future version.
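One way to set up that workaround is to stage a folder that reuses the original training split and writes the inference rows in place of the test split. The sketch below is only an illustration: the helper name stage_inference_folder is hypothetical, and the train.csv / val.csv / test.csv file names are an assumption about the folder layout the loader expects. The staged folder would then be passed to load_data_from_folder with the training-time parameters. If the loader requires a label column, the inference rows may need a placeholder value, which this thread does not cover.

```python
import csv
import shutil
from pathlib import Path


def stage_inference_folder(train_dir, inference_rows, fieldnames, out_dir):
    """Hypothetical helper: build a folder holding the original train.csv
    (and val.csv, if present) plus the inference rows written as test.csv.

    The file names are an assumption about the expected folder layout.
    """
    train_dir, out_dir = Path(train_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Reuse the training split unchanged so the encoders are refit identically.
    shutil.copy(train_dir / "train.csv", out_dir / "train.csv")
    if (train_dir / "val.csv").exists():
        shutil.copy(train_dir / "val.csv", out_dir / "val.csv")

    # The inference data takes the place of the test split.
    with open(out_dir / "test.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(inference_rows)
    return out_dir
```

The columns in fieldnames should match the training CSV so that the same text, categorical, and numerical column arguments still apply.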

Please let me know if you run into any other issues and I can help you solve them! :)

kkristacia commented 5 months ago

Hi Akash, thanks for the clarification. Yeah, I was hoping for some way to avoid using the training data during inference. It would definitely be great if future versions had this functionality!

dsunart commented 5 months ago

Hi Akash. Just to second this - it would be great if the preprocessing objects were saved for making inferences in production. Loading my whole dataset into my production environment would take up space unnecessarily. Love the toolkit, and looking forward to seeing an update in the future!
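To illustrate what is being asked for here, the following is a minimal, library-independent sketch of fitting preprocessing state once, saving it, and reloading it at inference time without the training data. It is not the toolkit's implementation or API: every function name is hypothetical, plain standardization stands in for whatever numerical transform was used in training, and categorical values are assumed to be strings so they survive the JSON round trip.

```python
import json
import math


def fit_preprocessing(rows, categorical_cols, numerical_cols):
    """Collect the state needed to repeat the preprocessing later."""
    state = {"categories": {}, "scaling": {}}
    for col in categorical_cols:
        # Fixed, sorted category order defines the one-hot layout.
        state["categories"][col] = sorted({row[col] for row in rows})
    for col in numerical_cols:
        values = [float(row[col]) for row in rows]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        state["scaling"][col] = {"mean": mean, "std": math.sqrt(var) or 1.0}
    return state


def transform(row, state):
    """Apply the stored one-hot encoding and scaling to a single row."""
    features = []
    for col, cats in state["categories"].items():
        # Categories unseen during fitting encode as all zeros.
        features.extend(1.0 if row[col] == c else 0.0 for c in cats)
    for col, s in state["scaling"].items():
        features.append((float(row[col]) - s["mean"]) / s["std"])
    return features


def save_state(state, path):
    with open(path, "w") as f:
        json.dump(state, f)


def load_state(path):
    with open(path) as f:
        return json.load(f)
```

With this pattern, only the small JSON file has to ship to production; the training dataset stays behind.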

akashsaravanan-georgian commented 5 months ago

Thanks @dsunart! I'm reopening this issue as a feature request. It should be added in as part of our next release!

akashsaravanan-georgian commented 1 month ago

Hey @kkristacia and @dsunart, happy to note that this is now part of the toolkit. You can see this in action in this example.