AnotherSamWilson / miceforest

Multiple Imputation with LightGBM in Python
MIT License

Add a save_kernel method #26

Closed samFarrellDay closed 2 years ago

samFarrellDay commented 3 years ago

A save_kernel method might be able to take advantage of parquet/feather or joblib to compress the working_data well beyond what normal byte compression can achieve.

SjoerdBraaksma commented 2 years ago

I agree! This would be useful. How would you currently save/load kernels?

samFarrellDay commented 2 years ago

Kernels can be saved and loaded with the dill package. I have found that the pickle package doesn't work. Dill does a much better job of keeping track of object definitions and import requirements, even if they are nested and hidden away inside methods.
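
For anyone looking for the save/load step, here is a minimal sketch of a dill round trip. The helper names save_with_dill and load_with_dill are hypothetical (not part of miceforest), and the dictionary is only a stand-in so the snippet runs without a fitted kernel; in practice you would pass your own kernel object instead.

```python
import os
import tempfile

import dill  # third-party package: pip install dill


def save_with_dill(obj, path):
    """Serialize a Python object (e.g. a fitted kernel) to `path`."""
    with open(path, "wb") as f:
        dill.dump(obj, f)


def load_with_dill(path):
    """Load an object previously written by save_with_dill."""
    with open(path, "rb") as f:
        return dill.load(f)


# Stand-in for a fitted kernel (assumption: replace with your own kernel object).
stand_in_kernel = {"iterations": 3, "variables": ["age", "income"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    kernel_path = os.path.join(tmp_dir, "kernel.dill")
    save_with_dill(stand_in_kernel, kernel_path)
    restored_kernel = load_with_dill(kernel_path)
```

As with pickle, only load dill files that come from a source you trust, since loading can execute arbitrary code.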

SjoerdBraaksma commented 2 years ago

Thanks for the quick response Sam! That really helps.

Might it be possible to add this to the introduction of the package description? I went through a lot of documentation because of that sentence, and the examples didn't show a save/load step. (e.g. "Kernels can be saved (recommended using the dill package) and impute new, unseen datasets. Imputing new data is often orders of magnitude faster than including the new data in a new mice procedure. Imputation models can be built off of a kernel dataset, even if there are no missing values. New data can also be imputed in place.")

AnotherSamWilson commented 2 years ago

@SjoerdBraaksma You might be pleased to see a save_kernel() method has been added in 5.4.0. It uses parquet and byte compression to make the save file as small as possible.