AnotherSamWilson / miceforest

Multiple Imputation with LightGBM in Python

Saving a Kernel leads to an error #85

Closed benedictjones closed 9 months ago

benedictjones commented 11 months ago

Having followed the readme guide, I have something like this:

import miceforest as mf
import pandas as pd
import numpy as np

# Create kernel
kds = mf.ImputationKernel(
    data=df,
    datasets=2,
    save_all_iterations=True,
    random_state=42,
)

kds.mice(4)

If I then try to save the kernel as follows:

kds.save_kernel("models/MF")

I get an error: ValueError: I/O operation on closed file.

I can't find any example of anyone actually saving and loading a kernel, so it might be nice to add this to the docs?

I am assuming this isn't difficult, but could someone please assist? Is there supposed to be a specific file type that should be used?

Thanks!

AnotherSamWilson commented 11 months ago

Hmmm that's interesting. I do save kernels all the time that way, never seen that error. I don't see any problem with the way you are calling it. What OS are you using?

benedictjones commented 11 months ago

I am on macOS (Sonoma 14.1). It seems to be a pickle issue that occurs when attempting to save?

The problem originates from line 1863 in ImputationKernel.py:

with open(filepath, "wb") as f:
    dill.dump(
        blosc.compress(
            dill.dumps(kernel),  # <-- error comes from here
            clevel=clevel,
            typesize=8,
            shuffle=blosc.NOSHUFFLE,
            cname=cname,
        ),
        f,
    )

benedictjones commented 10 months ago

Ok, I worked through the basic example with a fresh conda environment. If you try to save without fastparquet or pyarrow installed, you get the error:

ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
 - Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
 - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.

If you try to save without pyarrow installed, but with fastparquet installed, you get the error I found:

ValueError: I/O operation on closed file.

But, if you have fastparquet AND pyarrow installed, it runs, saves and loads correctly.

TLDR: I would recommend people ensure they have pyarrow and fastparquet installed.
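
As a rough sketch of that recommendation, one way to check that both packages are importable before calling save_kernel is shown below. This is not part of miceforest; the helper names missing_parquet_engines and check_parquet_engines are made up for illustration, and the check only confirms that the packages can be found, not that their versions are compatible.

```python
import importlib.util

# The two parquet engines that, per this thread, both needed to be installed
# for saving to succeed. This tuple is an assumption taken from the thread.
REQUIRED_ENGINES = ("pyarrow", "fastparquet")


def missing_parquet_engines(required=REQUIRED_ENGINES):
    """Return the names in `required` that cannot be found in this environment."""
    return [name for name in required if importlib.util.find_spec(name) is None]


def check_parquet_engines(required=REQUIRED_ENGINES):
    """Raise ImportError listing any missing packages; do nothing otherwise."""
    missing = missing_parquet_engines(required)
    if missing:
        raise ImportError(
            "Missing packages: " + ", ".join(missing)
            + ". Install them with pip or conda before saving the kernel."
        )
```

Calling check_parquet_engines() just before kds.save_kernel("models/MF") should turn the confusing "I/O operation on closed file" failure into an explicit message about what is missing.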