AnotherSamWilson / miceforest

Multiple Imputation with LightGBM in Python
MIT License

Differences in imputation results #81

Closed JossTG-UPM closed 1 year ago

JossTG-UPM commented 1 year ago

I am having problems reproducing the results of the imputation process. Right now I am creating a kernel and running the MICE algorithm as follows:

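    import miceforest as mf

    # Build the kernel; `data` and `var_sch` are defined earlier by the user
    kds = mf.ImputationKernel(
        data,
        variable_schema=var_sch,
        datasets=1,
        save_all_iterations=True,
        train_nonmissing=True,
        save_models=2,
        random_state=1991
    )

    # Run one iteration of the MICE algorithm
    kds.mice(1)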

According to the documentation, once this process has been carried out, it is possible to get the complete (imputed) dataset using the complete_data() function, so I get it as follows:

    imputed_dataset = kds.complete_data(dataset=0, inplace=False)

So far so good. The problem comes when I try to use this same kernel (kds) to impute the same dataset that was used to train the kernel (data) using the impute_new_data() function, as follows:

    new_data = kds.impute_new_data(new_data=data, random_state=1991)
    new_data_imputed = new_data.complete_data(dataset=0, inplace=False)

The result with the impute_new_data() function (contained in the variable new_data_imputed) is different from that obtained directly from the kernel with the complete_data() function (contained in the variable imputed_dataset). I'm probably misunderstanding something, but shouldn't the results be the same in both cases? If not, what am I misunderstanding or doing wrong?

Thank you very much in advance for your help and time.

AnotherSamWilson commented 1 year ago

The two processes are not guaranteed to produce identical imputations, since each process uses the random state differently. The mice procedure generates models, which uses the random state; impute_new_data does not generate new models, it just reuses the existing ones. The processes are pretty similar, though. For identical datasets it might be feasible to make them produce identical imputations if the model didn't depend on the random state.
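To illustrate why the same seed can still give different imputations, here is a toy sketch using only the standard library. It is not miceforest's actual code: `toy_mice` and `toy_impute_new_data` are hypothetical stand-ins, and the assumption is simply that model fitting draws from the random state before any imputation draws are made.

```python
import random


def toy_mice(n_missing, seed, n_models=1):
    """Hypothetical fit-then-impute pass (not miceforest internals)."""
    rng = random.Random(seed)
    # Assumption: fitting each model spends one draw, e.g. to seed the learner.
    model_seeds = [rng.random() for _ in range(n_models)]
    # Imputation draws come from whatever state is left after fitting.
    draws = [rng.random() for _ in range(n_missing)]
    return model_seeds, draws


def toy_impute_new_data(n_missing, seed):
    """Hypothetical impute-only pass that reuses already-fitted models."""
    rng = random.Random(seed)
    # No fitting happens here, so imputation draws start from the fresh state.
    return [rng.random() for _ in range(n_missing)]


_, mice_draws = toy_mice(5, seed=1991)
new_draws = toy_impute_new_data(5, seed=1991)

# Same seed, but the two sequences of imputation draws are not aligned.
print(mice_draws == new_draws)
# They are the same stream, offset by the one draw spent on the model.
print(mice_draws[:-1] == new_draws[1:])  # True
```

In this toy setup both functions are individually reproducible for a fixed seed. It is only the comparison across the two that breaks, which matches the behaviour described above.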

JossTG-UPM commented 1 year ago

Thank you very much for your quick response @AnotherSamWilson. Precisely because of the use of the random state I thought that the results would have to be identical. In the mice procedure I use a random state (in my example 1991), which is the same random state that I then use in the impute_new_data function. By using the same random state in both procedures and the same data set, I understood that both imputations should be identical.

AnotherSamWilson commented 1 year ago

This is not the case. If you used the same random state on two mice procedures, the results would be identical, and the same random state on two impute_new_data procedures would also give the same results. But the results are not guaranteed to be the same between mice and impute_new_data.

This is something to look at in the future. I am not sure if it is possible, since the model building depends on the random state right now.

I guess we could use the same seed for each model, or have a different random state that generates model seeds, but that is not implemented yet.
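A rough sketch of the second idea, a separate random state that generates model seeds, might look like the following. This is only an illustration with the standard library: `make_streams`, `split_mice` and `split_impute_new_data` are hypothetical names, not part of miceforest, and aligning the random draws alone would not necessarily make the library's two code paths produce identical output.

```python
import random


def make_streams(seed):
    """Derive two separate generators from one user-facing seed (hypothetical)."""
    master = random.Random(seed)
    model_rng = random.Random(master.getrandbits(64))   # only used for model seeds
    impute_rng = random.Random(master.getrandbits(64))  # only used for imputation draws
    return model_rng, impute_rng


def split_mice(n_missing, seed, n_models=1):
    """Toy fit-then-impute pass where fitting cannot disturb imputation draws."""
    model_rng, impute_rng = make_streams(seed)
    model_seeds = [model_rng.getrandbits(32) for _ in range(n_models)]
    draws = [impute_rng.random() for _ in range(n_missing)]
    return model_seeds, draws


def split_impute_new_data(n_missing, seed):
    """Toy impute-only pass that ignores the model stream entirely."""
    _, impute_rng = make_streams(seed)
    return [impute_rng.random() for _ in range(n_missing)]


# However many models are fitted, the imputation draws now line up.
_, draws_one_model = split_mice(5, seed=1991, n_models=1)
_, draws_three_models = split_mice(5, seed=1991, n_models=3)
print(draws_one_model == split_impute_new_data(5, seed=1991))     # True
print(draws_three_models == split_impute_new_data(5, seed=1991))  # True
```

The design choice here is that the number of models fitted no longer shifts the imputation stream, because each purpose has its own generator derived from the same seed.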