AnotherSamWilson / miceforest

Multiple Imputation with LightGBM in Python
MIT License
353 stars, 31 forks

How to customize initial imputations #65

Closed: MrWeijing closed this issue 2 years ago

MrWeijing commented 2 years ago

Hello! This is a very useful project. I think a proper initial imputation will improve the final imputation accuracy of the model, so I want to customize the initial imputation method, but I don't know how to modify the code. I hope to get some help and look forward to your reply.

AnotherSamWilson commented 2 years ago

Iteration 0 (the initial imputations) is determined here: https://github.com/AnotherSamWilson/miceforest/blob/master/miceforest/ImputationKernel.py#L448-L495

The easiest thing for you to do would be to edit this function in your own install. The nice thing about this function is that it is used to impute new data too, so whatever rules you create will also be used in impute_new_data, if you plan on using it.
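To illustrate why one shared initialization function is convenient, here is a minimal standalone sketch. It does not use miceforest at all: ToyKernel, _initial_fill and impute_new are hypothetical names, and the median rule is only an example of a custom rule, not what the library does. The point is just that a rule written once in the shared function applies to both the original data and any new data.

```python
from statistics import median


class ToyKernel:
    """Toy stand-in for illustration only; NOT miceforest's ImputationKernel."""

    def __init__(self, columns):
        # Learn one fill value per column from the observed (non-None) entries.
        # Using the median here is an assumption made for this example.
        self.fill_values = {
            var: median(v for v in vals if v is not None)
            for var, vals in columns.items()
        }
        # The original data goes through the shared fill rule...
        self.data = self._initial_fill(columns)

    def _initial_fill(self, columns):
        # The single place where the "iteration 0" rule lives.
        return {
            var: [self.fill_values[var] if v is None else v for v in vals]
            for var, vals in columns.items()
        }

    def impute_new(self, new_columns):
        # ...and so does any new data, with no extra code.
        return self._initial_fill(new_columns)


kernel = ToyKernel({"age": [31, None, 45, 52]})
new_data = kernel.impute_new({"age": [None, 20]})
print(kernel.data["age"])  # [31, 45, 45, 52]
print(new_data["age"])     # [45, 20]
```

If you edit the real function, the same idea should hold, but check the linked source for the actual arguments it receives before changing anything.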

MrWeijing commented 2 years ago

Thank you for your answer! I also want to ask two questions.

  1. What is the data structure and meaning of the code "imputed_data[ds, var, 0]"? I think "ds" represents the dataset index and "var" represents the column index, but I don't know the meaning of "0".
  2. If I set two iterations to complete the imputation, how many times will the initial imputation method be called?

AnotherSamWilson commented 2 years ago

  1. The 0 corresponds to the 0th iteration. The imputations are stored in a dict whose keys are 3-tuples of (dataset, variable, iteration).
  2. _initialize_dataset is only called once per ImputationKernel, in __init__(). However, that function is also called to initialize a new ImputedData if you call impute_new_data on a dataset. This should not affect the original kernel in any way, but if you edit the code, I can't make any guarantees.
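
Both answers can be pictured with a small standalone sketch. It is plain Python, not miceforest code: initialize, run_iteration and the placeholder string values are hypothetical, and the real library stores imputed values rather than strings. It shows a dict keyed by (dataset, variable, iteration) and that the iteration-0 entries are written once, no matter how many iterations follow.

```python
def initialize(store, num_datasets, variables):
    # Written once: every (dataset, variable) pair gets an iteration-0 entry.
    for ds in range(num_datasets):
        for var in variables:
            store[ds, var, 0] = f"initial fill for {var}"


def run_iteration(store, iteration, num_datasets, variables):
    # Later iterations add new keys; they never touch the iteration-0 entries.
    for ds in range(num_datasets):
        for var in variables:
            store[ds, var, iteration] = f"model-based fill, iteration {iteration}"


variables = ["age", "income"]
imputation_values = {}

initialize(imputation_values, num_datasets=2, variables=variables)
for iteration in (1, 2):
    run_iteration(imputation_values, iteration, num_datasets=2, variables=variables)

# d[ds, var, 0] is just shorthand for d[(ds, var, 0)]: the key is one tuple.
print(imputation_values[0, "age", 0] == imputation_values[(0, "age", 0)])  # True

# 2 datasets x 2 variables = 4 iteration-0 entries, out of 12 keys in total.
iteration_zero_keys = [key for key in imputation_values if key[2] == 0]
print(len(iteration_zero_keys), len(imputation_values))  # 4 12
```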