darya-chyzhyk / confound_prediction

Confound-isolating cross-validation approach to control for a confounding effect in a predictive model.
BSD 3-Clause "New" or "Revised" License

Generalization to more than 1 confounding factor #12

Open Rachine opened 4 years ago

Rachine commented 4 years ago

Hello, thank you very much for tackling this issue of confounders, which recurs frequently in clinical ML problems.

I have some questions about the project/paper:

  1. I am wondering why only the test set needs to be deconfounded. Why not also build a deconfounded train set together with a deconfounded test set (with no data leakage, of course)?
  2. I tried to generalize your methodology to k confounders: [equation image]. I still used most of your codebase, together with a pseudo-generalization of the mutual information to multiple variables. The probability m_i of a sample being selected, which was [equation image], is now:

[equation image]

The quantity [expression image] can still be estimated with kernel density estimation.
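As a rough illustration of the kernel-density estimate mentioned above, here is a minimal sketch (not the repository's code; the function and variable names are my own) of estimating, per sample, the log-ratio between the joint density of y with k confounds and the product of their marginals. Averaging it over samples gives a plug-in estimate of the multi-variable dependence:

```python
# Hedged sketch: per-sample log p(y, z_1..z_k) - log p(y) - sum_j log p(z_j),
# each density estimated with a Gaussian KDE.
import numpy as np
from scipy.stats import gaussian_kde

def pointwise_log_ratio(y, Z):
    """Per-sample log-density ratio between the joint and the product of marginals."""
    y = np.asarray(y, dtype=float)
    Z = np.asarray(Z, dtype=float)          # shape (n, k)
    joint = np.vstack([y, Z.T])             # shape (1 + k, n)
    log_ratio = np.log(gaussian_kde(joint)(joint))
    log_ratio -= np.log(gaussian_kde(y)(y))
    for j in range(Z.shape[1]):
        log_ratio -= np.log(gaussian_kde(Z[:, j])(Z[:, j]))
    return log_ratio

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))
y = Z @ np.array([1.0, 0.5]) + rng.normal(size=500)
ratio = pointwise_log_ratio(y, Z)
# The mean over samples is clearly positive here, reflecting the
# dependence between y and the confounds in this toy additive setup.
```

Large positive per-sample values flag the observations that most tie y to the confounds, which is the quantity the selection probability m_i would be built from.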

I made some quick toy examples; the approach seems to work approximately on simple additive toy examples when the number of samples is sufficient. For instance, with 1000 samples and 10 confounding factors I got: [result image]. With 100 samples and 3 confounding factors I got:

[result image]

It would also be interesting to study the sample size N required to guarantee, at a given confidence level, the deconfounding capability for k factors, depending on the type of link.
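Such a study could be set up as a Monte-Carlo simulation along these lines (a hypothetical sketch; all names are my own, and a simple greedy covariance-based removal stands in for the actual confound-isolating sampler):

```python
# Hypothetical sketch: for a given n and k under an additive link, subsample
# to suppress the y-z dependence and report the worst remaining |correlation|.
import numpy as np

def worst_corr(y, Z, idx):
    return max(abs(np.corrcoef(y[idx], Z[idx, j])[0, 1]) for j in range(Z.shape[1]))

def deconfound_capability(n, k, frac_keep=0.5, seed=0):
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, k))
    y = Z.sum(axis=1) + rng.normal(size=n)   # additive link, as in the toy examples
    idx = np.arange(n)
    before = worst_corr(y, Z, idx)
    while idx.size > int(frac_keep * n):
        # drop the sample contributing most to |cov(y, z_1 + ... + z_k)|
        s = Z[idx].sum(axis=1)
        score = np.abs((y[idx] - y[idx].mean()) * (s - s.mean()))
        idx = np.delete(idx, np.argmax(score))
    return before, worst_corr(y, Z, idx)

before, after = deconfound_capability(n=800, k=3)
# Sweeping n and k and repeating over seeds would give the required-N
# curves suggested above, for each type of link.
```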

Do you think this is a correct approach and generalization?

Thank you

Best regards

Rachine commented 4 years ago

Oops, after some thinking, maybe I should look at the goodness of fit with the multiple variables jointly, and not only at the individual correlations, to test: [equation image]

[result image]

[result image] I added the R^2 from an ordinary least squares fit with statsmodels, 'y ~ z0 + z1 + z2'.
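That joint goodness-of-fit check can be sketched as follows (a toy illustration using plain numpy least squares; a statsmodels OLS with the formula 'y ~ z0 + z1 + z2', as mentioned above, should report the same R^2):

```python
# Sketch: regress y on all confounds jointly and read off R^2, rather than
# inspecting each correlation individually.
import numpy as np

rng = np.random.default_rng(0)
n = 300
Z = rng.normal(size=(n, 3))                        # confounds z0, z1, z2
y = Z[:, 0] + 0.5 * Z[:, 1] + rng.normal(size=n)   # z2 plays no role here

X = np.column_stack([np.ones(n), Z])               # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1.0 - resid.var() / y.var()
# On raw (confounded) data r2 is clearly positive; after a successful
# deconfounding subsample it should drop toward zero.
```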