darya-chyzhyk / confound_prediction

Confound-isolating cross-validation approach to control for a confounding effect in a predictive model.
BSD 3-Clause "New" or "Revised" License

Generalization to more than 1 confounding factor #12

Open Rachine opened 4 years ago

Rachine commented 4 years ago

Hello, thank you very much for tackling this issue of confounders, which recurs frequently in clinical ML problems.

I have some questions about the project/paper:

  1. I am wondering why only the test set needs to be deconfounded. Why not also build a deconfounded train set together with a deconfounded test set (with no data leakage, of course)?
  2. I tried to generalize your methodology to k confounders: [equation image]. I still used most of your codebase, together with a pseudo-generalization of the mutual information to multiple variables. The probability m_i of a sample being selected, which was [equation image], is now:

[equation image]

The quantity [expression image] can still be estimated with kernel density estimation.
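As a rough illustration of the kernel-density estimate mentioned above, here is a minimal sketch (not the repository's code; the function and variable names are my own) of estimating, per sample, the log-ratio between the joint density of y with k confounds and the product of their marginals. Averaging it over samples gives a plug-in estimate of the multi-variable dependence:

```python
# Hedged sketch: per-sample log p(y, z_1..z_k) - log p(y) - sum_j log p(z_j),
# each density estimated with a Gaussian KDE.
import numpy as np
from scipy.stats import gaussian_kde

def pointwise_log_ratio(y, Z):
    """Per-sample log-density ratio between the joint and the product of marginals."""
    y = np.asarray(y, dtype=float)
    Z = np.asarray(Z, dtype=float)          # shape (n, k)
    joint = np.vstack([y, Z.T])             # shape (1 + k, n)
    log_ratio = np.log(gaussian_kde(joint)(joint))
    log_ratio -= np.log(gaussian_kde(y)(y))
    for j in range(Z.shape[1]):
        log_ratio -= np.log(gaussian_kde(Z[:, j])(Z[:, j]))
    return log_ratio

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))
y = Z @ np.array([1.0, 0.5]) + rng.normal(size=500)
ratio = pointwise_log_ratio(y, Z)
# The mean over samples is clearly positive here, reflecting the
# dependence between y and the confounds in this toy additive setup.
```

Large positive per-sample values flag the observations that most tie y to the confounds, which is the quantity the selection probability m_i would be built from.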

I made some quick toy examples; the approach seems to work approximately on simple additive toy examples when the number of samples is sufficient. For instance, with 1000 samples and 10 confounding factors I got: [result image]. With 100 samples and 3 confounding factors I got:

[result image]

It would also be interesting to study the sample size N required to guarantee, at a given confidence level, the deconfounding capability for k factors, depending on the type of link.
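Such a study could be set up as a Monte-Carlo simulation along these lines (a hypothetical sketch; all names are my own, and a simple greedy covariance-based removal stands in for the actual confound-isolating sampler):

```python
# Hypothetical sketch: for a given n and k under an additive link, subsample
# to suppress the y-z dependence and report the worst remaining |correlation|.
import numpy as np

def worst_corr(y, Z, idx):
    return max(abs(np.corrcoef(y[idx], Z[idx, j])[0, 1]) for j in range(Z.shape[1]))

def deconfound_capability(n, k, frac_keep=0.5, seed=0):
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, k))
    y = Z.sum(axis=1) + rng.normal(size=n)   # additive link, as in the toy examples
    idx = np.arange(n)
    before = worst_corr(y, Z, idx)
    while idx.size > int(frac_keep * n):
        # drop the sample contributing most to |cov(y, z_1 + ... + z_k)|
        s = Z[idx].sum(axis=1)
        score = np.abs((y[idx] - y[idx].mean()) * (s - s.mean()))
        idx = np.delete(idx, np.argmax(score))
    return before, worst_corr(y, Z, idx)

before, after = deconfound_capability(n=800, k=3)
# Sweeping n and k and repeating over seeds would give the required-N
# curves suggested above, for each type of link.
```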

Do you think this is a correct approach and generalization?

Thank you

Best regards

Rachine commented 4 years ago

Oops, after some thinking, maybe I should look at the goodness of fit with the multiple variables jointly, and not only at the individual correlations, to test: [equation image]

[result image]

[result image] I added the R^2 from an ordinary least squares fit with statsmodels, 'y ~ z0 + z1 + z2'.
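That joint goodness-of-fit check can be sketched as follows (a toy illustration using plain numpy least squares; a statsmodels OLS with the formula 'y ~ z0 + z1 + z2', as mentioned above, should report the same R^2):

```python
# Sketch: regress y on all confounds jointly and read off R^2, rather than
# inspecting each correlation individually.
import numpy as np

rng = np.random.default_rng(0)
n = 300
Z = rng.normal(size=(n, 3))                        # confounds z0, z1, z2
y = Z[:, 0] + 0.5 * Z[:, 1] + rng.normal(size=n)   # z2 plays no role here

X = np.column_stack([np.ones(n), Z])               # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1.0 - resid.var() / y.var()
# On raw (confounded) data r2 is clearly positive; after a successful
# deconfounding subsample it should drop toward zero.
```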