dhimmel / learn

Machine learning and feature extraction for the Rephetio project
https://doi.org/10.15363/thinklab.d210

How do you generate blacklist.tsv in 4-predictr.ipynb? #10

Closed ltrainstg closed 3 years ago

ltrainstg commented 3 years ago

I was trying to reproduce some of the predictions from this project and noticed that blacklist.tsv seems to come from nowhere. How was this file generated? Is it just eliminating all the CtD and DtC features?

I think this was done with a grepl on the all-features file in 5.6-model.ipynb. Is that right?

dhimmel commented 3 years ago

Quoting from https://think-lab.github.io/d/210/#4:

Despite our efforts to remove features susceptible to edge dropout contamination during Stage 1 feature selection, several features were getting incorporated into the Stage 2 model that showed evidence of contamination: features with treats relationships receiving negative coefficients when their marginal association with treatment is positive. Hence, I manually assembled a feature blacklist, which I iteratively added features to that were showing evidence of contamination. In total, 22 features were blacklisted.
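The contamination signal described in the quote (a negative model coefficient on a treats-containing feature whose marginal association is positive) could be screened for with something like the sketch below. It is not the procedure used for blacklist.tsv, which was assembled manually. The function name, the example feature names, and all numeric values are hypothetical, and matching on the substrings CtD and DtC is an assumption about how treats edges appear in feature names.

```python
def flag_suspect_features(coefficients, marginal_associations,
                          treats_tokens=("CtD", "DtC")):
    """Return features that contain a treats edge and whose model
    coefficient is negative while their marginal association is positive."""
    suspects = []
    for feature, coef in coefficients.items():
        # Assumption: a treats edge shows up as "CtD" or "DtC" in the name.
        has_treats = any(token in feature for token in treats_tokens)
        marginal = marginal_associations.get(feature)
        if has_treats and marginal is not None and coef < 0 < marginal:
            suspects.append(feature)
    return sorted(suspects)


# Hypothetical values, for illustration only.
coefficients = {"CtDrD": -0.4, "CrCtD": 0.7, "CbGaD": -0.2, "CbGbCtD": -0.1}
marginal = {"CtDrD": 0.9, "CrCtD": 0.8, "CbGaD": 0.3, "CbGbCtD": 0.5}

suspects = flag_suspect_features(coefficients, marginal)
print(suspects)
```

A list like this would only be a starting point for manual review, since a sign flip can also come from ordinary collinearity between features.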

If you have flexibility with your experimental design (as opposed to a pure replication), I would consider an approach that won't be affected by prediction-edge dropout contamination. The way to do this is to train on a network that doesn't contain any of the edges used to assign sample status. For example, if you use an edge to assign positive status to a Source-Target pair (i.e. a sample), then remove that edge from the network before computing network-based features for any sample.
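A minimal sketch of that design, assuming a toy network stored as (source, kind, target) tuples: the label edges are split off first, and features are computed only on what remains. The node names, the edge kinds, and both helper functions are hypothetical and are not part of this repository.

```python
def strip_label_edges(edges, label_kind="treats"):
    """Split edges into a feature network and a set of positive pairs.

    Every edge of kind `label_kind` becomes a positive (source, target)
    sample and is excluded from the network used for features.
    """
    positives = {(s, t) for s, kind, t in edges if kind == label_kind}
    feature_edges = [e for e in edges if e[1] != label_kind]
    return feature_edges, positives


def count_shared_gene_paths(edges, compound, disease):
    """Count Compound-binds-Gene-associates-Disease paths (toy feature)."""
    bound = {t for s, kind, t in edges if s == compound and kind == "binds"}
    linked = {s for s, kind, t in edges if t == disease and kind == "associates"}
    return len(bound & linked)


# Hypothetical toy network.
edges = [
    ("C1", "treats", "D1"),
    ("C1", "binds", "G1"),
    ("C1", "binds", "G2"),
    ("G1", "associates", "D1"),
    ("G2", "associates", "D1"),
    ("C2", "binds", "G1"),
]

feature_edges, positives = strip_label_edges(edges)

# Features for every sample come from the same treats-free network.
for pair in [("C1", "D1"), ("C2", "D1")]:
    label = int(pair in positives)
    feature = count_shared_gene_paths(feature_edges, *pair)
    print(pair, label, feature)
```

Because no sample's features can see a label edge, positives and negatives are computed on identical terms and no per-sample edge dropout is needed.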

ltrainstg commented 3 years ago

Got it, thanks.