aertslab / GENIE3

GENIE3 (GEne Network Inference with Ensemble of trees) R-package

Clarification on Methodology & Datasets used in Paper? #15

Open mr-september opened 3 years ago

mr-september commented 3 years ago

Hi, I have some questions regarding the specific methodology of the paper.

Firstly, I'd like to apologise for not being an expert in stats; I watched some tutorials and read some chapters of relevant stats textbooks, but am having trouble translating their examples to this method.

  1. What exactly does each decision tree look like? "For each gene j, a learning sample is generated with expression levels of j as output values and expression levels of all other genes as input values"

Is this what each tree would look like? (Gene "X" in my diagram would be the equivalent of gene "j" in the paper.) [image: GENIE3 structure diagram]

1.1 And for each gene j, this is repeated many times, with all the inputs/intermediary genes randomized?

1.2 I assume for each tree, the whole counts matrix is used, i.e. all observations of gene expression profiles. How are conflicts handled? For example, in my above diagram, if Gene3>2 has a few counts of GeneX>10, but ALSO a few counts of GeneX<=10? Is it a winner-takes-all binary split? Do these conflicts directly affect the weight of each tree, i.e. the "sums of total variance reduction"? If not the whole matrix, what is the splitting/bagging heuristic?

1.3 How does the tree arrive at the binary split threshold at each node? And what about the final output gene X/j (e.g. Gene X > or <= "10")?

1.4 Why was "sums of total variance reduction" used as opposed to a more traditional metric, e.g. Gini impurity?

1.5 Ultimately, decision trees should each "vote" on an outcome... but not in this application, right? Only the total variance reduction matters? Or does it work in another way entirely?

1.6 The vignette mentions 2 ways of thresholding: for example, top 5 per gene, or weight > 0.1. This was also left open in the paper. What value was used in the paper to win the DREAM4 challenge? How did the authors arrive at it?

  2. About the E. coli microarray dataset used:

Escherichia coli Dataset ...It contains 907 E. coli microarray expression profiles of 4297 genes collected from different experiments at steady-state level. To validate the network predictions we used 3433 experimentally confirmed regulatory interactions among 1471 genes that have been curated in RegulonDB version 6.4 [40].

Am I right to say that this is bulk sequencing of 907 E. coli cells in one go, on one plate, so that the resulting expression file consists of only one column, i.e.

           Sample1
Gene1       9 
Gene2       4
Gene3       9
Gene4       7
Gene5       5
Gene6      10 

If yes, how can a random forest get constructed? Does this use the same splitting/bagging heuristic as above?

I'd also appreciate feedback on whether I am asking the right questions, or if there are other related questions that I should have asked but didn't.

Thanks!

mr-september commented 3 years ago

I've dug into the code a bit more and have come up with some answers; some things are still not clear, though. I would appreciate it if the authors could confirm and give more clarity on the remaining questions (in bold):

What exactly does each decision tree look like?

I think my diagram was pretty close, with the difference that the output/leaf nodes are split differently at different branches, as opposed to sharing a common split threshold (10 in my diagram above).
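For my own reference, here's a minimal sketch of how I understand the learning sample for one target gene is built (plain R with made-up variable names, not the package's internals):

```r
# Toy expression matrix: rows = genes, columns = samples/observations
set.seed(1)
exprMatr <- matrix(rnorm(20 * 10), nrow = 20, ncol = 10,
                   dimnames = list(paste0("Gene", 1:20), paste0("Sample", 1:10)))

# Learning sample for target gene j ("GeneX" in my diagram):
# output = expression of gene j across all samples,
# inputs = expression of every other gene across the same samples
j <- "Gene3"
y <- exprMatr[j, ]                           # output values, one per observation
X <- t(exprMatr[rownames(exprMatr) != j, ])  # observations x (p - 1) input genes

# A regression tree is then grown on (X, y), and the forest repeats this
# many times with randomized candidate inputs at each split
```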

1.1 And for each gene j, this is repeated many times, with all the inputs/intermediary genes randomized?

Yes, 1000 trees by default, with sqrt(total number of genes) candidate genes considered at each split. The paper mentioned K = p-1 as outperforming K = sqrt(p), but sqrt(p) ended up as the default in the code. Is this a computational-speed tradeoff?
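If I'm reading the vignette and the function arguments correctly, both settings can be toggled when calling GENIE3() (untested sketch; K = "all" is my reading of how to get the paper's K = p - 1 behaviour):

```r
library(GENIE3)

# Default behaviour: Random Forest, 1000 trees, sqrt(p) candidate genes per split
weightMat_default <- GENIE3(exprMatr, treeMethod = "RF", K = "sqrt", nTrees = 1000)

# Presumably closer to the paper's K = p - 1 setting: consider all other genes
# as candidates at every split (slower, but reportedly slightly more accurate)
weightMat_all <- GENIE3(exprMatr, treeMethod = "RF", K = "all", nTrees = 1000)
```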

1.2 I assume for each tree, the whole counts matrix is used, i.e. all observations of gene expression profiles. How are conflicts handled? For example, in my above diagram, if Gene3>2 has a few counts of GeneX>10, but ALSO a few counts of GeneX<=10? Is it a winner-takes-all binary split? Do these conflicts directly affect the weight of each tree, i.e. the "sums of total variance reduction"? If not the whole matrix, what is the splitting/bagging heuristic?

Yes, the whole matrix is used; "conflicts" are simply minimized, with conflicting samples ending up in the "wrong" branches, and this ends up affecting the final score (reduction of variance) at each node.

1.3 How does the tree arrive at the binary split threshold at each node? And what about the final output Gene X/j (e.g. Gene X > or <= "10")?

The split that minimizes impurity (i.e. maximizes the variance reduction) is calculated at each node.
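To check my understanding of the criterion, here is a brute-force sketch (my own code, not the package's) of picking the threshold for one candidate input gene by maximizing the variance reduction #S\*Var(S) - #S_t\*Var(S_t) - #S_f\*Var(S_f) described in the paper:

```r
# x: expression of one candidate input gene at this node
# y: expression of the target gene j for the same observations
pop_var <- function(z) mean((z - mean(z))^2)   # population variance; 0 for a single value

best_split <- function(x, y) {
  total <- length(y) * pop_var(y)              # #S * Var(S)
  best  <- list(threshold = NA, reduction = -Inf)
  for (thr in sort(unique(x))[-1]) {           # candidate thresholds between observed values
    left  <- y[x <  thr]
    right <- y[x >= thr]
    reduction <- total - length(left) * pop_var(left) - length(right) * pop_var(right)
    if (reduction > best$reduction) best <- list(threshold = thr, reduction = reduction)
  }
  best
}

# quick check on toy data
set.seed(1)
x <- rnorm(50); y <- ifelse(x > 0.3, 2, 0) + rnorm(50, sd = 0.1)
best_split(x, y)   # should find a threshold near 0.3
```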

1.4 Why was "sums of total variance reduction" used as opposed to a more traditional metric, e.g. Gini impurity?

Need information from the authors.

1.5 Ultimately, decision trees should each "vote" on an outcome... But not in this application, right? Only the total variance reduction matters? Or does it work in another way entirely?

Yes, the decision trees are not used in the traditional predictive sense; what matters is the metadata (the reduction in variance at each node).
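Put differently (my paraphrase of the paper, with made-up toy data structures rather than the package's internals): each input gene's importance for target j is the sum of the variance reductions at the nodes where it was used for splitting, averaged over the trees, and these per-target importances are what fill the weight matrix.

```r
# Toy "forest": each tree is summarized by which input gene was split on at each
# internal node and the variance reduction achieved there (hypothetical structures)
forest <- list(
  data.frame(split_gene = c("Gene1", "Gene4"), var_reduction = c(3.2, 1.1)),
  data.frame(split_gene = c("Gene4", "Gene2"), var_reduction = c(2.5, 0.7))
)

input_genes  <- c("Gene1", "Gene2", "Gene4")
importance_j <- setNames(numeric(length(input_genes)), input_genes)

for (tree in forest) {
  for (k in seq_len(nrow(tree))) {
    g <- tree$split_gene[k]
    importance_j[g] <- importance_j[g] + tree$var_reduction[k]
  }
}

importance_j / length(forest)   # candidate weights w_ij for target gene j
```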

1.6 The vignette mentions 2 ways of thresholding: for example, top 5 per gene, or weight > 0.1. This was also left open in the paper. What value was used in the paper to win the DREAM4 challenge? How did the authors arrive at it?

Need information from the authors.
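For reference, these are the two vignette-style ways of thresholding I was referring to (assuming I'm reading getLinkList()'s arguments correctly):

```r
library(GENIE3)

weightMat <- GENIE3(exprMatr)   # weight matrix of regulator -> target scores

linkList_all   <- getLinkList(weightMat)                  # every link, ranked by weight
linkList_top   <- getLinkList(weightMat, reportMax = 5)   # keep only the 5 strongest links
linkList_thres <- getLinkList(weightMat, threshold = 0.1) # keep only links with weight above 0.1
```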

  2. About the E. coli microarray dataset used: Standard RF bagging procedure. I'm not sure whether sampling with replacement (i.e. duplicates) is allowed, though? The other resources I've read seem to have it on by default.
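For concreteness, this is what I mean by the standard bagging draw (my own sketch, not the package code): each tree gets a bootstrap sample of the observations, drawn with replacement, so duplicates can occur.

```r
# One bootstrap (bagging) draw over the 907 expression profiles for a single tree
n_obs    <- 907
boot_idx <- sample(n_obs, size = n_obs, replace = TRUE)  # replace = TRUE allows duplicates
length(unique(boot_idx))  # typically ~63% of the profiles appear at least once
```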