AllenWLynch / lisa

MIT License

The number of num_background_genes #5

Closed shangguandong1996 closed 3 years ago

shangguandong1996 commented 3 years ago

Hi, Allen

I noticed that the default value of num_background_genes is 3000. Why should I set num_background_genes to something other than the whole gene set? After all, using the whole gene set would not slow things down or use too much memory. However, randomly selecting background genes makes the final result inconsistent from run to run, even though the results do not differ much.

Best wishes

Guandong Shang

Jingyu-Fan commented 3 years ago

Hi Guandong, I am one of Lisa's authors and I can help you with this question.

We use 3k background genes as the default mainly for efficiency. Lisa first selects relevant DNase/H3K27ac samples representing the "chromatin landscape" of the input gene set. Then, for each TF ChIP-seq dataset (around 7k in total for human), and for every gene in the input and background gene sets, "in silico deletion" is performed to erase the "chromatin landscape" signal in the peak regions surrounding the gene. This is computationally intensive. Please check the paper for details.
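To give a rough sense of the scaling, here is a minimal back-of-the-envelope sketch. It assumes the cost is proportional to the number of (TF ChIP-seq dataset, gene) pairs, which is a simplification of what Lisa actually computes. The function name, the 200-gene query size, and the 20,000-gene "whole gene set" figure are hypothetical values chosen for illustration; only the 3k default and the "around 7k" dataset count come from this thread.

```python
def deletion_evaluations(n_tf_datasets, n_query_genes, n_background_genes):
    """Count the (dataset, gene) pairs that would each need a deletion score.

    Simplified cost model (an assumption, not Lisa's real implementation):
    one evaluation per TF ChIP-seq dataset per gene in query + background.
    """
    return n_tf_datasets * (n_query_genes + n_background_genes)


N_TF_DATASETS = 7_000   # "around 7k" human TF ChIP-seq datasets, from the thread
N_QUERY_GENES = 200     # hypothetical input gene set size

# Default background size mentioned in the thread.
default_cost = deletion_evaluations(N_TF_DATASETS, N_QUERY_GENES, 3_000)

# Hypothetical "whole gene set" background of 20,000 genes.
whole_set_cost = deletion_evaluations(N_TF_DATASETS, N_QUERY_GENES, 20_000)

print(f"default background:   {default_cost:,} evaluations")
print(f"whole-set background: {whole_set_cost:,} evaluations")
print(f"ratio: {whole_set_cost / default_cost:.2f}x")
```

Under these assumptions the whole-set background needs several times more evaluations than the default, which is the efficiency concern described above.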

The default background genes are actually not randomly selected. There is a predefined list of genes that is used as background. They were selected so that 1) those genes are relatively consistently active across cell types and 2) those genes are not enriched in any gene ontology term. Setting all the rest of the genes as the background is not necessary and would not make much sense to me... The dynamic of gene expression changes are more towards a matter of degree rather than binary. We cannot assume all other genes are not regulated at all.
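As an illustration of criterion 1 only, here is a toy sketch of how one might pick genes that are "consistently active across cell types". This is not the procedure used to build Lisa's background list, which is not described in this thread. The function name, the thresholds, and the expression values are all hypothetical, and the gene ontology criterion is not modeled at all.

```python
from statistics import mean, pstdev


def consistently_active(expression_by_gene, min_mean=1.0, max_cv=0.5):
    """Return genes that are expressed and stable across cell types.

    Hypothetical filter: keep a gene if its mean expression is at least
    `min_mean` and its coefficient of variation (std / mean) is at most
    `max_cv`. Both thresholds are illustrative assumptions.
    """
    kept = []
    for gene, values in expression_by_gene.items():
        mu = mean(values)
        if mu < min_mean:
            continue  # barely expressed: not "active"
        if pstdev(values) / mu <= max_cv:
            kept.append(gene)
    return sorted(kept)


# Made-up expression values across four cell types.
toy_expression = {
    "GENE_A": [10.0, 11.0, 9.0, 10.0],  # active and stable
    "GENE_B": [0.0, 40.0, 0.0, 0.0],    # cell-type specific
    "GENE_C": [0.1, 0.2, 0.1, 0.1],     # barely expressed
}

background = consistently_active(toy_expression)
print(background)
```

Because a list built this way is fixed rather than sampled, repeated runs would use the same background genes.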

Hope your question has been addressed well.

shangguandong1996 commented 3 years ago

Jingyu, thanks for your reply :). This is very helpful for me :) I think I understand what you mean:

  1. "In silico deletion" for each gene is computationally intensive, so the number of background and input genes should be controlled.
  2. There should not be dominant TFs in the background genes, so you use some screening conditions to get unrelated genes. If the whole gene set were used, there might be some dominant TFs.

And I am confused about this sentence:

"The dynamic of gene expression changes are more towards a matter of degree rather than binary"

Guandong Shang

Jingyu-Fan commented 3 years ago

For example, when cells are perturbed, differential genes are those whose change is statistically significant (FDR) and that show a large change in expression level (fold change). But this does not mean that all the rest of the genes stay constant during the perturbation. They might still be regulated, but with too little confidence to confirm.
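A small sketch of this point, using made-up numbers: genes are called "differential" only when they pass both an FDR cutoff and a fold-change cutoff, yet the genes that fail can still show a non-zero change. The cutoffs (FDR 0.05, absolute log2 fold change 1.0), the function name, and the gene values are hypothetical and are not taken from Lisa or the paper.

```python
def call_differential(genes, fdr_cutoff=0.05, min_abs_log2fc=1.0):
    """Split genes into 'differential' and 'not called'.

    `genes` maps a gene name to a (log2 fold change, FDR) pair.
    Both cutoffs are illustrative assumptions.
    """
    calls = {}
    for name, (log2fc, fdr) in genes.items():
        passes = fdr <= fdr_cutoff and abs(log2fc) >= min_abs_log2fc
        calls[name] = "differential" if passes else "not called"
    return calls


# Made-up results: (log2 fold change, FDR).
toy_results = {
    "GENE_X": (2.5, 0.001),  # large, confident change
    "GENE_Y": (0.6, 0.03),   # confident but small change
    "GENE_Z": (1.4, 0.20),   # large change, low confidence
}

calls = call_differential(toy_results)

# Genes that were not called but still changed to some degree.
changed_but_not_called = sorted(
    name for name, label in calls.items()
    if label == "not called" and toy_results[name][0] != 0.0
)
print(calls)
print(changed_but_not_called)
```

Here GENE_Y and GENE_Z are not called differential, but treating them as unregulated background would ignore a real, if less certain, change.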

shangguandong1996 commented 3 years ago

Got it :)