cnio-bu / cluster_rnaseq


Filtering out low count genes #12

Open mj-jimenez opened 1 year ago

mj-jimenez commented 1 year ago

In deseq_init.R, one line removes genes with fewer than 10 counts in all samples. This step was copied from the DESeq2 vignette and aims to reduce memory usage and speed up computation.

However, I was told that some users create separate DESeq2 objects from the same count matrix in order to compare two groups of samples at a time. For example, given this design matrix:

| sample      | condition | genotype |
|-------------|-----------|----------|
| WT_treat1   | treated   | WT       |
| WT_treat2   | treated   | WT       |
| WT_treat3   | treated   | WT       |
| WT_control1 | control   | WT       |
| WT_control2 | control   | WT       |
| WT_control3 | control   | WT       |
| KO_treat1   | treated   | KO       |
| KO_treat2   | treated   | KO       |
| KO_treat3   | treated   | KO       |
| KO_control1 | control   | KO       |
| KO_control2 | control   | KO       |
| KO_control3 | control   | KO       |

Some users run two analyses to compare treatment vs control within each genotype (instead of modelling a complex design such as ~genotype + condition). Because the prefilter is then applied to each subset of samples separately, a gene with low counts in only one genotype is dropped from that analysis alone, so the two sets of normalized counts will not contain exactly the same genes.
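To make the effect concrete, here is a minimal sketch in plain Python (not the pipeline's R code). It assumes the prefilter keeps a gene when its counts summed over the samples in the object reach 10; the exact rule in deseq_init.R may differ. The count values, the reduced sample list, and the `prefilter` helper are all invented for illustration.

```python
# Toy count matrix: gene -> {sample: raw count}. All numbers are invented.
counts = {
    "geneA": {"WT_treat1": 50, "WT_control1": 40, "KO_treat1": 60, "KO_control1": 55},
    "geneB": {"WT_treat1": 6, "WT_control1": 7, "KO_treat1": 0, "KO_control1": 1},
    "geneC": {"WT_treat1": 1, "WT_control1": 2, "KO_treat1": 3, "KO_control1": 2},
}

MIN_TOTAL = 10  # assumed threshold: keep genes whose summed count is at least 10


def prefilter(counts, samples, min_total=MIN_TOTAL):
    """Return the set of genes whose summed count over `samples` is >= min_total."""
    return {
        gene
        for gene, per_sample in counts.items()
        if sum(per_sample[s] for s in samples) >= min_total
    }


wt = ["WT_treat1", "WT_control1"]
ko = ["KO_treat1", "KO_control1"]

kept_all = prefilter(counts, wt + ko)  # one object holding every sample
kept_wt = prefilter(counts, wt)        # separate object, WT samples only
kept_ko = prefilter(counts, ko)        # separate object, KO samples only

# Genes present in one per-genotype analysis but missing from the other.
print(sorted(kept_wt ^ kept_ko))
```

In this toy data geneB passes the threshold in the WT-only object and in the combined object, but not in the KO-only object, so the two per-genotype results would list different genes.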

First, I would like to know how common this procedure is. Maybe some regular users can give feedback @ELENAPINEIRO @ralvarez-hub @jlanillos @lserranor @Maria-rfranklin.

As the authors state in the vignette, this is not an essential step. If the above procedure is standard, maybe we can just remove these lines. @SGMartin what do you think?

SGMartin commented 1 year ago

I honestly think this is not an issue, but removing the prefiltering step is also unlikely to cause any harm. On the other hand, generating one dds object per comparison should not be the default approach for any RNA-Seq analysis.

PS: The aforementioned design can be simplified to a single compound condition (one factor combining genotype and condition), unless users are interested in specific interaction terms.
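As a rough illustration of the compound condition idea, the sketch below (plain Python, not DESeq2 code) rebuilds the design table from the issue, derives one combined label per sample, and lists the within-genotype comparisons as pairs of levels. The `group` name and the `genotype_condition` label format are assumptions made for this example. In DESeq2 this would presumably correspond to a `~ group` design with each pair used as a contrast, but that mapping is my reading of the comment, not something stated in the thread.

```python
from itertools import product

genotypes = ["WT", "KO"]
conditions = ["treated", "control"]
replicates = [1, 2, 3]

# Rebuild the design table from the issue as (sample, condition, genotype) rows.
short = {"treated": "treat", "control": "control"}
design = [
    (f"{g}_{short[c]}{r}", c, g)
    for g, c, r in product(genotypes, conditions, replicates)
]

# Hypothetical compound factor: one level per genotype/condition combination.
group = {sample: f"{g}_{c}" for sample, c, g in design}

# Within-genotype comparisons as (numerator level, denominator level) pairs.
contrasts = [(f"{g}_treated", f"{g}_control") for g in genotypes]

for numerator, denominator in contrasts:
    print(f"{numerator} vs {denominator}")
```

With a single object, the prefilter runs once over all twelve samples, so both comparisons are reported on the same set of genes.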