DavisLaboratory / singscore

An R/Bioconductor package that implements a single-sample molecular phenotyping approach
https://davislaboratory.github.io/singscore/

Abundance Matrix Preprocessing #24

Closed DarioS closed 3 years ago

DarioS commented 3 years ago

Could the vignette include instructions on how to preprocess the gene abundance matrix before calculating the ranks? For example, I have total RNA-seq data that was treated with RiboZero to deplete ribosomal RNA, so I have about 20 thousand rows for protein-coding genes and about another 20 thousand rows for long noncoding RNA (lncRNA). Most gene set and pathway databases do not include any lncRNA in their pathways and networks, yet those measurements still affect the rankings of the protein-coding genes. The vignette should stress the importance of this step.
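
A toy illustration of the concern, as a language-agnostic sketch in Python rather than singscore's R API. The gene names, the abundances, and the simplified "normalised mean rank" are all made up for illustration and are not singscore's actual scoring formula; the point is only that extra rows change the ranks of the genes in a set.

```python
def ranks(abundances):
    """Rank genes in ascending order of abundance (1 = lowest).

    Ties are not handled; the toy values below are all distinct.
    """
    ordered = sorted(abundances, key=abundances.get)
    return {gene: position + 1 for position, gene in enumerate(ordered)}


def normalised_mean_rank(abundances, gene_set):
    """Mean rank of the gene set divided by the number of ranked genes.

    A simplified stand-in for a rank-based score, not singscore's formula.
    """
    gene_ranks = ranks(abundances)
    mean_rank = sum(gene_ranks[g] for g in gene_set) / len(gene_set)
    return mean_rank / len(abundances)


# Hypothetical abundances for a single sample.
protein_coding = {"PC1": 50.0, "PC2": 8.0, "PC3": 120.0, "PC4": 3.0}
lncrna = {"LNC1": 0.5, "LNC2": 1.0, "LNC3": 2.0, "LNC4": 200.0}

# Hypothetical gene set containing protein-coding genes only.
gene_set = ["PC1", "PC3"]

score_pc_only = normalised_mean_rank(protein_coding, gene_set)
score_all = normalised_mean_rank({**protein_coding, **lncrna}, gene_set)

print(score_pc_only)  # 0.875
print(score_all)      # 0.8125
```

The same two genes get a different score depending only on whether the lncRNA rows are present when the ranks are computed.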

bhuvad commented 3 years ago

Hi @DarioS,

In such a scenario, you would generally use the protein-coding genes only, as those are the only ones represented in the gene sets you are testing against. Another approach we tend to use is to filter out genes with low expression (rather than filtering by biotype), as this gives a better idea of relative expression (against the entire transcriptome). You are right that we do not cover this in the vignette; this is mainly because the information is available elsewhere and we did not want to duplicate it. The detailed discussion of this matter, along with many others you would face while using singscore, is in the workflow paper we published in F1000Research (https://f1000research.com/articles/8-776). Since this workflow covers all processing steps, it is much more detailed and allowed us to discuss the implications of each decision point in the analysis. Feel free to ask us for further help on matters not discussed in the workflow.

Cheers, Dharmesh
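
For readers skimming this thread, the two filtering options mentioned above can be sketched as follows. This is a minimal Python illustration, not singscore's R API and not the procedure from the workflow paper: the matrix, the biotype annotation, and the `min_abundance` and `min_samples` thresholds are all hypothetical.

```python
# Hypothetical abundance matrix: gene -> abundance in each of three samples.
abundance = {
    "PC1": [50.0, 40.0, 60.0],
    "PC2": [0.2, 0.0, 0.4],
    "LNC1": [0.1, 0.3, 0.0],
    "LNC2": [15.0, 22.0, 9.0],
}

# Hypothetical biotype annotation for the same genes.
biotype = {
    "PC1": "protein_coding",
    "PC2": "protein_coding",
    "LNC1": "lncRNA",
    "LNC2": "lncRNA",
}


def keep_protein_coding(abundance, biotype):
    """Option 1: keep only genes annotated as protein coding."""
    return {
        gene: values
        for gene, values in abundance.items()
        if biotype.get(gene) == "protein_coding"
    }


def keep_expressed(abundance, min_abundance=1.0, min_samples=2):
    """Option 2: keep genes reaching min_abundance in at least min_samples samples.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    return {
        gene: values
        for gene, values in abundance.items()
        if sum(value >= min_abundance for value in values) >= min_samples
    }


by_biotype = keep_protein_coding(abundance, biotype)
by_expression = keep_expressed(abundance)

print(sorted(by_biotype))     # ['PC1', 'PC2']
print(sorted(by_expression))  # ['LNC2', 'PC1']
```

The two filters keep different rows: the biotype filter retains a barely expressed protein-coding gene, while the expression filter retains a well-expressed lncRNA, so ranks computed afterwards are relative to different backgrounds.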