DavisLaboratory / singscore

An R/Bioconductor package that implements a single-sample molecular phenotyping approach
https://davislaboratory.github.io/singscore/

Using singscore for comparison between different datasets #40

Closed: ICA-D closed this issue 1 year ago

ICA-D commented 1 year ago

Hi, thank you for creating such a useful and intuitive package; it has been very useful in my work so far.

I have been using singscore on a patient cohort to look at the association between scores for a gene set and clinical features. Gene expression in this initial dataset comes from RNA-seq data, which I normalised appropriately before using singscore. To validate any findings, I am trying to use a secondary, publicly available dataset whose gene expression data is microarray-based.

Can I make any comparisons between the two datasets in terms of the absolute values of scores (e.g. scores are generally higher in one dataset than the other, suggesting higher expression of the genes in my gene set)? Given that scores are calculated within each sample, and that the dynamic range of scores should be the same, would this be a valid approach?

Many thanks again.

Malvikakh commented 1 year ago

Hi @ICA-D,

Thank you for using singscore! We are glad to hear that you found our package helpful for your research.

Scores from singscore cannot be directly compared between two different datasets. The absolute values of the scores do not hold inherent meaning. Rather, a score indicates how the genes in the gene set are expressed relative to the other genes measured in the same dataset. Comparing scores between datasets would overlook the context and specific regulatory patterns within each dataset, potentially leading to incorrect or misleading conclusions.
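To make the point about relative ranking concrete, here is a minimal base R sketch. It uses a simplified rank-based score (mean within-sample rank of the gene set, scaled by the number of measured genes), which is an assumption for illustration and not singscore's exact formula. The gene names, expression values and the `toy_score` helper are all hypothetical.

```r
# Simplified rank-based score for a single sample.
# NOTE: illustrative only; this is not singscore's exact implementation.
toy_score <- function(expr, gene_set) {
  ranks <- rank(expr)                    # rank all measured genes within the sample
  mean(ranks[gene_set]) / length(expr)   # mean rank of the set, scaled to (0, 1]
}

# One hypothetical sample on a platform that measures six genes
full <- c(G1 = 9.1, G2 = 7.4, G3 = 5.0, G4 = 3.2, G5 = 2.8, G6 = 1.5)
gene_set <- c("G1", "G3")

# The same sample on a hypothetical platform that lacks three low-expressed genes
partial <- full[c("G1", "G2", "G3")]

score_full <- toy_score(full, gene_set)
score_partial <- toy_score(partial, gene_set)

print(c(full = score_full, partial = score_partial))
```

The expression values of the gene set are identical in both cases, yet the two scores differ because the set is ranked against a different background of genes. This is one way in which platform differences can shift scores without any change in biology.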

You are on the right track by normalising the data before using singscore. Similarly, you should account for batch effects prior to using singscore. Though singscore can handle sample-specific effects, you would need more powerful methods if there are batch effects that alter the relative ranking of genes (which would be the case with microarrays, as they profile a specific set of genes and thereby introduce a bias), or other more complex kinds of batch effects that could affect datasets generated across different studies.

You could use a method like RUV to remove these effects. There are a few variants of RUV that work with negative control genes (genes you expect not to vary across the datasets) and/or replicates/pseudo-replicates (RUV4) across datasets. If you wanted to use this, you could use the negative control genes we identified in the follow-up singscore paper (DOI: 10.1093/nar/gkaa802). These can be obtained from within the singscore package using the getStableGenes(300) function, which would give you 300 negative control genes that are known to be stable across various cancers and normal tissues (solid tissues).
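As a rough sketch of the first step, the stable genes can be turned into a negative control index for a combined expression matrix. The toy matrix, the sample names, the `OTHER_*` row names and the `STABLE_*` fallback names below are hypothetical; only getStableGenes() comes from singscore. The fallback exists solely so that the snippet runs when singscore is not installed.

```r
# Stable genes to use as negative controls. getStableGenes() is the singscore
# function mentioned above; the fallback names are hypothetical placeholders.
stable <- if (requireNamespace("singscore", quietly = TRUE)) {
  singscore::getStableGenes(300)
} else {
  c("STABLE_1", "STABLE_2", "STABLE_3")
}

# Hypothetical combined expression matrix (genes x samples) from both datasets.
# The first two rows are named after stable genes so that they match below.
expr <- matrix(
  c(8.2, 8.0, 7.9, 8.1,
    6.5, 6.4, 6.6, 6.3,
    9.0, 4.1, 8.7, 3.9,
    2.2, 5.8, 2.5, 6.1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c(stable[1:2], "OTHER_1", "OTHER_2"),
                  c("rnaseq_1", "array_1", "rnaseq_2", "array_2"))
)

# Logical index marking which measured genes are negative controls
ctl <- rownames(expr) %in% stable
print(ctl)
```

An index like `ctl` is the kind of input a negative-control-based RUV method would need. The exact argument names and the expected matrix orientation depend on the RUV implementation you choose, so check its documentation before passing this in.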

I hope this information helps answer your question. If you have any further questions, suggestions, or if there's anything else we can assist you with, please feel free to reach out to us.

Thanks, Malvika