SydneyBioX / scMerge

Statistical approach for removing unwanted variation from multiple single-cell datasets
https://sydneybiox.github.io/scMerge/
66 stars 13 forks source link

any suggested cutoff for stably expressed genes, segIdx #28

Closed wangmhan closed 1 year ago

wangmhan commented 4 years ago

I want to ask if there is any suggested cutoff for stably expressed genes. I want to get the stably expressed genes (SEGs) from my dataset, while the function scSEGIndex return matrix with several statistics. Just wondering if there is cutoff for the statistics to get SEGs.

Even in the paper, this part is not very clear, I only found one description: Genes with a stability index rank percentile >80 as well as a reversed rank percentile >60 for each of the 4 stability features were included in the SEG list.

Thanks for helping.

kevinwang09 commented 4 years ago

Hi, the SEG index is a statistics that summarises the stability of a gene as defined in a separate paper: https://academic.oup.com/gigascience/article/8/9/giz106/5570567. Higher SEG value means the mean is more stable.

Thus, to get the genes in the top 20% (80% percentile) of the dataset, you may use the code below and extract the rownames.

Hope that helps.

library(scMerge)
## Loading example data
data('example_sce', package = 'scMerge')
## subsetting genes to illustrate usage.
exprs_mat = SummarizedExperiment::assay(example_sce, 'logcounts')[1:110, 1:20]
set.seed(1)
result1 = scSEGIndex(exprs_mat = exprs_mat)
#> Calculating scSEG index without F-statistics
#> Fitting the mixture model...
seg = result1$segIdx

result1[seg >= quantile(seg, 0.8),]
#>                       segIdx rho     sigma        mu mu.scaled zero
#> ENSMUSG00000020152 0.7217125   1 0.5455141 12.655163 0.9131146    0
#> ENSMUSG00000022295 0.6513761   1 0.7774563  9.353649 0.5835237    0
#> ENSMUSG00000002015 0.6391437   1 0.8483420 10.314549 0.6794506    0
#> ENSMUSG00000024073 0.6788991   1 0.6954993 10.426437 0.6906204    0
#> ENSMUSG00000022247 0.7370031   1 0.4592164 11.337434 0.7815654    0
#> ENSMUSG00000051223 0.7308869   1 0.5006202 11.980963 0.8458090    0
#> ENSMUSG00000020368 0.7278287   1 0.5199094 13.437971 0.9912625    0
#> ENSMUSG00000027184 0.6422018   1 0.8128390 10.853291 0.7332333    0
#> ENSMUSG00000075266 0.6911315   1 0.6429432  8.308871 0.4792233    0
#> ENSMUSG00000028044 0.7125382   1 0.5788207 10.252988 0.6733049    0
#> ENSMUSG00000035885 0.7461774   1 0.4005495 10.042249 0.6522668    0
#> ENSMUSG00000074698 0.6850153   1 0.6840875  8.230288 0.4713784    0
#> ENSMUSG00000020719 0.6605505   1 0.7289491 11.337287 0.7815507    0
#> ENSMUSG00000029169 0.7155963   1 0.5757002  9.919718 0.6400346    0
#> ENSMUSG00000035530 0.7064220   1 0.5973281  9.911616 0.6392257    0
#> ENSMUSG00000067194 0.6819572   1 0.6865677  9.847681 0.6328430    0
#> ENSMUSG00000045983 0.6636086   1 0.7275590  9.492561 0.5973914    0
#> ENSMUSG00000026083 0.6574924   1 0.7455318  9.907629 0.6388276    0
#> ENSMUSG00000024360 0.6880734   1 0.6783188 10.324937 0.6804876    0
#> ENSMUSG00000028034 0.6452599   1 0.7883429 10.848090 0.7327141    0
#> ENSMUSG00000027680 0.6972477   1 0.6197897  9.940147 0.6420739    0
#> ENSMUSG00000018583 0.7186544   1 0.5649137 11.301127 0.7779409    0

Created on 2020-05-27 by the reprex package (v0.3.0)

wangmhan commented 4 years ago

Hi, Thanks for the reply. Besides segIdx, there are also other statistics such as rho, sigma, mu, etc. Should I also set threshold for them, or segIdx is enough?

On Wed, 27 May 2020 at 02:53, kevinwang notifications@github.com wrote:

Hi, the SEG index is a statistics that summarises the stability of a gene as defined in a separate paper: https://academic.oup.com/gigascience/article/8/9/giz106/5570567. Higher SEG value means the mean is more stable.

Thus, to get the genes in the top 20% (80% percentile) of the dataset, you may use the code below and extract the rownames.

Hope that helps.

library(scMerge)## Loading example data data('example_sce', package = 'scMerge')## subsetting genes to illustrate usage.exprs_mat = SummarizedExperiment::assay(example_sce, 'logcounts')[1:110, 1:20] set.seed(1)result1 = scSEGIndex(exprs_mat = exprs_mat)#> Calculating scSEG index without F-statistics#> Fitting the mixture model...seg = result1$segIdx result1[seg >= quantile(seg, 0.8),]#> segIdx rho sigma mu mu.scaled zero#> ENSMUSG00000020152 0.7217125 1 0.5455141 12.655163 0.9131146 0#> ENSMUSG00000022295 0.6513761 1 0.7774563 9.353649 0.5835237 0#> ENSMUSG00000002015 0.6391437 1 0.8483420 10.314549 0.6794506 0#> ENSMUSG00000024073 0.6788991 1 0.6954993 10.426437 0.6906204 0#> ENSMUSG00000022247 0.7370031 1 0.4592164 11.337434 0.7815654 0#> ENSMUSG00000051223 0.7308869 1 0.5006202 11.980963 0.8458090 0#> ENSMUSG00000020368 0.7278287 1 0.5199094 13.437971 0.9912625 0#> ENSMUSG00000027184 0.6422018 1 0.8128390 10.853291 0.7332333 0#> ENSMUSG00000075266 0.6911315 1 0.6429432 8.308871 0.4792233 0#> ENSMUSG00000028044 0.7125382 1 0.5788207 10.252988 0.6733049 0#> ENSMUSG00000035885 0.7461774 1 0.4005495 10.042249 0.6522668 0#> ENSMUSG00000074698 0.6850153 1 0.6840875 8.230288 0.4713784 0#> ENSMUSG00000020719 0.6605505 1 0.7289491 11.337287 0.7815507 0#> ENSMUSG00000029169 0.7155963 1 0.5757002 9.919718 0.6400346 0#> ENSMUSG00000035530 0.7064220 1 0.5973281 9.911616 0.6392257 0#> ENSMUSG00000067194 0.6819572 1 0.6865677 9.847681 0.6328430 0#> ENSMUSG00000045983 0.6636086 1 0.7275590 9.492561 0.5973914 0#> ENSMUSG00000026083 0.6574924 1 0.7455318 9.907629 0.6388276 0#> ENSMUSG00000024360 0.6880734 1 0.6783188 10.324937 0.6804876 0#> ENSMUSG00000028034 0.6452599 1 0.7883429 10.848090 0.7327141 0#> ENSMUSG00000027680 0.6972477 1 0.6197897 9.940147 0.6420739 0#> ENSMUSG00000018583 0.7186544 1 0.5649137 11.301127 0.7779409 0

Created on 2020-05-27 by the reprex package https://reprex.tidyverse.org (v0.3.0)

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/SydneyBioX/scMerge/issues/28#issuecomment-634359180, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJVYTVQAABZ3Y4DOSUTXPELRTRQCNANCNFSM4NJTCNSA .

kevinwang09 commented 4 years ago

Hi, segIdx is sufficient, it is a summary of all the other components. Thanks, Kevin

adf-ncgr commented 4 years ago

Hi, in regard to the additional statistics, I was having some difficulty relating some of them to the description given in the paper. Is rho in the table the same as the lambda in the paper (representing the mixture coefficient)? Also, is the zero column equivalent to omega-star in the paper?
thanks!

YingxinLin commented 4 years ago

Hi,

The rho corresponding to the 1-lambda in the paper (i.e., rho represents the mixing proportion of the Gaussian component); and the zero in the table corresponding to the omega-star in the paper.

Yingxin

adf-ncgr commented 4 years ago

Great, thanks for clarifying that; one reason I was a little confused is that in the dataset to which I was applying it, the rho value is not correlated with segIndex, which seems strange since it is related to lambda (see correlation table below). It also seems strange that mu would be negatively correlated with segIdx in my dataset (the example data given with the package behaves more as I'd expect for both of these stats).

segIdx rho sigma mu mu.scaled zero
segIdx 1.000000000 0.008261562 -0.6582312 -0.6935119 -0.6935119 -0.8460289

I created the data that produced the above result from a 10x dataset as shown below; if you have any thoughts on what might account for the seemingly unusual behavior of the segIdx with respect to rho and mu I'd be interested to hear them. Possibly just some mistake I've made in getting the expression matrix transformed appropriately from 10x output? Thanks again, Andrew

library(DropletUtils) sce<-read10xCounts(samples="WT1/filtered_feature_bc_matrix/") library("SingleCellExperiment") counts <- assay(sce, "counts") libsizes <- colSums(counts) size.factors <- libsizes/mean(libsizes) logcounts(sce) <- log2(t(t(counts)/size.factors) + 1) exprs_mat = SummarizedExperiment::assay(sce, 'logcounts') library(scMerge) set.seed(1) result1 = scSEGIndex(exprs_mat = exprs_mat)

YingxinLin commented 4 years ago

Hi Andrew,

Our current SEG model was designed for the low-throughput theology data (such as SMART-seq). In particular, the mixture model does not fit the 10x UMI data well. If you are looking for negative control genes as scMerge input, probably consider using our derived SEGs.

Best wishes, Yingxin

adf-ncgr commented 4 years ago

Thanks very much for the further clarification and suggestion, you've been very helpful. Andrew