davismcc / archive-scater

An archived version of the scater repository, see https://github.com/davismcc/scater for the active version.

exprs and cpm are different #110

Closed galib36 closed 7 years ago

galib36 commented 7 years ago

Hello Davis, when I calculate the CPM values using log2(calculateCPM(cdSceset) + 1), the result differs from the default exprs values obtained when I create the sce object with cdSceset <- newSCESet(countData = cd, phenoData = pheno_data_fil). The count data are raw counts. Also, when I try to normalize the data using

qclust <- scran::quickCluster(cdSceset)
cdSceset <- scran::computeSumFactors(cdSceset, sizes = 15, clusters = qclust)
cdSceset <- scran::normalize(cdSceset)

it gives me NA for some of the cells. Could you please let me know what is going wrong?

Thank you.

Best Regards, Syed

LTLA commented 7 years ago

The exprs values are log-(normalized) counts, i.e., close to log2(count + 1) after adjusting the counts for differences in library size between cells. They are not log-CPMs.

As for the NA values, you're probably getting negative size factors. See the instructions in the computeSumFactors documentation for how to deal with them.
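
A quick check along those lines (a minimal sketch, assuming the size factors computed by computeSumFactors can be retrieved with sizeFactors()):

summary(sizeFactors(cdSceset))     # NA expression values after normalize() usually trace back to these
table(sizeFactors(cdSceset) <= 0)  # how many cells have non-positive size factors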

Note that you should be using a range of sizes from 20 to 100. With fewer sizes, the size factor estimates are less precise, and if the sizes are too small, you will get inaccurate results.
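
For example (a sketch only, reusing the objects from the question; the exact pool sizes are illustrative):

# A range of pool sizes from 20 to 100 gives more precise size factor estimates
# than a single small value such as sizes = 15.
cdSceset <- scran::computeSumFactors(cdSceset, sizes = seq(20, 100, 20), clusters = qclust)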

galib36 commented 7 years ago

Thank you for your reply. However, the vignette says,

You can create a new SCESet object using count data as follows. In this case, the exprs slot in the object will be generated as log2(counts-per-million) using the cpm function from edgeR, with a "prior count" value of 1.

In my dataset, for a specific gene in a specific cell, the default exprs gives 6.904828

and log2(cpm + 1) using calculateCPM() gives 10.333072.

I also tried to calculate it as you mentioned, log2(get_exprs(cdSceset, 'counts')[20,1]/colSums(get_exprs(cdSceset, 'counts')[,1:2])[1] + 1), but the result is a very low number, as expected.

LTLA commented 7 years ago

The vignette is wrong and should be updated.

Your calculation is also incorrect. The library sizes need to be centred at unity across cells before using them to divide the counts; otherwise the log-expression values would not be on the same scale as the log-counts. Also note that centring is done across all cells by default.
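
A minimal sketch of that centring step, reusing the get_exprs() calls from the comment above (the [20, 1] index is just the single entry being checked):

lib.sizes <- colSums(get_exprs(cdSceset, 'counts'))  # library size per cell
lib.sizes <- lib.sizes / mean(lib.sizes)             # centre at unity across all cells
log2(get_exprs(cdSceset, 'counts')[20, 1] / lib.sizes[1] + 1)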

galib36 commented 7 years ago

But the centring is also not giving me the correct value

log2(get_exprs(cdSceset, 'counts')[20,1]/(scale(colSums(get_exprs(cdSceset, 'counts')), center=TRUE, scale=FALSE)[1] + 1))

LTLA commented 7 years ago

Well, let's have a look at an example:

example(newSCESet)
exprs(example_sceset)[1,]

Calculating it manually:

lib.sizes <- colSums(counts(example_sceset))  # library size for each cell
lib.sizes <- lib.sizes/mean(lib.sizes)        # centre the library sizes at unity
log2(counts(example_sceset)[1,]/lib.sizes+1)  # log-normalized values for the first gene

Both of these things give me the same results (running scater 1.4.0 on R 3.4.0).

galib36 commented 7 years ago

Thank you very much. I missed the scaling to unity and instead did centring with mean zero. The results are now the same in scater 1.4.0.

bchen4 commented 7 years ago

I have a problem with exprs: somehow a raw count of 0 turned into 8.005696, and I am wondering how this happened.

LTLA commented 7 years ago

Please provide a minimum working example of code.

bchen4 commented 7 years ago

data <- newSCESet(countData=count, phenoData=anno)
keep <- rowSums(counts(data) > 0) > 0
keep_data <- data[keep,]
mt <- c("ENSMUSG00000064341", "ENSMUSG00000064345", "ENSMUSG00000064351", "ENSMUSG00000064354", "ENSMUSG00000064356", "ENSMUSG00000064357", "ENSMUSG00000064358", "ENSMUSG00000064360", "ENSMUSG00000065947", "ENSMUSG000000643", "ENSMUSG00000064367", "ENSMUSG00000064368", "ENSMUSG00000064370")
keep_data <- calculateQCMetrics(keep_data, feature_controls = list(MT=mt))
counts(keep_data)[1:5,1:5]
exprs(keep_data)[1:5,1:5]

Then counts output:

                   AAACCTGGTCCAGTTA.1 AAACCTGTCTGATACG.1 AAACGGGAGCGAGAAA.1 AAACGGGCACCCATTC.1 AAACGGGGTCTCACCT.1
ENSMUSG00000051951                  0                  0                  0                  0                  0
ENSMUSG00000089699                  0                  0                  0                  0                  0
ENSMUSG00000025900                  0                  0                  0                  0                  0
ENSMUSG00000025902                  0                  0                  0                  1                  0
ENSMUSG00000104328                  0                  0                  0                  0                  0

exprs output:

                   AAACCTGGTCCAGTTA.1 AAACCTGTCTGATACG.1 AAACGGGAGCGAGAAA.1 AAACGGGCACCCATTC.1 AAACGGGGTCTCACCT.1
ENSMUSG00000051951            8.005696           8.005696           8.005696           8.005696           8.005696
ENSMUSG00000089699            8.005696           8.005696           8.005696           8.005696           8.005696
ENSMUSG00000025900            8.005696           8.005696           8.005696           8.005696           8.005696
ENSMUSG00000025902            8.005696           8.005696           8.005696           9.897117           8.005696
ENSMUSG00000104328            8.005696           8.005696           8.005696           8.005696           8.005696

LTLA commented 7 years ago

The values added by newSCESet are still log-CPMs at this point; you need to run normalize to get log-normalized counts. This discrepancy will be fixed in the next version of scater.

bchen4 commented 7 years ago

I see. Thank you very much.

bchen4 commented 7 years ago

Hi, I still have the problem. I tried to normalize the data using scran, following the tutorial:

qclust<-quickCluster(keep_data)
keep_data<-computeSumFactors(keep_data,clusters=qclust,positive=T)
keep_data <- normalize(keep_data) 

Then I got:

exprs(keep_data)[1:5,1:5]
                   AAACGGGAGCGAGAAA.1 AAACGGGGTGCAACTT.1 AAAGCAATCTCTAAGG.1 AAAGTAGAGTCCTCCT.1 AAAGTAGAGTTAGGTA.1
ENSMUSG00000025902      -2.885387e-06      -2.885387e-06      -2.885387e-06       7.524052e-01      -2.885387e-06
ENSMUSG00000033845      -2.885387e-06      -2.885387e-06       7.861283e-01      -2.885387e-06      -2.885387e-06
ENSMUSG00000025903      -2.885387e-06       4.505800e-01       7.861283e-01      -2.885387e-06      -2.885387e-06
ENSMUSG00000033813      -2.885387e-06      -2.885387e-06      -2.885387e-06      -2.885387e-06      -2.885387e-06
ENSMUSG00000033793      -2.885387e-06       4.505800e-01      -2.885387e-06      -2.885387e-06      -2.885387e-06

counts(keep_data)[1:5,1:5]
                   AAACGGGAGCGAGAAA.1 AAACGGGGTGCAACTT.1 AAAGCAATCTCTAAGG.1 AAAGTAGAGTCCTCCT.1 AAAGTAGAGTTAGGTA.1
ENSMUSG00000025902                  0                  0                  0                  1                  0
ENSMUSG00000033845                  0                  0                  1                  0                  0
ENSMUSG00000025903                  0                  1                  1                  0                  0
ENSMUSG00000033813                  0                  0                  0                  0                  0
ENSMUSG00000033793                  0                  1                  0                  0                  0

My questions are:

1. Following on from my last question: even if the values are log-CPMs, how does a 0 end up as 8.005696?
2. After normalization, why does a raw count of 0 turn into a negative value?
3. Is the log base 2 in this package?

Thank you very much.

LTLA commented 7 years ago

Try using the latest version of scran (1.4.5) and scater (1.4.0). And yes, it's log 2.

The value of the log-CPMs is explained by adding a prior count to avoid undefined values after the log-transformation, multiplying by a million, and dividing by the mean library size. It's a long story to explain why, but it's irrelevant because you won't use those values anyway.
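
As a rough illustration of where a constant like 8.005696 can come from (a sketch only; the mean library size below is a hypothetical value chosen for illustration, not taken from this dataset, and the exact edgeR formula also offsets the library size by the prior):

prior <- 1
mean.lib.size <- 3900              # assumed for illustration only
log2(prior / mean.lib.size * 1e6)  # roughly 8: a zero count plus the prior, on the CPM scale, logged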

davismcc commented 7 years ago

This issue was moved to davismcc/scater#18