Closed galib36 closed 7 years ago
The exprs
are log-(normalized) count values, and are closer to log2(count + 1)
after adjusting the counts for differences in the library sizes between cells. They are not log-CPMs.
As for the NA
values, you're probably getting negative size factors. See the instructions in the computeSumFactors
documentation for how to deal with them.
Note that you should be using a range of sizes
from 20 to 100. With fewer sizes
, the size factor estimates are less precise, and if the sizes
are too small, you will get inaccurate results.
Thank you for your reply. However, the vignette says,
You can create a new SCESet
object using count data as follows. In this case, the exprs
slot in the object will be generated as log2(counts-per-million)
using the cpm
function fromedgeR
, with a “prior count” value of 1
In my dataset the for a specific gene in a specific cell, default exprs
gives 6.904828
and log2(cpm + 1)
using calculateCPM()
gives 10.333072.
I also tried to calculate as you mentioned,
log2(get_exprs(cdSceset, 'counts')[20,1]/colSums(get_exprs(cdSceset, 'counts')[,1:2])[1] + 1)
but the result is a very low number as expected.
The vignette is wrong and should be updated.
Your calculation is also incorrect. The library sizes need to be centred at unity across cells before using them to divide the counts, otherwise the log-expression values would not be on the same scale as the log-counts. Also note that centring is done across all cells by default.
But the centring is also not giving me the correct value
log2(get_exprs(cdSceset, 'counts')[20,1]/(scale(colSums(get_exprs(cdSceset, 'counts')), center=TRUE, scale=FALSE)[1] + 1))
Well, let's have a look at an example:
example(newSCESet)
exprs(example_sceset)[1,]
Calculating it manually:
lib.sizes <- colSums(counts(example_sceset))
lib.sizes <- lib.sizes/mean(lib.sizes)
log2(counts(example_sceset)[1,]/lib.sizes+1)
Both of these things give me the same results (running scater 1.4.0 on R 3.4.0).
Thank you very much. I missed the scaling to unity and instead did centring with mean zero. The results are now same in scater_1.4.0.
I have some problem with exprs, somehow, raw count 0 turned into "8.005696", I am wondering how this happened.
Please provide a minimum working example of code.
`data <- newSCESet(countData=count,phenoData=anno)
keep<-rowSums(counts(data)>0)>0
keep_data<-data[keep,]
mt<- c("ENSMUSG00000064341", "ENSMUSG00000064345","ENSMUSG00000064351", "ENSMUSG00000064354", "ENSMUSG00000064356","ENSMUSG00000064357", "ENSMUSG00000064358", "ENSMUSG00000064360", "ENSMUSG00000065947", "ENSMUSG000000643", "ENSMUSG00000064367", "ENSMUSG00000064368", "ENSMUSG00000064370")
keep_data<-calculateQCMetrics(keep_data, feature_controls = list(MT=mt)) counts(keep_data)[1:5,1:5]
exprs(keep_data)[1:5,1:5] `
Then counts output:
AAACCTGGTCCAGTTA.1 AAACCTGTCTGATACG.1 AAACGGGAGCGAGAAA.1 AAACGGGCACCCATTC.1
ENSMUSG00000051951 0 0 0 0 ENSMUSG00000089699 0 0 0 0 ENSMUSG00000025900 0 0 0 0 ENSMUSG00000025902 0 0 0 1 ENSMUSG00000104328 0 0 0 0 AAACGGGGTCTCACCT.1 ENSMUSG00000051951 0 ENSMUSG00000089699 0 ENSMUSG00000025900 0 ENSMUSG00000025902 0 ENSMUSG00000104328 0
exprs output:
AAACCTGGTCCAGTTA.1 AAACCTGTCTGATACG.1 AAACGGGAGCGAGAAA.1 AAACGGGCACCCATTC.1
ENSMUSG00000051951 8.005696 8.005696 8.005696 8.005696 ENSMUSG00000089699 8.005696 8.005696 8.005696 8.005696 ENSMUSG00000025900 8.005696 8.005696 8.005696 8.005696 ENSMUSG00000025902 8.005696 8.005696 8.005696 9.897117 ENSMUSG00000104328 8.005696 8.005696 8.005696 8.005696 AAACGGGGTCTCACCT.1 ENSMUSG00000051951 8.005696 ENSMUSG00000089699 8.005696 ENSMUSG00000025900 8.005696 ENSMUSG00000025902 8.005696 ENSMUSG00000104328 8.005696
The values added by newSCESet
are still log-CPMs at this point, you need to run normalize
to get log-normalized counts. This discrepancy will be fixed in the next version of scater.
I see. Thank you very much.
Hi I still have the problem. I tried to normalize the data using scran follow the tutorial:
qclust<-quickCluster(keep_data)
keep_data<-computeSumFactors(keep_data,clusters=qclust,positive=T)
keep_data <- normalize(keep_data)
Then I got:
exprs(keep_data)[1:5,1:5] AAACGGGAGCGAGAAA.1 AAACGGGGTGCAACTT.1 AAAGCAATCTCTAAGG.1 ENSMUSG00000025902 -2.885387e-06 -2.885387e-06 -2.885387e-06 ENSMUSG00000033845 -2.885387e-06 -2.885387e-06 7.861283e-01 ENSMUSG00000025903 -2.885387e-06 4.505800e-01 7.861283e-01 ENSMUSG00000033813 -2.885387e-06 -2.885387e-06 -2.885387e-06 ENSMUSG00000033793 -2.885387e-06 4.505800e-01 -2.885387e-06 AAAGTAGAGTCCTCCT.1 AAAGTAGAGTTAGGTA.1 ENSMUSG00000025902 7.524052e-01 -2.885387e-06 ENSMUSG00000033845 -2.885387e-06 -2.885387e-06 ENSMUSG00000025903 -2.885387e-06 -2.885387e-06 ENSMUSG00000033813 -2.885387e-06 -2.885387e-06 ENSMUSG00000033793 -2.885387e-06 -2.885387e-06
counts(keep_data)[1:5,1:5] AAACGGGAGCGAGAAA.1 AAACGGGGTGCAACTT.1 AAAGCAATCTCTAAGG.1 ENSMUSG00000025902 0 0 0 ENSMUSG00000033845 0 0 1 ENSMUSG00000025903 0 1 1 ENSMUSG00000033813 0 0 0 ENSMUSG00000033793 0 1 0 AAAGTAGAGTCCTCCT.1 AAAGTAGAGTTAGGTA.1 ENSMUSG00000025902 1 0 ENSMUSG00000033845 0 0 ENSMUSG00000025903 0 0 ENSMUSG00000033813 0 0 ENSMUSG00000033793 0 0
My questions are: 1: Following last question, even the values are log-CPM, how 0 ends up 8.005696? 2; After normalization, why raw reads 0 turned into a negative value? 3: Is log base 2 in this package?
Thank you very much.
Try using the latest version of scran (1.4.5) and scater (1.4.0). And yes, it's log 2.
The value of the log-CPMs is explained by adding a prior count to avoid undefined values after the log-transformation, multiplying by a million, and dividing by the mean library size. It's a long story to explain why, but it's irrelevant because you won't use those values anyway.
This issue was moved to davismcc/scater#18
Hello Davis, When I calculate the
cpm
usinglog20(calculateCPM(cdSceset)+1)
it comes different compared to the defaultexprs
value while creating thesce
object withcdSceset <- newSCESet(countData = cd, phenoData = pheno_data_fil).
The count data is the raw data. Also while I try to normalize the data usingit gives me
NA
for some of the cells. Could you please let me know what is going wrong.Thank you.
Best Regards, Syed