MarioniLab / scran

Clone of the Bioconductor repository for the scran package.
https://bioconductor.org/packages/devel/bioc/html/scran.html

MNN correction generates NaN or Inf values #9

Closed ahy1221 closed 6 years ago

ahy1221 commented 6 years ago

I found that after MNN correction, some genes were assigned NaN or Inf values. Such values impede further MNN correction (in the nested setting). It seems that these NaN or Inf values are generated by genes with zero counts. I am wondering: what is a reasonable way to handle these NaN or Inf values?

LTLA commented 6 years ago

Code?

ahy1221 commented 6 years ago

Sorry about that, this is the code:

# keep genes with average count above 1
rowData(sce)$AVE <- calcAverage(sce)
sce <- sce[rowData(sce)$AVE > 1, ]

# deconvolution size factors, pooling within clusters
clusters <- quickCluster(sce, min.size = 15, method = "igraph", min.mean = 1)
sce <- computeSumFactors(sce, sizes = seq(10, 30, 5),
                         clusters = clusters, sf.out = FALSE, positive = TRUE)
sce <- normalise(sce)

#==== fitting HVGs
var.fit <- trendVar(sce, parametric = TRUE, block = sce$donor, use.spikes = FALSE)
dec.var <- decomposeVar(sce, fit = var.fit)
dec.var <- dec.var[complete.cases(dec.var), ]
dec.var <- dec.var[order(dec.var$bio, decreasing = TRUE), ]
hvg <- rownames(dec.var[dec.var$bio > 0.1 & dec.var$FDR < 0.1, ])

# split the log-counts matrix by donor
logcount.mt.list <- lapply(unique(sce$donor), function(d) {
  logcounts(sce)[, sce$donor == d]
})
names(logcount.mt.list) <- unique(sce$donor)

# MNN correction across donors
mnn.batch.bydonor <- mnnCorrect(logcount.mt.list$D20171109, logcount.mt.list$D20170412,
                                logcount.mt.list$D20170327, logcount.mt.list$D20170322,
                                logcount.mt.list$D20170227, logcount.mt.list$D20170222,
                                k = 20,
                                subset.row = hvg,
                                pc.approx = TRUE,
                                BPPARAM = MulticoreParam(workers = 6))

Then I checked whether there are NaN or Inf values in each cell (screenshot attached).
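A minimal sketch of such a check, assuming the old mnnCorrect() interface from scran of that era, which returned a list whose corrected element held one matrix per batch (the element name is an assumption here):

```r
# Count non-finite (NaN/Inf) entries in each corrected batch matrix,
# assuming 'mnn.batch.bydonor$corrected' is a list of matrices.
bad.per.batch <- sapply(mnn.batch.bydonor$corrected,
                        function(m) sum(!is.finite(m)))
bad.per.batch  # all zeros means no NaN/Inf values survived correction
```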

ahy1221 commented 6 years ago

I think this issue may be more about the data, so just let me know if you need more details.

LTLA commented 6 years ago

Do you have all-zero cells?

ahy1221 commented 6 years ago

There are no all-zero cells in this dataset, because I did QC based on total_features (screenshot attached).

But there are genes expressed in only one cell. I am not sure whether that is the cause (screenshot attached).
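A quick sketch for counting such genes, assuming a SingleCellExperiment named sce with a counts assay:

```r
# Count genes detected (non-zero count) in exactly one cell.
library(SingleCellExperiment)
n.cells.detected <- rowSums(counts(sce) > 0)
sum(n.cells.detected == 1)
```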

LTLA commented 6 years ago
  1. Check that none of the logcount.mt.list$D20171109, etc. entries have NA values.
  2. Check that colSums(logcount.mt.list$D20171109[hvg,]) are non-zero.

If those two things are fine, then I don't know what the problem is. I'll have to actually look at the data.
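A sketch of those two checks, assuming the logcount.mt.list and hvg objects defined in the code above:

```r
# Check 1: no NA (or Inf) values in any per-donor log-count matrix.
# Check 2: every cell has a non-zero column sum over the HVG subset.
for (d in names(logcount.mt.list)) {
    mat <- logcount.mt.list[[d]]
    stopifnot(all(is.finite(mat)))              # catches NA, NaN and Inf
    stopifnot(all(colSums(mat[hvg, ]) > 0))     # no all-zero cells on HVGs
}
```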

ahy1221 commented 6 years ago

Thank you very much. I found that there are many Inf values in the logcounts matrix for some cells. Is there something wrong with the normalization? (screenshot attached)

LTLA commented 6 years ago

I'm not sure how you managed to get infinite but non-NA values from normalize. Even if you had negative size factors, you should have gotten NAs. The only way to get infinite normalized values would be to have size factors of exactly zero, which should not be possible.

Also, I don't recommend going below sizes of 20. And I don't think that the normalization will be stable if you have fewer than 100 cells per level of clusters.
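A sketch of a diagnostic for the size-factor pathologies described above, assuming the sce object after computeSumFactors():

```r
# Inspect the deconvolution size factors: exact zeros yield Inf
# log-normalized values, negative values yield NAs.
sf <- sizeFactors(sce)
summary(sf)
any(sf == 0)   # TRUE would explain Inf values after normalization
any(sf < 0)    # TRUE would explain NA values after normalization
```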

ahy1221 commented 6 years ago

Thanks for your reply. You are right that the issue comes from size factors of zero. I redid the normalization procedure and it seems to work fine now. Regarding the sizes for deconvolution, if I use the default setting, there are warnings saying In .computeSumFactors(assay(x, i = assay.type), subset.row = subset.row, : not enough cells in at least one cluster for some 'sizes'. Would that hurt?

LTLA commented 6 years ago

The warnings are trying to protect you by telling you that you don't have enough cells. Setting the sizes to avoid the warnings doesn't really fix the problem. The alternative is to not pre-cluster, and use all cells for pooling in computeSumFactors; this will improve precision at the cost of increasing bias.

For an arbitrary data set, I can't say whether this is better or worse than your current approach. It's not a problem I've really encountered. For small data sets, the cells are usually very homogeneous, so I didn't need to pre-cluster, and for large data sets, I've got enough cells such that clustering can be done. If you have the worst-case scenario of few cells from a heterogeneous population, you can imagine that there is very little information that can be meaningfully shared between cells.
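The no-preclustering alternative mentioned above can be sketched as follows (assuming the same sce object; this simply drops the clusters argument so pools are drawn from all cells):

```r
# Pool across all cells rather than within clusters: more cells per
# pool improves precision, at the cost of bias if the population is
# heterogeneous.
sce <- computeSumFactors(sce)   # default 'sizes', no 'clusters'
sce <- normalise(sce)
```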

ahy1221 commented 6 years ago

Thank you very much! The Inf and NaN values were both caused by low-quality cells. After redoing QC to filter out those cells, everything looks fine now. We can close this issue.
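One way to sketch such a QC filter with scater's isOutlier(), assuming per-cell metrics have already been computed (the column names total_counts and total_features are from the scater of that era and may differ in current versions):

```r
# Flag cells whose library size or detected-gene count is more than
# 3 MADs below the median (on the log scale), then drop them.
library(scater)
low.lib  <- isOutlier(sce$total_counts,   nmads = 3, type = "lower", log = TRUE)
low.gene <- isOutlier(sce$total_features, nmads = 3, type = "lower", log = TRUE)
sce <- sce[, !(low.lib | low.gene)]
```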