Closed: lcolladotor closed this issue 3 years ago.
If you've gotten to the point of debugging the function, it should be pretty clear that your zero rescaling factor is being caused by the presence of >50% zero counts in `cur.prof`, which in turn must be caused by the low counts for cluster 89. It is very unusual for the rescaling factor to be zero. In fact, it is so unusual that I have never seen that error triggered on real data. It means that more than 50% of the average expression values in a cluster are zero, even after filtering on `min.mean`!
The typical solution is to filter out the problematic cells. The 493 figure, by itself, is not necessarily relevant, as `quickPerCellQC()` uses other metrics; you may observe that the true reason for filtering is the number of features or the mitochondrial proportions. And 493 is hardly "high"; that's already a rather relaxed QC threshold in 10X scRNA-seq contexts. Just because those spots are biologically relevant doesn't necessarily mean that the data is any good for those layers.
If you must retain these spots, I'm not sure any "sensible" normalization to remove composition bias can be done here. The whole rationale for removing composition bias is based on the concept of most genes being non-DE, but when most genes have zero counts... there's not really all that much that can be done. I guess I could fall back to using `mean(cur.prof)/mean(ref.prof)`, which represents the ratio of effective library sizes, but this may or may not be sensible.
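The failure mode Aaron describes can be illustrated with a small base-R simulation (made-up numbers, not this issue's data). The inter-cluster rescaling factor is, roughly speaking, a median of per-gene ratios between the cluster's averaged profile and the reference profile, so once more than half the entries of `cur.prof` are zero the median collapses to zero, while the library-size ratio fallback stays positive:

```r
# Simulated averaged profiles (hypothetical; not the data from this issue).
set.seed(1)
ref.prof <- rexp(1000)        # healthy reference pseudo-bulk profile, all positive
cur.prof <- rexp(1000) / 50   # low-depth cluster...
cur.prof[1:600] <- 0          # ...with 60% of its averaged counts at zero

# A median of per-gene ratios (the spirit of the inter-cluster rescaling factor)
# collapses to zero as soon as >50% of the numerators are zero.
rescale.sf <- median(cur.prof / ref.prof)
rescale.sf
#> [1] 0

# The fallback described above: the ratio of effective library sizes.
fallback.sf <- mean(cur.prof) / mean(ref.prof)
fallback.sf > 0
#> [1] TRUE
```

This is only a sketch of the logic; the actual computation in scuttle also filters genes on `min.mean` before taking the median.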
Hi Aaron,
Thanks a lot for your reply! We are still parsing the information you provided us and I wanted to update you on our efforts. Overall, we'll try to understand more and discuss with our colleagues the implications of using `mean(cur.prof)/mean(ref.prof)` as a fallback for cluster 89 in situations like ours.
We also want to:

- plot the spots that `quickPerCellQC()` suggested discarding (as well as their `colSums(counts)`), so we'll work on these plots too.
- check the `spaceranger` summary metrics, such as the "average number of reads per spot", to see whether we need to generate more sequencing data for our new samples compared to our initial data from https://github.com/LieberInstitute/HumanPilot/blob/master/Analysis/visium_dlpfc_pilot_sample_metrics.tsv#L3.

Today I tried explaining to Abby and Brenda chapter 7.3 from the OSCA book http://bioconductor.org/books/release/OSCA/normalization.html#normalization-by-deconvolution. It might be good to check in more detail http://bioconductor.org/books/release/OSCA/normalization.html#ref-lun2016pooling and http://bioconductor.org/books/release/OSCA/normalization.html#ref-robinson2010scaling.
Best, Leo
scuttle 1.1.15 uses the fallback I described above, so you'll at least get something back for cluster 89.
Hi Aaron,
Thank you for implementing the changes in https://github.com/LTLA/scuttle/commit/0ed602b33c839b9cad770d5871b7436ebe993caf. With this new version we get a rescaling value for this problematic cluster. Unlike what we were seeing before (see #8), we can now see that the quick cluster that triggers this warning (formerly an error) is actually bad on metrics such as a low number of features and counts. So it makes complete sense now. Before, we could not understand why a seemingly OK quick cluster would trigger the warning / error.
We've also inspected the `spaceranger` metrics: compared to the other samples we have, the sample containing this problematic cluster has half the mean reads per spot of the next lowest one, 25k vs 50k (50k is the value recommended by 10x Genomics). That is, generating more sequencing data will likely resolve this. If not, we'll have to either discard these spots (and see if we can find a better threshold that removes as few spots as possible) or check whether we can get fewer quick clusters (we have 95) so there is less sparsity after pseudo-bulking than currently observed (maybe changing the `block` argument in `quickCluster()` will help).
That is, we'll attempt to solve the problem with new data instead of any statistics / code.
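The intuition behind asking for fewer, larger quick clusters can be checked with a quick simulation (hypothetical Poisson counts, not our samples): averaging more spots per cluster lowers the fraction of zero entries in the pseudo-bulk profile, pulling it back out of the >50%-zeros regime.

```r
# Simulated sparse spot-level counts (hypothetical; not data from this issue).
set.seed(1)
counts <- matrix(rpois(3000 * 200, lambda = 0.03), nrow = 3000)  # 3000 genes x 200 spots

prof.small <- rowMeans(counts[, 1:20])  # pseudo-bulk from a small 20-spot cluster
prof.large <- rowMeans(counts)          # pseudo-bulk from all 200 spots pooled

mean(prof.small == 0)  # above 0.5: the regime that zeroes the rescaling factor
mean(prof.large == 0)  # far lower: pooling more spots removes the sparsity
```

With these parameters the expected zero fraction per gene is exp(-0.6) ≈ 0.55 for the 20-spot cluster versus exp(-6) ≈ 0.002 for the 200-spot pool, so merging clusters is a plausible way to avoid the zero factor (at the cost of coarser normalization).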
Thanks for everything Aaron!
Best, Leo
rescaling.factors <- scuttle:::.rescale_clusters(clust.profile, ref.col=ref.clust, min.mean=min.mean)
# Warning message:
# In scuttle:::.rescale_clusters(clust.profile, ref.col = ref.clust, :
# inter-cluster rescaling factor for cluster 64 is not strictly positive,
# reverting to the ratio of average library sizes
As noted in #8, internal cluster 64 is actually the original cluster 85. (Previously, when we filtered to spots with at least 11 counts, the problematic internal cluster was 89; that internal cluster 89 likely also corresponded to the original cluster 85, since the original cluster 89 has decent quickPerCellQC metrics.)
## Original cluster 89
# scran_low_lib_size scran_low_n_features scran_high_subsets_Mito_percent
# TRUE : 62 TRUE : 86 TRUE : 2
# FALSE:210 FALSE:186 FALSE:270
# scran_discard
# TRUE : 87
# FALSE:185
## Original cluster 64
# scran_low_lib_size scran_low_n_features scran_high_subsets_Mito_percent
# TRUE : 0 TRUE : 0 TRUE : 56
# FALSE:542 FALSE:542 FALSE:486
# scran_discard
# TRUE : 56
# FALSE:486
## Original cluster 85 (the bad one!) which gets assigned cluster 64 internally
## OK! Finally this cluster does look bad! I mean, it was weird the previous clusters didn't seem bad before!
# scran_low_lib_size scran_low_n_features scran_high_subsets_Mito_percent scran_discard
# TRUE :159 TRUE :159 TRUE : 55 TRUE :159
# FALSE: 0 FALSE: 0 FALSE:104 FALSE: 0
For internal cluster 64 (original cluster 85) we end up with the following rescaling value.
# > rescale.sf
# [1] 0
## Since it's 0, that triggers the warning (formerly error)
## https://github.com/LTLA/scuttle/blob/3cb28efbf237ecb9b7a1973ab7e8957371260a32/R/pooledSizeFactors.R#L412-L442
## See https://github.com/LTLA/scuttle/commit/0ed602b33c839b9cad770d5871b7436ebe993caf
## for changes
if (!is.finite(rescale.sf) || rescale.sf <= 0) {
warning(paste(strwrap(paste0("inter-cluster rescaling factor for cluster ", clust,
" is not strictly positive, reverting to the ratio of average library sizes")), collapse="\n"))
rescale.sf <- sum(cur.prof)/sum(ref.prof)
# > rescale.sf
# [1] 0.003733694
# > sum(cur.prof)
# [1] 13.62893
# > sum(ref.prof)
# [1] 3650.254
}
# > summary(cur.prof)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.000000 0.000000 0.000000 0.004306 0.004082 0.552340
# > summary(ref.prof)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.0000 0.4312 0.6260 1.1533 1.1144 57.0261
# > length(cur.prof)
# [1] 3165
# > length(ref.prof)
# [1] 3165
# > addmargins(table(cur.prof == 0))
#
# FALSE TRUE Sum
# 1431 1734 3165
# > addmargins(table(ref.prof == 0))
#
# FALSE TRUE Sum
# 3159 6 3165
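Tying the two `addmargins()` tables back to Aaron's explanation, with the numbers copied from the output above: the zero fraction of `cur.prof` is just over one half, which is exactly the condition that zeroes the median-based factor, while `ref.prof` is almost entirely non-zero.

```r
# Zero fractions from the tables above (internal cluster 64 vs the reference).
n.genes <- 3165
1734 / n.genes  # fraction of zeros in cur.prof: ~0.548, i.e. just over 0.5
6 / n.genes     # fraction of zeros in ref.prof: ~0.002
```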
save(all.norm, clusters, cur.prof, ref.prof, min.mean, file = here::here("rdata", "inspect_scuttle_issue_7.Rdata"))
Here's the `inspect_scuttle_issue7.Rdata` file (22 Mb), where `clusters` are the internal clusters (so look for 64 in this case, not 85).
R session info
Hi Aaron,

We (@bpardo99 & @abspangler13) have an SCE object from Visium data that has some low `colSums(counts)` that we are trying to process with code similar to https://github.com/LieberInstitute/HumanPilot/blob/master/Analysis/sce_scran.R#L82-L96, which was used for generating http://spatial.libd.org/spatialLIBD/.

Given that `quickCluster()` takes a bit to run, we initially ran it with the spots (equivalent to cells; aka the columns of the SCE) that had `colSums(counts) > 0`, to avoid the error prompted by https://github.com/LTLA/scuttle/blob/b1cb2949bab45ec698ab426b464cd8f6306f0ef1/R/pooledSizeFactors.R#L321. However, some spots have `colSums(counts) <= 10`, for example, so maybe we need a stricter filter.

In any case, when running `computeSumFactors()` we are running into https://github.com/LTLA/scuttle/blob/b1cb2949bab45ec698ab426b464cd8f6306f0ef1/R/pooledSizeFactors.R#L435. Here's the basic structure of our code.
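The code block attached to the issue is not reproduced here; as a placeholder, this is a minimal, hypothetical sketch of that kind of pipeline (function names come from scran/scuttle as discussed in this thread, but `sce`, `run_deconvolution_norm`, and the `min.count` threshold are illustrative assumptions; the real script is at the HumanPilot link above):

```r
# Hypothetical sketch of a deconvolution-normalization pipeline (OSCA ch. 7.3);
# NOT the actual code from this issue. Assumes `sce` is a SingleCellExperiment
# with a "counts" assay and that scran/scuttle are installed.
run_deconvolution_norm <- function(sce, min.count = 10) {
  # Drop spots with very low total counts before clustering.
  keep <- Matrix::colSums(SummarizedExperiment::assay(sce, "counts")) > min.count
  sce <- sce[, keep]
  clusters <- scran::quickCluster(sce)                       # quick clusters for pooling
  sce <- scran::computeSumFactors(sce, clusters = clusters)  # pooled size factors
  scuttle::logNormCounts(sce)                                # log-normalized counts
}
```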
We then tried subsetting the data to `colSums(counts) > 10` and `> 100` without re-running `quickCluster()` and got the same error (we'll try re-running `quickCluster()` after filtering for `colSums(counts) > 10`, but it might not work either; aka, the quick cluster assignments might remain the same ones).

With `> 10`, and using `options(error = recover)` from https://rstats.wtf/debugging-r-code.html#debugging-others-code, we saw that we have 95 clusters and that only cluster 89 (which has 272 spots) has `rescaling == 0`, as checked in https://github.com/LTLA/scuttle/blob/b1cb2949bab45ec698ab426b464cd8f6306f0ef1/R/pooledSizeFactors.R#L434. At that point we have `min.mean = 0.1` and 4 entries of `ref.prof` equal to 0. I'm struggling to see the relationship between the low counts and the `rescaling = 0` for quick cluster 89.

Note that everything does work if we use `quickPerCellQC()` to find spots (cells) to discard, similar to https://github.com/LieberInstitute/HumanPilot/blob/master/Analysis/sce_scran.R#L32-L33 and https://github.com/LieberInstitute/HumanPilot/blob/master/Analysis/sce_scran.R#L61. However, that leads to dropping spots (cells) that are biologically relevant (layer-specific in the DLPFC). Some of these "discard" spots end up having a `colSums(counts)` as high as 493.

So, do you have some advice on what the minimum `colSums(counts)` should be?

Best, Leo