Closed: MichaelPeibo closed this issue 6 years ago.
It is usually good form to post a minimum example of the code that is causing issues for you.
I assume you are reading Section 5.1 of:
https://bioconductor.org/help/workflows/simpleSingleCell/part3/
I also assume you are sorting by the biological component, not the p-value. And I assume that you are running this on the log-normalized counts, after dividing the counts by the size factors.
Hi, Aaron. Here is an example of my code:

```r
# normalization
ave.counts <- calcAverage(sce.test)
rowData(sce.test)$ave.count <- ave.counts
to.keep <- ave.counts > 0
sce.test <- sce.test[to.keep,]
summary(to.keep)

high.ave <- rowData(sce.test)$ave.count >= 0.1
clusters <- quickCluster(sce.test, subset.row=high.ave, method="igraph")
sce.test <- computeSumFactors(sce.test, cluster=clusters, subset.row=high.ave, min.mean=NULL)
sce.test <- normalize(sce.test)

var.fit <- trendVar(sce.test, use.spikes=FALSE, span=0.2)
var.out <- decomposeVar(sce.test, var.fit)

design <- model.matrix(~ G1 + G2M, assignment.test$score)
fit.block <- trendVar(sce.test, design=design, parametric=TRUE, use.spikes=FALSE)
var.block <- decomposeVar(sce.test, fit.block)

hvg.out <- var.block[which(var.block$FDR <= 0.05),]
hvg.out <- hvg.out[order(hvg.out$bio, decreasing=TRUE),]
```

After running this, I get house-keeping genes as the top HVGs. Any suggestions?
The code looks fine, assuming that you only have one experimental batch. One possibility is that the trend hasn't been fitted properly, especially at high abundances where there are few points. In such cases, you can try tweaking some of the fitting parameters. (The trendVar function in the BioC-devel branch of scran provides more flexibility with regard to the possible parameters.)
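For instance, the loess span can be widened or a parametric curve used instead. A minimal sketch, assuming `sce.test` is the normalized SingleCellExperiment from the code above (argument availability may vary between scran versions):

```r
# Sketch of tweaking the trend fit at high abundances:
fit.wide <- trendVar(sce.test, use.spikes=FALSE, span=0.4)        # wider loess span
fit.para <- trendVar(sce.test, use.spikes=FALSE, parametric=TRUE) # parametric curve
var.out <- decomposeVar(sce.test, fit.wide)
```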
The other possibility is that the counts are so low, only the highly expressed genes have any chance of obtaining large positive biological components. For lowly expressed things like TFs, there's not much opportunity for defining highly variable genes when you can only choose between 0 and 1.
If I were you, I would proceed with further analysis, i.e., dimensionality reduction and clustering. Indeed, it seems like you are assuming that the "house-keeping" genes are not interesting. I would take a step back and see what the data says. You haven't said what cells you're working on, but Acta2 and Tpm1 look like muscle proteins to me, and could be very interesting in certain contexts.
Hi, Aaron, thanks for the patient explanation. (It is quite amazing that you can 'assume' my results exactly.)
1) My cells are neurons and differentiating neurons. I continued with the downstream analysis, but the results are not what I want compared to the clustering results from Seurat; many uninteresting genes, such as phosphoproteins, show up as marker genes. I assume that the bad clustering results are highly related to the HVGs that were defined, since the HVG definition method is different from Seurat's.
2) I tried tuning the trendVar function, because I assume that the HVGs used for downstream analysis are produced here. I tried setting min.mean=0.01 or 0.001, and the shape of the fitted curve and the number of HVGs are quite similar to min.mean=0.1; I also tuned the span between 0.2 and 0.4, and even though I do not know what it does, the results are similar. One example is shown below:
technical variance.womtrp.0.001.span0.4.pdf A typical t-SNE plot is shown here; I think the clustering is not good:
3) One thing to note is that after running denoisePCA, I got very few PCs:

```r
sce.test.norm <- denoisePCA(sce.test.norm, technical=fit.block$trend, design=design)
ncol(reducedDim(sce.test.norm, "PCA"))
## [1] 5
```

This is quite a bit fewer than the number of PCs in the workflow.
Your trend fit looks fine, but I don't know why the top HVGs are all "uninteresting" genes in your case. I have analyzed a few neuronal data sets myself, and I have usually gotten reasonable HVG sets. See, for example, the analysis of the 10X 1.3 million brain cell data set (https://doi.org/10.1101/167445), where the top HVGs are very much neuron-related genes. And obviously the Zeisel data in the workflow.
Indeed, your trend fit is quite unambiguous regarding the identity of the most highly variable genes. The top handful of points on the plot with variances greater than 4 are clearly the most heterogeneous features in this data. Are you saying that they are all uninteresting? The only advice I can give is to check whether you have modified the order of your rows at some point.
Regarding the number of PCs: this is a consequence of assuming that most genes are driven by technical noise. I do not think this assumption is true for UMI data; see how the technical trend is always below the variances in Figure 6 of https://bioconductor.org/help/workflows/simpleSingleCell/part2. If you are using the BioC-devel branch, you can try the makeTechTrend function, which creates a trend based on Poisson noise. Alternatively, you can just set the minimum rank to something like 20 or 100 for the time being.
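Both suggestions might look something like this; a sketch assuming the BioC-devel version of scran and the objects defined earlier (argument names such as min.rank may differ across versions):

```r
# Replace the fitted trend with a Poisson-based one (BioC-devel scran):
poisson.trend <- makeTechTrend(x=sce.test)
fit.block$trend <- poisson.trend

# ...or enforce a minimum number of retained PCs via min.rank:
sce.test <- denoisePCA(sce.test, technical=fit.block$trend, min.rank=20)
ncol(reducedDim(sce.test, "PCA"))
```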
P.S. If the neurons are not dividing, I would not bother regressing out the cell cycle phase; this just complicates matters unnecessarily. In fact, I would not regress out cell cycle phase even if the neurons were dividing, as the cell cycle is often correlated with biological processes of interest.
Hi, Aaron. I rechecked my code against the workflow, and it is logically the same. I will take the result as it is.
One theoretical question about plotTSNE: I see that one parameter to tune is the perplexity. I tried tuning it on my dataset, and the results look like this: SNNcluster.cellcyclblock.per100.pdf SNNcluster.cellcyclblock.per50.pdf SNNcluster.cellcyclblock.per20.pdf SNNcluster.cellcyclblock.per10.pdf SNNcluster.cellcyclblock.per5.pdf SNNcluster.cellcyclblock.per700.pdf SNNcluster.cellcyclblock.per80.pdf
Any suggestions for picking the 'best' t-SNE plot (presumably one that is clearly separated and aesthetically pleasing), and how should I determine a proper value for the perplexity (or rand_seed)?
BTW, the cell number is around 3700.
Thanks!
These plots look much more interesting than the one you showed before. What did you change?
As I said before, the highly variable genes in your plot are pretty clear. I would keep an open mind regarding how "interesting" they are, rather than dismissing them out of hand.
Anyway, an excellent resource on understanding the t-SNE parameters is:
https://distill.pub/2016/misread-tsne/
... where the perplexity always matters. I don't think I have a good suggestion for the best perplexity; van der Maaten says that it is comparable to the number of nearest neighbours, but I'm not sure whether this would help you pick a good value. If you are worried, I would just repeat it with different perplexities and check that the conclusions are robust to the choice, which seems to be the case for your data set.
rand_seed is just any number used to set the random seed. It can be anything; the exact value doesn't matter, as t-SNE is a stochastic technique anyway, so one choice is as good as another.
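One way to check robustness is to simply loop over a few perplexities with a fixed seed. A sketch, assuming the scater plotTSNE interface of the time and a hypothetical `cluster` column in the colData:

```r
# Repeat t-SNE at several perplexities and compare the cluster
# structure by eye; the seed is fixed so runs are reproducible.
for (p in c(5, 20, 50, 100)) {
    print(plotTSNE(sce.test, perplexity=p, rand_seed=100,
                   colour_by="cluster"))
}
```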
Hi, Aaron
First, I removed the control genes (mitochondrial and ribosomal protein genes) before normalization, because I saw many Rp genes appearing in my heatmap gene set, and I was inspired to remove them by another package, ascend, and by the scater workflow paper;
Also, I set min.mean=1e-5 for computeSumFactors; this is also recommended by ascend's authors for Drop-seq data;
And I blocked on the cell cycle, which produced 3 more clusters than not blocking;
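For reference, the removal step described above could be sketched as follows; the gene symbol patterns are assumptions (mouse-style symbols) and are not from the original post:

```r
# Hypothetical sketch: drop mitochondrial and ribosomal protein genes
# before normalization, matching by gene symbol.
is.mito <- grepl("^mt-", rownames(sce.test), ignore.case=TRUE)
is.ribo <- grepl("^Rp[sl]", rownames(sce.test), ignore.case=TRUE)
sce.test <- sce.test[!(is.mito | is.ribo),]
```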
I am still exploring my data, because I am quite new to this and am trying to understand the theory behind the analysis. I will update my experiences here (probably mostly questions :) ). It seems that some of the trouble in Drop-seq data analysis is quite different from that of other data types.
So are you saying that the removal of mitochondrial/ribosomal protein genes yielded better t-SNE plots? This is... possible, though I find that such genes are not highly variable in most cases. Or if they are, that may be indicative of some other technical problems, e.g., a large subpopulation of damaged cells driving high variability in the mitochondrial counts and/or ribosomal protein counts.
In any case, the above points do not resolve your previous problem about the top HVGs being "uninteresting". I can only assume that the top HVGs were, in fact, somewhat interesting, otherwise it would be difficult to be interested in any clustering or t-SNE based on the HVGs.
For computeSumFactors, I would really set min.mean higher, probably to 0.1. I don't know how the ascend authors settled on 1e-5, but there's a reason why I recommend 0.1 for UMI data in the workflow. Below this, the pools of cells may not contain sufficient counts, which reduces both the accuracy and precision of the size factor estimates. (Specifically, the Central Limit Theorem does not kick in, and the pooled counts will not look normal, which is necessary for the median-based estimator to be unbiased. Also, lower counts have higher variance in the ratios, which reduces precision.) In the worst case, you will see the dreaded "Negative size factor estimates" warning, though inaccuracy arises well before then.
From anecdotes, Drop-seq seems to be more challenging than 10X in terms of analysis, apparently due to its lower coverage. However, I have never analyzed it personally, so I cannot say much.
In any case, I am closing this issue. If you have further questions, either raise a new issue or create a post on the Bioconductor support site; the latter is preferred.
Thanks Aaron, I will explore more by tuning the parameters.
Hi, really powerful package! I am currently trying the package on a data set from 10X, which lacks spike-ins. So I am confused about how I should use the normalization and trendVar functions on my data. (Actually, I tried setting use.spikes=FALSE, assuming that technical noise is the major contributor to the variance of most genes in the data set; however, I got house-keeping genes such as ACTA2 and TPM1 as the top HVGs.) Best.