Closed learning-MD closed 1 year ago
I (and I think many others) would also be very interested in those answers, especially concerning the correct input.
An official response e.g. from @amissarova would be greatly appreciated.
Based on answers from MikeDMorgan in miloR, I think using the harmony/integrated assay from Seurat would be the counterpart to pca.corrected that the milo team uses in assign_neighbourhoods. Correct?
I think it is confusing that miloDE uses different functions than miloR. So e.g. is miloDE::assign_neighbourhoods the same as miloR::buildGraph and miloR::makeNhoods?
One pitfall is that when you use the integrated (so batch-corrected) input from Seurat with as.SingleCellExperiment(seu, assay = "integrated") you only have a logcounts assay, but not a counts assay. This works fine for miloR, but does not work for miloDE, e.g. for assign_neighbourhoods.
Hey @learning-MD and @mihem, thanks for your interest in the package. I'll try to answer both lines of questions in one thread, pls let me know if something is unclear or if you have follow-up questions.
I hope this answer clarifies a bit how I recommend people to think about embedding choices when they apply miloDE.
I want to emphasize, though, that in the presence of batch effects (which is almost always the case), Augur unfortunately cannot distinguish batch effects from potential DE. After running this rather time-consuming part of the algorithm, you might end up with very little gain (i.e. the vast majority of neighbourhoods will be selected for downstream analysis, so the gain in terms of multiple testing correction is negligible). But, to be fair, our Augur-based neighbourhood selection also flags another class of neighbourhoods as inappropriate for DE testing: neighbourhoods with very few cells in at least one condition, which can well happen, so it might still be useful.
To put this confusing paragraph into a more concrete pipeline, I suggest the following approach to neighbourhood selection. Before running DE testing itself, assess your data in 2 ways: 1 - do you anticipate batch effects in your data or not? 2 - do you observe a lot of highly DA neighbourhoods (say, with fewer than 5-10 cells in at least one condition)?
If you do not anticipate much batch effect - consider running neighbourhood selection: miloDE::calc_AUC_per_neighbourhood. You can use a standard cut-off of AUC of 0.5 to select neighbourhoods for follow-up DE testing.
If you anticipate batch effect but you see that you have quite a lot of neighbourhoods with very few cells in at least one condition - consider running a custom script that quickly scans those neighbourhoods and discards them from the DE testing (it will be much quicker than building Augur classifiers). Atm it is not supported in the main branch, but I am considering adding it later. Let me know if that's of interest; I can help draft a script to do just that.
Finally, if you anticipate batch effect and there are not many imbalanced neighbourhoods - I think you can safely proceed to DE testing without any neighbourhood selection.
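The quick scan mentioned above (discarding neighbourhoods with very few cells in at least one condition) could look roughly like the sketch below. It is not part of miloDE: the function names count_cells_per_condition and keep_balanced_nhoods, the toy membership matrix and the min_cells threshold are all hypothetical. With a real Milo object, the membership matrix would presumably be nhoods(sce_milo) (cells in rows, neighbourhoods in columns) and the condition vector would come from the cell metadata.

```r
# Toy cell-by-neighbourhood membership matrix (1 = cell belongs to the neighbourhood).
# Assumption: in a real analysis this would be nhoods(sce_milo).
nhood_membership <- matrix(
  c(1, 1, 1, 1, 1, 1, 0, 0,   # nhood1: cells 1-6
    1, 1, 1, 1, 1, 0, 0, 0,   # nhood2: cells 1-5
    0, 0, 0, 1, 1, 1, 1, 1),  # nhood3: cells 4-8
  nrow = 8,
  dimnames = list(paste0("cell", 1:8), paste0("nhood", 1:3))
)
# Condition label for each cell (same order as the rows above).
condition <- c(rep("ctrl", 5), rep("disease", 3))

# Hypothetical helper: number of cells per neighbourhood (rows) and condition (columns).
count_cells_per_condition <- function(membership, condition) {
  idx_by_condition <- split(seq_along(condition), condition)
  do.call(cbind, lapply(idx_by_condition, function(idx) {
    colSums(membership[idx, , drop = FALSE])
  }))
}

# Hypothetical helper: keep neighbourhoods where every condition has >= min_cells cells.
keep_balanced_nhoods <- function(membership, condition, min_cells = 5) {
  counts <- count_cells_per_condition(membership, condition)
  rownames(counts)[apply(counts, 1, min) >= min_cells]
}

cells_per_condition <- count_cells_per_condition(nhood_membership, condition)
balanced_nhoods <- keep_balanced_nhoods(nhood_membership, condition, min_cells = 2)
print(cells_per_condition)
print(balanced_nhoods)
```

The returned names (or their column indices) identify the neighbourhoods to carry forward to DE testing; the threshold itself is a judgment call.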
If in your case the difference between ‘bulk’ and ‘miloDE’ seems overwhelming, I would recommend manually inspecting miloDE results to gain more confidence in them.
A few suggestions on what you can do:
You say you detect 2 modules that agree with expectations based on animal studies. Plot a few genes from the modules individually (you can use miloDE::plot_DE_single_gene as well as simply plot UMAPs colored by counts and faceted by conditions). Does this look like what you would expect based on the ‘module signature’?
In the module signature plots - do neighbourhoods with high DE seem to be close to each other or are they rather sporadic? If they seem too sporadic - this is a sign of false-positive (FP) detection, but if they seem rather coherent and close together - then it is something interesting to investigate further (i.e. potentially relevant).
If the neighbourhoods cluster together and, say, they are annotated within the same CT - what distinguishes them from the rest of the neighbourhoods of this CT? One way would be to select only reference data and perform DE detection between neighbourhoods (the ones that are flagged as very DE in miloDE VS the rest). Do you see some plausible biological markers that separate the two groups of neighbourhoods? I believe that for this task you can use miloR::findNhoodMarkers, which is tailored to do something akin to what I'm suggesting.
Finally, using these markers - can you further subcluster your cell type? If yes, do you now observe DE in the sub-cell type where you expect to see it? If yes - this is a good sign that miloDE happened to be sensitive enough to pick up the local sub-cell type difference that bulk DE missed.
assign_neighbourhoods is a wrapper for these 2 miloR functions, but with some additional tweaks/options:
We allow (we actually encourage) the 2nd-order kNN graph assignment (the standard kNN is also possible; for that, you need to select order = 1).
We provide a neighbourhood refinement step. In principle, you can use the original Milo functions, but in this case you need to force your k to be rather big, and you might end up with too many overlapping, sort of redundant, neighbourhoods. So I would encourage you to use miloDE’s assignment functions.
Thanks @amissarova - that is an incredibly thorough response! I think Harmony should be okay to use, but I'll test out a couple of other strategies as well - mainly, I may just focus on 1) integrating together the control samples first followed by reference-mapping of the disease onto the controls and 2) not integrating samples in the first place (since, by visualization of the unintegrated samples, I cannot really make out a batch effect - samples were hashtagged, so it goes a long ways in minimizing batch effects). The latter is the reason why I think using the Harmony embeddings as I did is okay.
Trying to plot gene modules as you suggested, one issue I've run into with plot_DE_single_gene and plot_DE_gene_set is that I keep running into the following error that I'm having difficulty troubleshooting:
Error in .check_nhood_stat(nhood_stat, x) :
'nhood_stat$Nhood' should be within c(1:ncol(nhoods(x))).
Any suggestions there? Thanks.
hey, @learning-MD
for plotting error - can you open a new issue pls?
But judging by the error message, it looks like there is a mismatch between the neighbourhood ids from nhood_stat and nhoods(x). Did you ensure that nhood_stat is calculated from the same milo object you are using for plot_DE_single_gene?
Also, when you open an issue - can you print out (assuming your milo object is called sce_milo):
# if you use subset_nhoods, print them out too
setdiff(unique(nhood_stat$Nhood) , c(1:ncol(nhoods(sce_milo))))
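As a toy illustration of what the setdiff check above reports, here is a minimal sketch with stand-in objects. The data frame, the value of n_nhoods and the helper find_unknown_nhoods are hypothetical; in a real session nhood_stat would be the miloDE DE-testing output and n_nhoods would be ncol(nhoods(sce_milo)).

```r
# Toy stand-in for the DE statistics table: one row per gene x neighbourhood.
nhood_stat <- data.frame(
  gene  = c("geneA", "geneA", "geneA", "geneB"),
  Nhood = c(1, 2, 7, 2)
)
# Toy stand-in for ncol(nhoods(sce_milo)).
n_nhoods <- 5

# Hypothetical helper: neighbourhood ids used in the statistics
# that do not exist in the Milo object.
find_unknown_nhoods <- function(nhood_ids, n_nhoods) {
  setdiff(unique(nhood_ids), seq_len(n_nhoods))
}

unknown <- find_unknown_nhoods(nhood_stat$Nhood, n_nhoods)
if (length(unknown) > 0) {
  message("Neighbourhood ids missing from the Milo object: ",
          paste(unknown, collapse = ", "))
}
```

A non-empty result means the statistics were computed on a different (or differently subset) set of neighbourhoods than the object passed to the plotting function.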
Also @learning-MD , I just realised that when you were asking about neighbourhood filtering, you might have meant filtering at the stage of neighbourhood assignment? In this case yeah, it is mainly for computing time (and on rather rare occasions it might help for multiple testing correction issue, but rarely so)
@amissarova Thanks for your fast and very thorough response.
Actually, I am mainly interested in points 1 and 5. So here is a reproducible example from .
Let's downsample the dataset so the pipeline runs faster:
seu <- subset(x = panc8, downsample = 500)
And now just run the rest of the code
pancreas.list <- SplitObject(seu, split.by = "tech")
pancreas.list <- pancreas.list[c("celseq", "celseq2", "fluidigmc1", "smartseq2")]
for (i in seq_along(pancreas.list)) {
  pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
  pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst", nfeatures = 2000,
                                             verbose = FALSE)
}
reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2")]
pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)
DefaultAssay(pancreas.integrated) <- "integrated"
pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30, verbose = FALSE)
p1 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
p2 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) + NoLegend()
p1 + p2
Now we're done with integration and can continue with milo.
Now the important step: Let's use the integrated assay, because that's the batch-corrected assay.
sce_integrated <- as.SingleCellExperiment(pancreas.integrated, assay = "integrated")
Now miloDE:
sce_milo <- assign_neighbourhoods(
  sce_integrated,
  k = 30,
  prop = 0.1,
  d = 30,
  order = 2,
  filtering = TRUE,
  reducedDim_name = "PCA",
  verbose = TRUE
)
But this fails, because there is no counts assay:
Error in fun(arg) :
x should contain 'counts' assay that will be used to calculate DE. If counts are stored in different assay, please move them to slot 'counts'.
I could use the RNA assay instead, then the function runs fine. But then I use raw uncorrected counts, right?
sce_rna <- as.SingleCellExperiment(pancreas.integrated, assay = "RNA")
sce_milo <- assign_neighbourhoods(
  sce_rna,
  k = 30,
  prop = 0.1,
  d = 30,
  order = 2,
  filtering = TRUE,
  reducedDim_name = "PCA",
  verbose = TRUE
)
I find it confusing that the miloR functions work fine with the integrated assay (so without a counts assay).
milo_integrated <- buildGraph(sce_integrated, k = 30, d = 30)
milo_integrated <- makeNhoods(milo_integrated, prop = 0.1, k = 30, d = 30, refined = TRUE)
And even more so that the author of Augur says that it's fine to use raw RNA counts as input.
I think a lot of users would be very happy, if you could provide a vignette starting with a Seurat object. E.g. this is very nicely done in Harmony:
Thank you!
Thanks @amissarova - the plot_DE_gene functions ran fine after I just reloaded the milo object. Regarding some of the suggestions you made about assessing whether the two modules were false positives or true, below are some plots of the cell identities, module DE, and a few individual genes from a module. Of note, this is related to an autoimmune disease, so we expect multiple immune cell types to be involved (based on animal models, for example, monocytes, NK cells, and CD4 T cells have all been implicated, with interferon gamma thought to be playing a key role).
The below were built using the default parameters from the vignette. With that in mind, not sure whether you'd consider the below to be "sporadic" and inconsistent with true positives:
Apologies for these potentially basic questions - it's our first foray into cluster-free DE analyses.
@learning-MD, it is very hard for me to give a proper answer for a system I don't really know, so I can't help you much with it. A few random thoughts:
to something (try 100) - so all the edges are not that overwhelming.
Thanks @amissarova - appreciate the guidance.
@amissarova just a friendly reminder of my last message in case you missed it because this issue is so busy. It was just yesterday, so please don't feel rushed.
Hey @mihem,
In the pipeline, when you run this line:
sce_integrated <- as.SingleCellExperiment(pancreas.integrated, assay = "integrated")
> sce_integrated
class: SingleCellExperiment
dim: 2000 5683
assays(1): logcounts
rownames(2000): COL3A1 COL1A1 ... SHPK CRYBB2P1
rowData names(0):
colnames(5683): D101_5 D101_7 ... HP1526901T2D_N8 HP1526901T2D_A8
colData names(9): orig.ident nCount_RNA ... dataset ident
reducedDimNames(1): PCA
mainExpName: integrated
Here you chose to move the integrated modality, which is not the original counts/logcounts. It is calculated only on the 2k variable genes, and its values correspond to the integrated (or ‘batch-corrected’) expression matrix for all cells. What you really want to keep, though, is the PCA calculated on this integrated matrix: that will be your embedding to be used by miloDE. So if you want to move both counts and logcounts, you need to run this:
sce_integrated <- as.SingleCellExperiment(pancreas.integrated, assay = "RNA")
> sce_integrated
class: SingleCellExperiment
dim: 34363 5683
assays(2): counts logcounts
rownames(34363): A1BG-AS1 A1BG ... ZRSR1 pk
rowData names(0):
colnames(5683): D101_5 D101_7 ... HP1526901T2D_N8 HP1526901T2D_A8
colData names(9): orig.ident nCount_RNA ... dataset ident
reducedDimNames(1): PCA
mainExpName: RNA
So now you have the counts and logcounts both stored in your SCE object.
However, this is actually not the right object to work with yet, because it only contains the integrated reference data; the second part of the vignette shows how to map your query data onto the reference. I assume that for miloDE you are interested in both reference + query, so you now need to project your query data onto the reference and then concatenate. In the vignette from Seurat, they show how to project cell type labels and UMAPs, but we need to project the actual PCs. I made a little script to do this. Please note that I'm really not very proficient with Seurat, so it is very possible there are better ways to do it, but this one should work:
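library(Seurat)
library(SingleCellExperiment)
library(scater)   # calculateUMAP
library(ggplot2)
library(ggpubr)   # ggarrange
seu = panc8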
# split combined object by samples
pancreas.list <- SplitObject(seu, split.by = "tech")
for (i in seq_along(pancreas.list)) {
  pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
  pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst", nfeatures = 2000,
                                             verbose = FALSE)
}
# assign reference and query samples - here I follow the mapping vignette from Seurat
ref_samples = c("celseq", "celseq2", "smartseq2")
query_samples = c("fluidigmc1")
# find anchors and integrate reference samples
reference.list <- pancreas.list[ref_samples]
pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)
# calc PCA for inetgrated data - thats what we actually want ultimately
DefaultAssay(pancreas.integrated) <- "integrated"
pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
# now lets move to SCE reference object, but with the original RNA assay which contains raw counts as well as normalised
sce_reference = as.SingleCellExperiment(pancreas.integrated , assay = "RNA")
# now lets project pca from reference onto query samples and return SCE with calculated PCA
DefaultAssay(pancreas.integrated) = "integrated"
sce_query = lapply(query_samples , function(current.sample){
current.seu_query = pancreas.list[[which(names(pancreas.list) == current.sample)]]
query_anchors <- FindTransferAnchors(reference = pancreas.integrated, query = current.seu_query,
dims = 1:30, reference.reduction = "pca")
current.pca_query <- TransferData(anchorset = query_anchors, refdata = t(Embeddings(pancreas.integrated[['pca']])), dims = 1:30)
current.pca_query = current.pca_query[1:nrow(current.pca_query), 1:ncol(current.pca_query)]
current.seu_query[["pca"]] <- CreateDimReducObject(embeddings = t(as.matrix(current.pca_query)), key = "PC_", assay = DefaultAssay(current.seu_query))
DefaultAssay(current.seu_query) = "RNA"
current.sce_query = as.SingleCellExperiment(current.seu_query , assay = "RNA")
return(current.sce_query)
})
# combine the per-sample query SCEs into one object
sce_query = do.call(cbind , sce_query)
# lets concatenate reference and query
sce = cbind(sce_reference , sce_query)
# lets calculate UMAP
umaps = calculateUMAP(t(reducedDim(sce , "PCA")))
reducedDim(sce , "UMAP") = umaps
# add ref/query ID
sce$type = sapply(1:ncol(sce) , function(i) ifelse(sce$tech[i] %in% ref_samples , "reference" , "query"))
# lets plot and at least visually assess whether integration worked well (im not v good with Seurat plotting, so it is easier for me to use ggplot)
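# assumed completion: bind cell metadata (tech, celltype) to the UMAP coordinates (V1, V2)
umaps = cbind(data.frame(V1 = umaps[, 1], V2 = umaps[, 2]), as.data.frame(colData(sce)))
p1 = ggplot(umaps , aes(x = V1 , y = V2 , col = tech)) +
  geom_point(alpha = .5) +
  theme_bw()
p2 = ggplot(umaps , aes(x = V1 , y = V2 , col = celltype)) +
  geom_point() +
  theme_bw()
p = ggarrange(p1,p2,nrow = 2)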
I will attach the plot in the next comment. Please let me know if smth is unclear. P.S. In this particular dataset, counts are actually floats coz of how reads were aligned. With ~standard 10X sequencing I believe you generally should get integer counts.
Why miloDE needs counts:
miloR tests for differential abundance (DA), so it needs virtually no assays, only the dimensionality reduction together with cell numbers. In turn, miloDE does DE and relies on counts.
Therefore I introduced a check for the correct SCE format, and this check includes checking for the counts slot. In reality, you are right: for the neighbourhood assignment you don't need the counts slot (same as in miloR), but since it is required afterwards, the function assign_neighbourhoods checks the SCE in the same manner as the downstream functions. Hope that makes sense?
"And even more so that the author of Augur says that it's fine to use raw RNA counts as input."
Yeah, I saw that Augur does that for its task (CT ranking). The thing, though, is that I adapted Augur for a slightly different task and compare different datasets with each other; using logcounts, in my opinion, helps to mitigate differences in sequencing coverage between datasets. Otherwise, you are much more likely to get AUC > 0.5.
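A toy illustration of the coverage argument, as a minimal sketch: two cells with the same expression profile but different sequencing depth look different in raw counts and identical after library-size normalisation plus log1p. The log_normalise helper and its scale factor are assumptions for illustration (similar in spirit to Seurat's default LogNormalize); they are not taken from miloDE or Augur.

```r
# Two toy "cells" with the same relative expression, sequenced at different depths.
counts <- cbind(
  shallow = c(geneA = 10, geneB = 30, geneC = 60),
  deep    = c(geneA = 50, geneB = 150, geneC = 300)
)

# Hypothetical helper: scale each cell (column) by its library size, then log1p.
log_normalise <- function(counts, scale_factor = 1e4) {
  size_scaled <- sweep(counts, 2, colSums(counts), "/") * scale_factor
  log1p(size_scaled)
}

logcounts <- log_normalise(counts)
print(counts)
print(logcounts)
```

The raw counts differ fivefold between the two cells, while their logcounts coincide, so a classifier trained on logcounts has less of a pure depth signal to pick up.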
Vignette with Seurat. Well, essentially this vignette will boil down to vignette about the conversion between Seurat and SCE + how to integrate with Seurat, and those vignettes already exist. The whole miloDE pipeline relies on SCE object, and the only thing I will need to do in the vignette is to convert from Seurat. But since there seems to be a request, I will consider adding this vignette later on (but possibly not ASAP, I have some other stuff in backlog atm).
@mihem , just fyi - i updated the comment response (originally I had it wrong just in case you saw the comment earlier) - please let me know if smth is unclear
@amissarova thanks again for your extensive response.
I think we are making it a little bit more complicated than it needs to be. I am working with my own dataset; I just used the pancreas dataset because you asked me to provide a toy example, and this one is already available. But the pancreas example is more complicated than necessary. I don't have a reference dataset, just a few samples that need to be integrated. Therefore, the following Seurat tutorial is probably more appropriate:
# load dataset
# split the dataset into a list of two seurat objects (stim and CTRL)
ifnb.list <- SplitObject(ifnb, split.by = "stim")
# normalize and identify variable features for each dataset independently
ifnb.list <- lapply(X = ifnb.list, FUN = function(x) {
  x <- NormalizeData(x)
  x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})
# select features that are repeatedly variable across datasets for integration
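features <- SelectIntegrationFeatures(object.list = ifnb.list)
# find anchors and integrate; this creates the 'integrated' assay used below
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, anchor.features = features)
immune.combined <- IntegrateData(anchorset = immune.anchors)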
# specify that we will perform downstream analysis on the corrected data note that the
# original unmodified data still resides in the 'RNA' assay
DefaultAssay(immune.combined) <- "integrated"
# Run the standard workflow for visualization and clustering
immune.combined <- ScaleData(immune.combined, verbose = FALSE)
immune.combined <- RunPCA(immune.combined, npcs = 30, verbose = FALSE)
immune.combined <- RunUMAP(immune.combined, reduction = "pca", dims = 1:30)
immune.combined <- FindNeighbors(immune.combined, reduction = "pca", dims = 1:30)
immune.combined <- FindClusters(immune.combined, resolution = 0.5)
I now want to use this immune.combined for miloDE.
So the main question, which I still didn't get: if I use RNA, I have raw counts (assay RNA, slot counts) and normalized counts (assay RNA, slot data), but they are not batch-corrected. Is this the right input for miloDE? I think your tutorial suggests that you need pca.corrected, which would be equivalent to the integrated assay (or harmony assay) in Seurat/Harmony? But then if I use the integrated assay only, raw counts are missing and we only have the 2000 top variable genes, as you correctly say.
In Lemur, the other tool that Rahul Satija advertised on Twitter next to miloDE, the author clearly states that you should not use the harmony/integrated assay, but use the normalized data from Seurat and align_harmony from lemur.
In contrast, in miloR, Mike/Emma say that using any batch-corrected data is fine. So for miloR I just used the integrated assay and everything worked fine. But for miloDE this doesn't work because raw counts are required, which makes sense for the DE analysis.
Sorry, you may have already answered this partially in your answer, but I couldn't follow your script from "# now lets project pca from reference onto query samples and return SCE with calculated PCA" on. And again, there's no need for a reference dataset. Just the question of which assays are needed and how to correctly convert a Seurat object with RNA and integrated assays to an SCE object for the miloDE analysis.
Thank you !
@mihem, In this case, you just need to:
Ah, I understand. I thought that if you use as.SingleCellExperiment(seu, assay = "RNA") you only keep raw counts and no batch-corrected data, so it would not be sufficient for miloDE. But you are right: PCA is included in that object, and since the PCA was calculated only on the integrated assay, it is a batch-corrected embedding. Maybe an important note for other users, if you include that in your vignette: if you leave out assay = "RNA" and just run as.SingleCellExperiment(seu), PCA/UMAP are not included.
Sorry, just one remaining question. What's the right way for miloR for DA analysis then? Also as.SingleCellExperiment(seu, assay = "RNA"), or the integrated assay? Because unlike miloDE, miloR works fine if you use the integrated assay, but maybe it's still technically wrong?
Thank you!
As for the main part of miloR (neighbourhood assignment and estimation of DA), I'm pretty sure the only thing you need is your embedding (PCA on the integrated assay in your case) and the cell/sample composition, so it shouldn't matter which assay you use for miloR, integrated or RNA. (I assume you are asking since you already performed your milo analysis on the integrated assay and are wondering if you should redo it. I believe you don't need to; if you want another confirmation, though, you probably want to open an issue on the miloR GitHub.)
But I should mention, though, that there are other functions in milo for the downstream analysis (such as findNhoodMarkers, for example, but possibly more), and they require the original logcounts or counts assay. So if you used this function on the integrated assay from the conversion, you had your logcounts 'wrong' and you would probably need to redo this part of the analysis.
Thank you again.
I reran miloR with the RNA assay (including PCA/UMAP based on the integrated assay) instead of the integrated assay. It took much longer, so I followed Mike's advice and used refinement_scheme = "graph" in makeNhoods and fdr.weighting = "graph-overlap" in testNhoods. Results were similar, but I got more neighbourhoods than with the integrated assay (and top 2k variable genes).
So this issue can be closed now I think, thanks again.
For the vignette: Seurat users need to remember to explicitly name the assay argument in as.SingleCellExperiment and use the RNA assay.
sorry for necrobumping: just one additional comment for future me and other users:
If you use Harmony instead of Seurat CCA for batch removal, PCA is NOT batch-corrected. Then HARMONY should be used in assign_neighbourhoods as reducedDim_name, I think.
Hi @amissarova,
As someone who has used miloR before, this looks like a great package that we are eager to explore more. I have a couple of questions I wanted to ask, if okay:
Is there a recommended dimensional reduction to use as an input when converting a Seurat object to SCE? We frequently use Harmony for integration/batch correction - should the Harmony embeddings be used? Or is it preferable to use the PCA embeddings without running Harmony? I've run a prelim ~150k dataset using the Harmony embeddings so far and am running a separate analysis looking at the PCA alone. Any example code you may have would be appreciated.
Thanks for #27 - that has been very helpful in speeding up run times. At the moment, we are running
without pre-filtering out any neighborhoods. My understanding of the documentation is that this should not affect the output; just the runtime. Is that accurate?
In your design of this package, have you noticed any issues where there are minimal DEGs across clusters between disease and control but miloDE is able to identify genes/modules that are differentially expressed? We've noticed that with one of our datasets (~150k PBMCs) where pseudobulk analyses of our clusters (using limma-voom) results in essentially no DEGs between conditions. However, using the Harmony embeddings as input to miloDE resulted in DEGs and two modules using scWGCNA whose pathway analysis is biologically consistent with what we understand from animal studies. It was hard reconciling the drastic difference between cluster-dependent (traditional) and cluster-independent analyses. It may be that certain PBMCs (e.g., T cells) live in a more continuous cell state than the discrete clusters we annotated them as.
Thanks! Looking forward to using this package more often and re-running some of our old work with this to, perhaps, get better biological insight.