MarioniLab / miloDE

Framework for sensitive DE testing (using neighbourhoods)
Other
61 stars 2 forks source link

Interoperability with Seurat and comparison with traditional (clustered) DE analysis #29

Closed learning-MD closed 1 year ago

learning-MD commented 1 year ago

Hi @amissarova,

As someone who has used miloR before, this looks like a great package that we are eager to explore more. I have a couple of questions I wanted to ask, if okay:

  1. Is there a recommended dimensional reduction to use as an input when converting a Seurat object to SCE? We frequently use Harmony for integration/batch correction - should the Harmony embeddings be used? Or is it preferable to use the PCA embeddings without running Harmony? I've run a prelim ~150k dataset using the Harmony embeddings so far and am running a separate analysis looking the PCA alone. Any example code you may have would be appreciated.

  2. Thanks for #27 - that has been very helpful in speeding up run times. At the moment, we are running de_test_neighbourhoods without pre-filtering out any neighborhoods. My understanding of the documentation is that this should not affect the output; just the runtime. Is that accurate?

  3. In your design of this package, have you noticed any issues where there are minimal DEGs across clusters between disease and control but miloDE is able to identify genes/modules that are differentially expressed? We've noticed that with one of our datasets (~150k PBMCs) where pseudobulk analyses of our clusters (using limma-voom) results in essentially no DEGs between conditions. However, using the Harmony embeddings as input to miloDE resulted in DEGs and two modules using scWGCNA whose pathway analysis is biologically consistent with what we understand from animal studies. It was hard reconciling the drastic difference between cluster-dependent (traditional) and cluster-independent analyses. It may be that certain PBMCs (e.g., T cells) live in a more continuous cell state than the discrete clusters we annotated them as.

Thanks! Looking forward to using this package more often and re-running some of our old work with this to, perhaps, get better biological insight.

mihem commented 1 year ago

I (and I think many others) would be also very intersted in those answers, espeicially concerning the correct input.

An official response e.g. from @amissarova would be greatly appreciated.

Based on answers from MikeDMorgan in miloR https://github.com/MarioniLab/miloR/issues/151, I think using harmony/integrated assay from Seurat would be the counterpart to pca.corrected that milo team uses in assign_neighborhoods. Correct?

I think it is confusing that miloDE uses different functions than miloR. So e.g. is miloDE::assign_neighbourhoods the same as miloR::buildGraphs and miloR::makeNhoods ?

One pitfall is that when you use the integrated (so batch corrected) input from Seurat with as.SingleCellExperiment(seu, assay = "integrated) you only have a logcounts, but not a counts assay. This works fine for miloR, but does not work for miloDE, e.g. for assign_neighbourhoods.

Thanks!

amissarova commented 1 year ago

Hey @learning-MD and @mihem , thanks for your interest in the package. I'll try to answer to both lines of questions in one thread, pls let me know if something is unclear or if you have follow-up questions.

  1. Choice of dimensionality reduction. We found that dimensionality reduction choice, at least for DE testing, matters and affects the sensitivity of the detection. We contemplate on this in the Supplementary Note, so if you are interested, please have a look here: https://www.biorxiv.org/content/10.1101/2023.03.08.531744v1.supplementary-material (Supplementary Note 1). The TLDR version of this note boils down to this. The ~‘continuous’ miloDE nature can run into a circular, self-contradictory conundrum: we use transcriptome to group cells together into transcriptionally ‘very similar’ cells, however, then we strive to find transcriptional differences between the cells. Practically this conundrum can be, at least partially, resolved if we define that case-specific variance can be decomposed in common (between case and control) variance (i.e. cell types) and case-only variance (i.e. disease genes), and therefore in miloDE we aim to detect case-only specific variance. In other words, crudely speaking, we can think of each gene contributing mainly to the common variance (i.e. cell type markers) or disease-specific variance. What this practically means, in terms of the embedding, is that you want the embedding that is originally constructed based on the control/shared variance, with case data being post-hoc ‘mapped’ onto the embedding (and therefore case variance doesn't contribute to the embedding and can be well detected with miloDE). Methods that are specifically tailored for this task fall into the category ‘supervised reference-based learning’, and the most commonly and successfully used methods (up to my knowledge) would be scArches and Azimuth mappings (fro Azimuth, good step-by-step manual can be found here https://satijalab.org/seurat/articles/integration_mapping.html ). Strictly speaking, MNN is an unsupervised approach however it would somewhat depend on the choice of genes you submit. Specifically, I recommend that if you use MNN, you ensure that your HVG selection is extracted using reference data only - thus you minimise the probability of selecting case-specific DE genes, and in this case the performance of miloDE likely to resemble supervise reference-based approaches. Specific to Harmony - unfortunately, I dont have the in-depth practical knowledge on how it would perform since I never tested it for miloDE. Theoretically, I can imagine that in Harmony disease-specific variance is likely to be treated as a batch effect and be corrected for (which is good for our cause actually). My concern however will lie with how Harmony might confuse real biological differences between replicates from the same condition with the batch effect, and this difference will be overlaid on top of the other in the embedding which might lead to misleading FPs (you detect some genes to be DE whereas they are actually more reference-specific, subtle within-CT genes). But to reemphasize - I dont really know how Harmony would perform, and it is a well-known batch correction method (which is ultimately what you want for the embedding for miloDE) - so it is very possible it will be completely fine. I personally often run MNN-corrected PCA (ensuring that HVGs were selected from control data only), and I’m generally satisfied with that.

I hope this answer clarifies a bit how I recommend people to think about embedding choices when they apply miloDE.

  1. How filtering of neighbourhoods affects DE detection. Filtering of neighbourhoods is potentially good not only for run time, but for detection as well. By a priori discarding neighbourhoods that are very likely to be of uninterest (and therefore likely to have high p-values), you mitigate the severity for multiple testing correction. In principle, we propose a way to identify neighbourhoods with ‘no signs of DE’ a priori, by adapting Augur (https://github.com/neurorestore/Augur) on a neighbourhood basis (miloDE::calc_AUC_per_neighbourhood). I want to emphasize though, that in the presence of batch effects (which is almost the case), unfortunately, using Augur we can not distinguish batch effect from potential DE, and after running this rather time-consuming part of the algorithm, you might end up with a very little gain (i.e. vast majority of the neighbourhoods will be selected for the downstream analysis - negligible gain in terms of multiple testing correction). But, to be fair, another class of neighbourhoods that our Augur-based neighbourhood selection identifies as inappropriate for DE testing - are neighbourhodos with very few cells in at least one condition which sometimes is very possible, so it might be useful. To put this confusing paragraph in a more detailed pipeline, I suggest the next approach toward neighbourhood selection.

before running DE testing itself, assess your data in 2 ways: 1 - do you anticipate batch effects in your data or rather not? 2 - do you observe a lot of highly DA neighbourhoods (say with less than 5-10 cells in at least one condition). If you do not anticipate much batch effect - consider running nighbourhood selection: miloDE::calc_AUC_per_neighbourhood. You can use a standard cut-off of AUC of 0.5 to select neighbourhoods for follow-up DE testing. If you anticipate batch effect but you see that you have quite a lot of neighbourhoods with very few cells in at least one condition - consider running a custom script that quickly scans those neighbourhoods and discards them from the DE testing (it will be much quicker than building Augur classifiers). Atm it is not supported in the main branch, but I consider adding it later and let me know if that's of interest, I can help drafting a script to do just that. Finally, if you anticipate batch effect, as well as there, are not many imbalanced neighbourhoods - I think you can safely proceed to DE testing without any neighbourhood selection.

  1. discrepancies between ‘bulk’ and ‘miloDE’ approaches. Yeah, that is possible and in fact, thats the main reason we developed miloDE :) I guess there are maybe a few scenarios why this happens, but the one which we advocate is most likely and the advantage of miloDE is that it is much more sensitive than bulk method and is capable of detecting DE differences in very local regions of the manifold (sub-cell types to the most granular level). This is particularly relevant for continuous cell types so if this is your case, what you observe is not too far reached.

If in your case the difference between ‘bulk’ and ‘miloDE’ seems overwhelming, I would recommend manually expecting miloDE results to provide more confidence in the results.

Few suggestions on what you can do:

you say you detect 2 modules that agree with expectations based on animal studies. Plot a few genes from the modules individually (you can use miloDE::plot_DE_single_gene as well as simply plot UMAPs colored by counts and faceted by conditions). Does this look like what you would expect based on the ‘module signature’? In the modules signature plots - do neighbourhoods with high DE seem to be close to each other or are they rather sporadic? If they seem too sporadic - this is the sign of FP detection, but if they seem rather coherent and near by - then it is something interesting to investigate further (i.e. potentially relevant). If the neighbourhoods cluster together and assume they are annotated within the same CT - what is the difference between the rest of the neighbourhoods of this CT? One way would be to select only reference data and perform DE detection between neighbourhoods (the ones that are flagged as very DE in miloDE VS the rest). Do you see some plausible biological markers that separate two groups of neighbourhoods? I believe that for this task you can use miloR::findNhoodMarkers is tailored to do something akin to what Im suggesting. Finally, using these markers - can you further subcluster your cell type. If yes, do you now observe DE in the sub-cell type of where you expect it to see? If yes - this is the good sign that miloDE happened to be sensitive enough to pick the local sub-cell type difference that bulk DE missed.

  1. Are miloDE::assign_neighbourhoods the same as miloR::buildGraphs and miloR::makeNhoods? Almost, but not quite. Essentially, miloDE::assign_neighbourhoods is a wrapper for these 2 miloR functions, but with some additional tweaks/options: We allow (we actually encourage) the 2nd-order kNN graph assignment (the standard kNN is also possible, you need to select order=1). We provide a neighbourhood refinement step.

In principle, you can use original Milos functions, but in this case, you need to force your k to be rather big as well as you might end up with too many overlapping, sort of redundant, neighbourhoods. So I would encourage you to use miloDE’s assignment functions.

  1. Conversion from Seurat for integrated assay. Yeah, I can see how it is a little annoying. Tbh I’m surprised to see that counts are not converting to SCE. If you could share a toy Seurat ds for which it doesnt work, I could try it myself? I can suggest that one (rather unelegant) workaround would be to rewrite your logcounts to counts in your Seurat object, and then perform the conversion. Since you are converting inetrgated object, I assume that your dimensionality reduction has been already calculated, and in this case you dont really need logcounts for miloDE.
learning-MD commented 1 year ago

Thanks @amissarova - that is an incredibly thorough response! I think Harmony should be okay to use, but I'll test out a couple of other strategies as well - mainly, I may just focus on 1) integrating together the control samples first followed by reference-mapping of the disease onto the controls and 2) not integrating samples in the first place (since, by visualization of the unintegrated samples, I cannot really make out a batch effect - samples were hashtagged, so it goes a long ways in minimizing batch effects). The latter is the reason why I think using the Harmony embeddings as I did is okay.

Trying to plot gene modules as you suggested, one issue I've run into with plot_DE_single_gene and plot_DE_gene_set is that I keep running into the following error that I'm having difficulty troubleshooting:

Error in .check_nhood_stat(nhood_stat, x) : 
  'nhood_stat$Nhood' should be within c(1:ncol(nhoods(x))).

Any suggestions there? Thanks.

amissarova commented 1 year ago

hey, @learning-MD for plotting error - can you open a new issue pls? But judging by error message, it looks like there is mismatch between neighbourhoods ids from the nhood_stat and nhoods(x). DId you ensure that nhood_stat is calculated from the same milo object you are using for plot_DE_single_gene?

Also, when you open an issue - can you print out (assuming your milo object called sce_milo):

dim(nhoods(sce_milo))
unique(nhood_stat$Nhood)
# if you use subset_nhoods, print them out too
setdiff(unique(nhood_stat$Nhood) , c(1:ncol(nhoods(sce_milo))))
amissarova commented 1 year ago

Also @learning-MD , I just realised that when you were asking about neighbourhood filtering, you might have meant filtering at the stage of neighbourhood assignment? In this case yeah, it is mainly for computing time (and on rather rare occasions it might help for multiple testing correction issue, but rarely so)

mihem commented 1 year ago

@amissarova Thanks for your fast and very thorough response.

Actually, I am mainly interested in the 1. and 5. point. So here is as a reproducible example from https://satijalab.org/seurat/articles/integration_mapping.html .

library(Seurat)
library(SeuratData)
InstallData("panc8")

data("panc8")

Let's downsample the dataset so the pipeline runs faster:

seu <- subset(x = panc8, downsample = 500)

And now just run the rest of the code

pancreas.list <- SplitObject(seu, split.by = "tech")
pancreas.list <- pancreas.list[c("celseq", "celseq2", "fluidigmc1", "smartseq2")]

for (i in 1:length(pancreas.list)) {
    pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
    pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst", nfeatures = 2000,
        verbose = FALSE)
}

reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2")]
pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)

library(ggplot2)
library(cowplot)
library(patchwork)

DefaultAssay(pancreas.integrated) <- "integrated"
pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30, verbose = FALSE)

p1 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
p2 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) + NoLegend()
p1 + p2

Now we're done with integration and can continue with milo.

Now the important step: Let's use the integrated assay, because that's the batch-corrected assay.

sce_integrated <- as.SingleCellExperiment(pancreas.integrated, assay = "integrated")

Now miloDE:

set.seed(123)
sce_milo <- assign_neighbourhoods(
  sce_integrated,
  k = 30,
  prop = 0.1,
  d = 30,
  order = 2,
  filtering = TRUE,
  reducedDim_name = "PCA",
  verbose = TRUE
)

But this fails, because there is no counts assay

 Error in fun(arg) : 
  x should contain 'counts' assay that will be used to calculate DE. If counts are stored in different assay, please move them to slot 'counts'.

I could use the RNA assay instead, then the function runs fine. But then I use raw uncorrected counts, right?

sce_rna <- as.SingleCellExperiment(pancreas.integrated, assay = "RNA")
set.seed(123)
sce_milo <- assign_neighbourhoods(
  sce_rna,
  k = 30,
  prop = 0.1,
  d = 30,
  order = 2,
  filtering = TRUE,
  reducedDim_name = "PCA",
  verbose = TRUE
)

I find it confusing that the miloR functions work fine with the integrated (so no counts assay).

library(miloR)
milo_integrated <- buildGraph(sce_integrated, k = 30, d = 30)
milo_integrated <- makeNhoods(milo_integrated, prop = 0.1, k = 30, d = 30, refined = TRUE)

And even more that the author of Augur says that It's fine to use raw RNA counts as input. https://github.com/neurorestore/Augur/issues/21

I think a lot of users would be very happy, if you could provide a vignette starting with a Seurat object. E.g. this is very nicely done in Harmony: http://htmlpreview.github.io/?https://github.com/immunogenomics/harmony/blob/master/docs/SeuratV3.html

Thank you!

learning-MD commented 1 year ago

Thanks @amissarova - the plot_DE_gene functions ran fine with me just reloading the milo object. Regarding some of the suggestions you made about assessing for whether the two modules were false positives or true, below are some plots of the cell identities, module DE, and a few individual genes from a module. Of note, this is related to an autoimmune disease, so we expect multiple immune cell types to be involved (based on animal models, for example, monocytes, NK cells, and CD4 T cells have all been implicated with interferon gamma thought to be playing a key role).

The below were built using the default parameters from the vignette. With that in mind, not sure whether you'd consider the below to be "sporadic" and inconsistent with true positives:

image image image image

Apologies for these potentially basic questions - it's our first foray into cluster-free DE analyses.

amissarova commented 1 year ago

@learning-MD , it is very hard for me to give a proper answer for the system I don't really know so I cant help you much w it. few random thoughts:

learning-MD commented 1 year ago

Thanks @amissarova - appreciate the guidance.

mihem commented 1 year ago

@amissarova just a friendly reminder of my last message in case you missed it because this issue is so busy. It was just yesterday, so please don't feel rushed.

amissarova commented 1 year ago

Hey @mihem,

  1. I looked through your pipeline and I think the issue with moving from Seurat to SCE consists of a few sub-issues. Below I will paste how to resolve these issues - just note that I decided not to do downsampling since 500 cells seem a bit too low for me to work with; also, I calculate UMAPs post hoc, you will see why later:

In the pipeline, when you run this line:

sce_integrated <- as.SingleCellExperiment(pancreas.integrated, assay = "integrated")
> sce_integrated
class: SingleCellExperiment 
dim: 2000 5683 
metadata(0):
assays(1): logcounts
rownames(2000): COL3A1 COL1A1 ... SHPK CRYBB2P1
rowData names(0):
colnames(5683): D101_5 D101_7 ... HP1526901T2D_N8 HP1526901T2D_A8
colData names(9): orig.ident nCount_RNA ... dataset ident
reducedDimNames(1): PCA
mainExpName: integrated
altExpNames(0):

you choose to move integrated modality which is not the original counts/logcounts. It is calculated only on 2k variable genes as well as the values correspond to integrated (or ‘batch-corrected’) expression matrix for all cells. What you really care to save tho is the calculated on this integrated matrix PCA - that will be your embedding to be used by miloDE. So if you want to move both counts/logcounts, you need to run this:

sce_integrated <- as.SingleCellExperiment(pancreas.integrated, assay = "RNA")
sce_integrated
> sce_integrated
class: SingleCellExperiment 
dim: 34363 5683 
metadata(0):
assays(2): counts logcounts
rownames(34363): A1BG-AS1 A1BG ... ZRSR1 pk
rowData names(0):
colnames(5683): D101_5 D101_7 ... HP1526901T2D_N8 HP1526901T2D_A8
colData names(9): orig.ident nCount_RNA ... dataset ident
reducedDimNames(1): PCA
mainExpName: RNA
altExpNames(0):

So now you have the counts and logcounts both stored in your SCE object.

However, this is actually not the right object yet to work with, because it only contains integrated reference data - in https://satijalab.org/seurat/articles/integration_mapping.html, the second part of the vignette shows how now map your query data onto reference. I assume, that for miloDE you are interested in both reference + query - so now you need to project your query data onto reference, and then concatenate. In the vignette from Seurat, they show how to project cell type labels and UMAPs, but we need to project actual PCs. I made a little script to do this - please note that Im really not v proficient with Seurat so it is v possible there are better ways to do it, but this one should work:

library(Seurat)
library(SeuratData)
options(timeout=100)
SeuratData::InstallData("panc8.SeuratData")
seu = panc8

# split combined object by samples
pancreas.list <- SplitObject(seu, split.by = "tech")
for (i in 1:length(pancreas.list)) {
    pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
    pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst", nfeatures = 2000,
        verbose = FALSE)
}

# assign reference and query samples - here I follow the mapping vignette from Seurat
ref_samples = c("celseq", "celseq2", "smartseq2")
query_samples = c("fluidigmc1")

# find anchors and integrate reference samples
reference.list <- pancreas.list[ref_samples]
pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)

# calc PCA for inetgrated data - thats what we actually want ultimately
DefaultAssay(pancreas.integrated) <- "integrated"
pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)

# now lets move to SCE reference object, but with the original RNA assay which contains raw counts as well as normalised
sce_reference = as.SingleCellExperiment(pancreas.integrated , assay = "RNA")

# now lets project pca from reference onto query samples and return SCE with calculated PCA
DefaultAssay(pancreas.integrated) = "integrated"
sce_query = lapply(query_samples , function(current.sample){
  current.seu_query = pancreas.list[[which(names(pancreas.list) == current.sample)]]
  query_anchors <- FindTransferAnchors(reference = pancreas.integrated, query = current.seu_query,
                                         dims = 1:30, reference.reduction = "pca")
  current.pca_query <- TransferData(anchorset = query_anchors, refdata = t(Embeddings(pancreas.integrated[['pca']])), dims = 1:30)
  current.pca_query = current.pca_query[1:nrow(current.pca_query), 1:ncol(current.pca_query)]
  current.seu_query[["pca"]] <- CreateDimReducObject(embeddings = t(as.matrix(current.pca_query)), key = "PC_", assay = DefaultAssay(current.seu_query))
  DefaultAssay(current.seu_query) = "RNA"
  current.sce_query = as.SingleCellExperiment(current.seu_query , assay = "RNA")
  return(current.sce_query)
})
sce_query = do.call(cbind , sce_query)

# lets concatenate reference and query
sce = cbind(sce_reference , sce_query)

# lets calculate UMAP
umaps = as.data.frame( calculateUMAP(t(reducedDim(sce , "PCA"))) )
reducedDim(sce , "UMAP") = umaps

# add ref/query ID
sce$type = sapply(1:ncol(sce) , function(i) ifelse(sce$tech[i] %in% ref_samples , "reference" , "query"))

# lets plot and at least visually assess whether integration worked well (im not v good with Seurat plotting, so it is easier for me to use ggplot)
umaps = cbind(umaps , as.data.frame(colData(sce)))

p1 = ggplot(umaps , aes(x = V1 , y = V2 , col = tech)) +
  geom_point(alpha = .5) +
  theme_bw() +
  facet_wrap(~type)
p2 = ggplot(umaps , aes(x = V1 , y = V2 , col = celltype)) +
  geom_point() +
  theme_bw() +
  facet_wrap(~type)
p = ggarrange(p1,p2,nrow = 2)
p

I will attach the plot in the next comment. Please let me know if smth is unclear. P.S. In this particular dataset, counts are actually floats coz of how reads were aligned. With ~standard 10X sequencing I believe you generally should get integer counts.

  1. Why miloDE needs counts: miloR is looking for DA abundance so virtually doesn't need any assays available but only dimRed together with cell numbers. In turn, miloDE is doing DE and relies on counts. Therefore I introduced the check for the correct SCE format and this check includes checking for the counts slot. In reality, you are right - for the neighbourhood assignment you dont need to have counts slot (same as in miloR), but since it is required after, the function assign_neighbourhoods checks SCE in the same manner as for downstream functions. Hope that makes sense?

  2. And even more that the author of Augur says that It's fine to use raw RNA counts as input. Yeah, i saw that Augur does that for the task (CT ranking). The thing tho is that I adapt Augur for a slightly different task and check different datasets between themselves - using logcounts in my opinion helps to mitigate different sequencing coverage for different datasets. Otherwise, you are much more likely to get AUC > 0.5

  3. Vignette with Seurat. Well, essentially this vignette will boil down to vignette about the conversion between Seurat and SCE + how to integrate with Seurat, and those vignettes already exist. The whole miloDE pipeline relies on SCE object, and the only thing I will need to do in the vignette is to convert from Seurat. But since there seems to be a request, I will consider adding this vignette later on (but possibly not ASAP, I have some other stuff in backlog atm).

amissarova commented 1 year ago

Screenshot 2023-04-01 at 11 50 08

amissarova commented 1 year ago

@mihem , just fyi - i updated the comment response (originally I had it wrong just in case you saw the comment earlier) - please let me know if smth is unclear

mihem commented 1 year ago

@amissarova thanks again for your extensive response.

I think we are making it a little bit more complicated than it needs to be. I am working my own dataset, I just used the pancreas dataset because you asked to provide a toy example, and this is already available. But the pancreas example is more complicated than necessary. I don't have a reference dataset, just a few samples that need to be integrated. Therefore, the follwoing Seurat tutorial is probably more approriate: https://satijalab.org/seurat/articles/integration_introduction.html

# load dataset
LoadData("ifnb")

# split the dataset into a list of two seurat objects (stim and CTRL)
ifnb.list <- SplitObject(ifnb, split.by = "stim")

# normalize and identify variable features for each dataset independently
ifnb.list <- lapply(X = ifnb.list, FUN = function(x) {
    x <- NormalizeData(x)
    x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})

# select features that are repeatedly variable across datasets for integration
features <- SelectIntegrationFeatures(object.list = ifnb.list)

# specify that we will perform downstream analysis on the corrected data note that the
# original unmodified data still resides in the 'RNA' assay
DefaultAssay(immune.combined) <- "integrated"

# Run the standard workflow for visualization and clustering
immune.combined <- ScaleData(immune.combined, verbose = FALSE)
immune.combined <- RunPCA(immune.combined, npcs = 30, verbose = FALSE)
immune.combined <- RunUMAP(immune.combined, reduction = "pca", dims = 1:30)
immune.combined <- FindNeighbors(immune.combined, reduction = "pca", dims = 1:30)
immune.combined <- FindClusters(immune.combined, resolution = 0.5)

I now want to use this immune.combined for miloDE.

So the main question, which I still didn't get: If I use RNA, I have raw counts (assay RNA, slot count) and normalized counts (assay RNA, slot data), but they are not batch-corrected. Is this the right input for miloDE? I think your tutorial suggests that you need pca.corrected, which would be equivalent integrated assay (or harmony assay) in Seurat/Harmony? But then if I use the integrated assay only, raw counts are missing and we only haved 2000 top variable genes as you correctly say.

In Lemur, the other tool that Rahul Satija advertised on Twitter next to miloDE, the author clearly states that you should not use harmony/integrate assay, but use the normalized data from Seurat and the align_harmony from lemur https://github.com/const-ae/lemur/issues/2 In contrast, in miloR, Mike/Emma say that using any batch-corrected data is fine https://github.com/MarioniLab/miloR/issues/151. So for miloR I just used the integrated assay and everything worked fine. But for miloDE this doesn't work because raw counts are required, which make sense for the DE analysis.

Sorry, you may have already answered this partially in your answer, but I couldn't follow your script from

# now lets project pca from reference onto query samples and return SCE with calculated PCA
...

on. And again, there's no need for a reference dataset. Just the question, which assays are needed and how to convert a Seurat object with RNA and integrated assay correctly to a SCE object for the miloDE analysis.

Thank you !

amissarova commented 1 year ago

@mihem, In this case, you just need to:

  1. (You are already doing it). Calculate PCA on your integrated Seurat object (I mean on the integrated assay) - this pca will be used as coordinates of embedding.
  2. Move your Seurat to SCE with as.SingleCellExperiment(your_integrated_seurat, assay = “RNA”). This will have both raw counts for DE as well PCA reduced dimension slot for the embedding
mihem commented 1 year ago

Thanks.

Ah I understand. I thought that if you use

as.SingleCellExperiment(seu, assay = “RNA”)

you only keep raw counts and no batch-corrected data, so not sufficient for miloDE. But you are right, PCA is included in that object and since PCA was only calculated on the integrated assay it's a batch-corrected assay. Maybe an important note for other users of if you include that in your vignette: If you leave out assay = RNA, and just run as.SingleCellExperiment(seu), PCA/UMAP are not included. Sorry just one remaining question. What's the right way for miloR for DA analysis then? Also as.SingleCellExperiment(seu, assay = “RNA”)or integrated assay? Because unlike miloDE, miloR works fine if you the integrated assay, but maybe it's still technically wrong?

Thank you!

amissarova commented 1 year ago

@mihem,

as far as for the main part of miloR - neighbourhood assignment and estimation of DA - I'm pretty sure the only thing you need is your embedding (PCA on integrated assay in your case) and cell/sample composition - and then it shouldn't matter which assay you use for miloR, from the integrated or RNA (I assume you are asking since you already performed your milo-analysis on integrated and you are wondering if you should re-do it -- I believe you don't; if you want another confirmation tho, you prob want to open an issue in miloR github).

But i should mention tho that there are other functions in milo for the downstream analysis (such as findNhoodMarkers for example but possibly more) - and for them you are required original logcounts or counts assay - so if you used this function on integrated assay from the conversion, you had your logcounts 'wrong' and you probably would need to re-do this part of the analysis.

mihem commented 1 year ago

@amissarova
Thank you again.

I reran miloR with RNA assay (including PCA/UMAP based on integrated assay) and not integrated assay. I took much longer so I followed mike's advice and used refinement_scheme = "graph" in makeNhoods and fdr.weighting = "graph-overlap" in testNhoods. Results were similar but I got more neighbourhoods than with the integrated assay (and top 2k variables genes).

So this issue can be closed now I think, thanks again.

For the vignette: Seurat users need to remember to explicitly name the assay argument in as.SingleCellExperiment and use the RNA assay.

mihem commented 1 year ago

sorry for necrobumping: just one additional comment for future me and other users:

If you use Harmony instead of Seurat CCA for batch removal, PCA is NOT batch-corrected. Then HARMONY should be used in assing_neighbourhoods as reducedDim_name I think.