Closed MikeDMorgan closed 2 years ago
Why increase `prop` when you can increase `k`? `k` controls both the size of the nhoods and therefore the resolution of the graph, whilst `prop` controls the graph coverage. I know that past a certain point the nhoods will start to saturate as you increase `prop`; however, I think this point increases as both the total N cells and the total M samples increase. I suspect this is principally an issue for cohort-sized experiments where M is >> 10.
The solution is just to make `makeNhoods` more memory efficient, hence the enhancement tag; it's not critical to the current, and probably most common, use of Milo.
@MikeDMorgan for analyses on use cases like the one you outlined above (>200k cells), do you have any suggestions short of subsetting the data? Thank you for making this tool, very useful!
Hi @tomthomas3000

We have a new way to define neighbourhoods and correct with the spatial FDR that is much more scalable; it is in the devel branch (`devtools::install_github("MarioniLab/miloR", ref="devel")`):

Make your graph as before, then for making nhoods:

```r
milo.obj <- makeNhoods(milo.obj, k=k, prop=props, refined=TRUE, refinement_scheme="graph")
```
Then when it comes to performing the spatial FDR correction:

```r
# with testNhoods
milo.res <- testNhoods(..., fdr.weighting="graph-overlap")
```
This is much more scalable as it circumvents the need to compute pairwise distances. Please do give us some feedback if this works for you.
On a separate note, we anecdotally find that for very large data sets with large numbers of samples, a large `k` (~50-100) and a small `prop` (~0.01-0.1) are also beneficial in reducing the redundancy of too many nhoods.
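Putting the suggestions above together, a minimal sketch of the graph-based workflow on the devel branch; the object, sample column, and design names here are placeholders, not taken from this thread:

```r
library(miloR)

# Build the kNN graph as in the standard workflow
milo <- buildGraph(milo, k = 50, d = 30, reduced.dim = "PCA")

# Graph-based nhood refinement: no distance matrix is computed
milo <- makeNhoods(milo, prop = 0.05, k = 50, d = 30,
                   refined = TRUE, refinement_scheme = "graph")

# Count cells per nhood per sample ("Sample" is a placeholder column name)
milo <- countCells(milo, meta.data = as.data.frame(colData(milo)),
                   sample = "Sample")

# Graph-overlap spatial FDR: calcNhoodDistance can be skipped entirely
da.res <- testNhoods(milo, design = ~ Condition, design.df = design.df,
                     fdr.weighting = "graph-overlap")
```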
Hi @MikeDMorgan. Thank you for your package, and for the great work.
I'm working with a dataset of ~250K cells from 70 donors, and I have followed the vignette to get my results. Indeed I had some memory issues, and running Milo on my data takes around ~8 hours! The function which seems to take the most time to run is `calcNhoodDistance`.

Does the approach you outline above circumvent this step? As I understand it, `testNhoods(..., fdr.weighting="graph-overlap")` takes care of the spatial FDR correction. I just wanted to double-check before I go down the rabbit hole.
Here is my script (simplified):

```r
milo <- Milo(sce)

# Run Milo
milo <- buildGraph(milo, k = 150, d = 30, reduced.dim = "PCA")
set.seed(42) # Set seed for reproducible results in cell sampling
milo <- makeNhoods(milo, prop = 0.05, k = 150, d = 30, refined = TRUE,
                   reduced_dims = "PCA", refinement_scheme = "graph")
da_results <- testNhoods(milo, design = ~ AgeScaled + Sex + Disease + Ever_Smoker,
                         design.df = design, fdr.weighting = "graph-overlap")
```

Thank you,
RR
Hi @zroger49 - yes, using the graph-based approach completely removes the need to compute any distances, so you should see a considerable improvement in computation time.
@MikeDMorgan Hi. Yes, there were substantial improvements in the computational time: the analysis which previously took 8-9 hours can now be completed in ~30 minutes. I just have one question.

The results I obtained in my first analysis (using `calcNhoodDistance`) were significantly different. We saw a higher number of nhoods with differential cell abundance, which seemed to come from different areas of the graph (after converting to annotated cell types, we had significant nhoods for 10 cell types). The code is similar to what I've posted above:
```r
milo <- buildGraph(milo, k = 150, d = 30, reduced.dim = "PCA")
milo <- makeNhoods(milo, prop = 0.05, k = 150, d = 30, refined = TRUE, reduced_dims = "PCA")
milo <- countCells(milo, meta.data = as.data.frame(colData(milo)), sample = "Subject_Identity")
milo <- calcNhoodDistance(milo, d = 30, reduced.dim = "PCA")
da_results <- testNhoods(milo, design = ~ AgeScaled + Sex + Disease + Ever_Smoker, design.df = design)
```
When I ran the analysis again, we only obtained significant nhoods from 3 cell types. Is this difference in results to be expected?
Could you be a little more specific about what you mean by significant nhoods? What did you use as your FDR threshold, and were the previous results all very close to this threshold? You can make a direct comparison by running and saving the results using the two different approaches for the spatial FDR (note you will need to use the same nhood definitions though). It may well be that your nhoods are different between the two methods.

You also have a very large `k`, which could make a difference; have you tried k~100 instead?
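That side-by-side comparison could be sketched as follows; the `"k-distance"` weighting option and the `SpatialFDR` column name are assumptions based on the standard miloR workflow, not quoted from this thread:

```r
# Same Milo object and the same nhoods for both runs; only the spatial FDR
# weighting differs. The distance-based weighting requires calcNhoodDistance
# to have been run first.
res.dist  <- testNhoods(milo, design = ~ Condition, design.df = design.df,
                        fdr.weighting = "k-distance")
res.graph <- testNhoods(milo, design = ~ Condition, design.df = design.df,
                        fdr.weighting = "graph-overlap")

# Compare the corrected p-values nhood-by-nhood
plot(res.dist$SpatialFDR, res.graph$SpatialFDR,
     xlab = "SpatialFDR (k-distance)", ylab = "SpatialFDR (graph-overlap)")
abline(h = 0.05, v = 0.05, lty = 2)  # significance boundary
```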
@MikeDMorgan What I meant by significant nhoods was "neighbourhoods with significantly more cells from the experimental condition compared with controls". The nhoods were different between my analyses; unfortunately I did not save the nhood matrix from the first analysis to make this comparison. I did notice that there was a slight tendency for the SpatialFDR to be closer to 0.05 (which was my threshold).
I also felt a k = 150 was very large. In your tutorials you mention that the average nhood size should be 5*N, with N being the number of subjects. Increasing `k` was the only way I found to achieve this.
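For reference, a quick way to check the average nhood size against that 5*N rule of thumb; this sketch assumes `nhoods()` returns the cells-by-nhoods indicator matrix, as in the standard miloR API:

```r
# Each column of the nhood matrix marks the cells belonging to one neighbourhood
nh.sizes <- colSums(nhoods(milo))
mean(nh.sizes)  # aim for roughly 5 * number of subjects

# miloR also provides plotNhoodSizeHist(milo) for a visual check
```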
Hi @zroger49 From a statistical perspective, if all of your results are located close to your significance boundary, then you need to think about whether or not you would consider these biologically interesting.
We may run into serious memory issues for large data sets, namely with the initial sampling of `makeNhoods`. I have a suspicion that very sparse initial sampling might be sub-optimal on large data sets, and that this should be ~0.3. However, on large data sets (~200,000 cells, ~140 donors), the memory explodes.