MarioniLab / miloR

R package implementation of Milo for testing for differential abundance in KNN graphs
https://bioconductor.org/packages/release/bioc/html/miloR.html
GNU General Public License v3.0

Working with large & complex datasets #108

Closed MikeDMorgan closed 2 years ago

MikeDMorgan commented 3 years ago

We may run into serious memory issues for large data sets, namely with the initial sampling in makeNhoods. I suspect that very sparse initial sampling may be sub-optimal on large data sets, and that prop should be ~0.3. However, on large data sets (~200,000 cells, ~140 donors), memory usage explodes.

emdann commented 3 years ago

Why increase prop when you can increase k?

MikeDMorgan commented 3 years ago

k controls both the size of the nhoods and therefore the resolution of the graph, whilst prop controls the graph coverage. I know that past a certain point the nhoods will start to saturate as you increase prop; however, I think this saturation point increases as both the total number of cells N and the total number of samples M increase. I suspect this is principally an issue for cohort-sized experiments where M is >> 10.

The solution is just to make makeNhoods more memory efficient - hence the enhancement tag; it's not critical to the current, and probably most common, use of Milo.
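The graph coverage that prop controls can be inspected directly rather than guessed at: nhoods() returns a cells-by-nhoods indicator matrix, so the fraction of cells assigned to at least one nhood is a quick saturation check. A minimal sketch, assuming a Milo object `milo` for which makeNhoods has already been run:

```r
library(miloR)
library(Matrix)

# nhoods(milo) is a sparse cells x nhoods indicator matrix;
# a cell is "covered" if it belongs to at least one nhood.
nh <- nhoods(milo)
coverage <- mean(Matrix::rowSums(nh) > 0)
message(sprintf("%.1f%% of cells covered by at least one nhood", 100 * coverage))
```

If coverage barely moves when prop is raised further, the nhoods have saturated and a larger prop mostly adds redundant (and memory-hungry) nhoods.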

tomthomas3000 commented 2 years ago

@MikeDMorgan for analyses on use cases like the one you outlined above (>200k cells), do you have any suggestions short of subsetting the data? Thank you for making this tool, very useful!

MikeDMorgan commented 2 years ago

Hi @tomthomas3000 We have a new, much more scalable way to define neighbourhoods and correct with the spatial FDR; it is in the devel branch (devtools::install_github("MarioniLab/miloR", ref="devel")):

Make your graph as before, then for making nhoods:

milo.obj <- makeNhoods(milo.obj, k=k, prop=prop, refined=TRUE, refinement_scheme="graph")

Then when it comes to performing spatial FDR correction:

# with testNhoods
milo.res <- testNhoods(..., fdr.weighting="graph-overlap")

This is much more scalable as it circumvents the need to compute distances. Please do give us some feedback if this works for you.

On a separate note, we anecdotally find that for very large data sets with many samples, a large k (~50-100) combined with a small prop (~0.01-0.1) also helps reduce the redundancy of having too many nhoods.
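Putting the advice above together, a graph-based pipeline for a cohort-sized dataset might look like the following sketch. The parameter values are illustrative only, and `sample_id`, `condition`, and `design.df` are hypothetical placeholders for your own metadata:

```r
library(miloR)

# Build the KNN graph as usual; a larger k suits cohort-sized data
milo <- buildGraph(milo, k = 60, d = 30, reduced.dim = "PCA")

# Sparse sampling with graph-based refinement: no distance matrix is computed
milo <- makeNhoods(milo, prop = 0.05, k = 60, d = 30,
                   refined = TRUE, refinement_scheme = "graph")

# Count cells per nhood per sample, then test with graph-overlap spatial FDR
milo <- countCells(milo, meta.data = as.data.frame(colData(milo)),
                   sample = "sample_id")
da.res <- testNhoods(milo, design = ~ condition, design.df = design.df,
                     fdr.weighting = "graph-overlap")
```

Note that calcNhoodDistance is deliberately absent: with refinement_scheme="graph" and fdr.weighting="graph-overlap", no distances are needed.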

zroger49 commented 2 years ago

Hi @MikeDMorgan. Thank you for your package, and for the great work.

I'm working with a dataset of ~250K cells from 70 donors, and I have followed the vignette to get my results. Indeed, I had some memory issues, and running Milo on my data takes around 8 hours! The function which seems to take the most time is calcNhoodDistance. Does the approach you outline above circumvent this step? As I understand it, testNhoods(..., fdr.weighting="graph-overlap") takes care of the spatial FDR correction. I just wanted to double-check before I go down the rabbit hole.

Here is my script (simplified)

milo <- Milo(sce)
# Run Milo
milo <- buildGraph(milo, k = 150, d = 30, reduced.dim = "PCA")

set.seed(42) # Set seed for reproducible cell sampling
milo <- makeNhoods(milo, prop = 0.05, k = 150, d = 30, refined = TRUE, reduced_dims = "PCA", refinement_scheme = "graph")
milo <- countCells(milo, meta.data = as.data.frame(colData(milo)), sample = "Subject_Identity")
da_results <- testNhoods(milo, design = ~ AgeScaled + Sex + Disease + Ever_Smoker, design.df = design, fdr.weighting = "graph-overlap")

Thank you,

RR

MikeDMorgan commented 2 years ago

Hi @zroger49 - yes, using the graph-based approach completely removes the need to compute any distances, so you should see a considerable improvement in computation time.

zroger49 commented 2 years ago

@MikeDMorgan Hi. Yes, there were substantial improvements in computation time: the analysis which previously took 8-9 hours now completes in ~30 minutes. I just have one question.

The results I obtained in my first analysis (using calcNhoodDistance) were significantly different. We saw a higher number of nhoods with differential cell abundance, which seemed to come from different areas of the graph (after converting to annotated celltypes, we had significant nhoods for 10 celltypes). The code is similar to what I've posted above:

milo <- buildGraph(milo, k = 150, d = 30, reduced.dim = "PCA") 
milo <- makeNhoods(milo, prop = 0.05, k = 150, d = 30, refined = TRUE, reduced_dims = "PCA")
milo <- countCells(milo, meta.data = as.data.frame(colData(milo)), sample = "Subject_Identity")
milo <- calcNhoodDistance(milo, d=30, reduced.dim = "PCA")
da_results <- testNhoods(milo, design = ~ AgeScaled + Sex + Disease + Ever_Smoker, design.df = design)

When I ran the analysis again, we only obtained significant nhoods from 3 celltypes. Is this difference in results to be expected?

MikeDMorgan commented 2 years ago

Could you be a little more specific about what you mean by significant nhoods? What did you use as your FDR threshold, and were the previous results all very close to this threshold? You can make a direct comparison by running and saving the results using the two different approaches to the spatial FDR (note you will need to use the same nhood definitions). It may well be that your nhoods differ between the two methods.
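The direct comparison on identical nhoods only requires calling testNhoods twice with different fdr.weighting values, since the nhood definitions are fixed once makeNhoods has run. A sketch, assuming a Milo object `milo` that already has nhoods, counts, and nhood distances, and a hypothetical `design.df` of sample metadata:

```r
# Same nhoods, two spatial FDR weighting schemes
res.dist  <- testNhoods(milo, design = ~ Disease, design.df = design.df,
                        fdr.weighting = "k-distance")
res.graph <- testNhoods(milo, design = ~ Disease, design.df = design.df,
                        fdr.weighting = "graph-overlap")

# Nhoods are in the same order, so SpatialFDR values are directly comparable
plot(res.dist$SpatialFDR, res.graph$SpatialFDR,
     xlab = "k-distance SpatialFDR", ylab = "graph-overlap SpatialFDR")
abline(0, 1, lty = 2)
```

Points far from the diagonal would indicate nhoods whose significance is driven by the choice of weighting scheme rather than by the underlying counts.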

You also have a very large k which could make a difference, have you tried k~100 instead?

zroger49 commented 2 years ago

@MikeDMorgan What I meant by significant nhoods was "neighbourhoods with significantly more cells from the experimental condition compared with controls". The nhoods were different between my analyses; unfortunately, I did not save the nhood matrix from the first analysis to make this comparison. I did notice a slight tendency for the SpatialFDR values to sit closer to 0.05 (which was my threshold).

I also felt that k = 150 was very large. In your tutorials you mention that the average nhood size should be 5*N, with N being the number of samples. Increasing k was the only way I found to achieve this.
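The average-nhood-size rule of thumb can be checked directly rather than tuned blind: nhood sizes are the column sums of the nhood matrix, and miloR also provides plotNhoodSizeHist() for the same purpose. A sketch, using the 70-donor count from this thread:

```r
library(miloR)
library(Matrix)

# Nhood sizes are column sums of the sparse cells x nhoods matrix
nh.sizes <- Matrix::colSums(nhoods(milo))

n.samples <- 70                      # number of donors in this dataset
mean(nh.sizes) >= 5 * n.samples      # TRUE if the rule of thumb is met

plotNhoodSizeHist(milo)              # histogram of nhood sizes
```

Running this after each trial value of k avoids over-shooting k just to satisfy the size heuristic.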

MikeDMorgan commented 2 years ago

Hi @zroger49 From a statistical perspective, if all of your results sit close to your significance boundary, then you need to think about whether or not you would consider them biologically interesting.