ImmuneDynamics / Spectre

A computational toolkit in R for the integration, exploration, and analysis of high-dimensional single-cell cytometry and imaging data.
https://immunedynamics.github.io/spectre/
MIT License

Clustering datasets with >50 million cells #185

Closed AbbeyFigliomeni closed 3 months ago

AbbeyFigliomeni commented 6 months ago

Hi,

Does anyone have experience clustering datasets in excess of 50 million cells? Mine has 59 million cells (roughly 1 million per person on average), and I keep getting the error "cannot allocate vector of size xx Gb/Mb". FYI, the dataset contains 14 clustering markers of interest.

I understand the FlowSOM algorithm is not really designed for datasets this large, but I would prefer not to subset my data prior to clustering, to avoid losing data or introducing effects from random sampling.

Any suggestions would be greatly appreciated! :)

AbbeyFigliomeni commented 6 months ago

FYI, this data has already been pre-processed, with dead cells, doublets, and debris excluded.

tomashhurst commented 6 months ago

Hi Abbey,

Are you seeing this at the FlowSOM step or earlier? This mostly comes up as a data-handling issue; we have run datasets of that size through FlowSOM without problems. It has come up in some other functions before, but we have a couple of options to get around it.

Tom


AbbeyFigliomeni commented 6 months ago

Hi Tom, thanks for your super prompt response! The issue appears when running the run.flowsom command. I use a script that extracts my data directly from my FlowJo workspace using CytoML, which I have used many times in the past without issues. I also just ran the exact same dataset with the same script, but with my GatingSet object (CytoML) subsetted on a downstream gate containing far fewer cells (I can provide code if it helps, but essentially the data has exactly the same parameter scaling/structure, just 800,000 events instead of 59 million), and it completed the clustering and dimensionality reduction without issues. Could it be my processor only being a Ryzen 5? This same computer took 10 hours to cluster a previous dataset with 40 million events. What are some workarounds that have worked in the past?
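
For reference, a minimal sketch of the kind of CytoML extraction I mean (not my exact script; the workspace path, group, and gate name below are placeholders):

```r
library(CytoML)
library(flowWorkspace)
library(flowCore)
library(data.table)

# Open the FlowJo workspace and convert it to a GatingSet (path and group are placeholders)
ws <- open_flowjo_xml("experiment.wsp")
gs <- flowjo_to_gatingset(ws, name = 1, path = "fcs_files/")

# Pull events at a downstream gate (placeholder gate path), one frame per sample,
# and stack them into a single data.table for Spectre
cs  <- gs_pop_get_data(gs, "Live/Single Cells/CD45+")
dat <- rbindlist(lapply(sampleNames(cs), function(s) as.data.table(exprs(cs[[s]]))))
```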

SamGG commented 6 months ago

@AbbeyFigliomeni Did you get a numerical value where the message says "xx"? Do you know how much RAM this computer has? As a reference point, I use RStudio on my Windows 10 computer, and in the Environment tab there is a pie chart showing the memory used; if I click on it and ask for a memory usage report, it shows how much RAM is used by RStudio and how much is available on the machine. Alternatively, the command `sum(sapply(ls(), function(x) object.size(get(x))))/1024^3` reports the amount of RAM (in GiB) currently used by objects in your workspace. My two cents...
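
As a copy-pasteable block, that check plus a per-object breakdown (largest objects first):

```r
# Total size (in GiB) of all objects in the global environment
sum(sapply(ls(), function(x) object.size(get(x)))) / 1024^3

# Per-object breakdown, largest first, to spot what is eating the memory
sort(sapply(ls(), function(x) as.numeric(object.size(get(x)))), decreasing = TRUE) / 1024^3
```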

AbbeyFigliomeni commented 6 months ago

Hi Samuel, the number differed depending on which clustering marker channels I deleted or included; it ranged from GB down to MB. I have been watching the pie chart/control panel and the program has been using a large amount of memory; I will check the RAM. Thanks!

RoryCostell commented 6 months ago

I have run into a similar issue with the spatial package, using the stars method to generate the polygons and outlines for quite large IMC images (2000x2000). The error reads "vector memory exhausted (limit reached?)".

I've used the fix from https://stackoverflow.com/questions/51295402/r-on-macos-error-vector-memory-exhausted-limit-reached

R was using around 55 GB of RAM, so this fix extended the memory available to R, including virtual memory.
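
Concretely, the fix from that thread is to raise R's vector memory cap on macOS by adding R_MAX_VSIZE to ~/.Renviron and restarting R (pick a value that suits your machine):

```r
# Open ~/.Renviron for editing (or edit the file by hand)
usethis::edit_r_environ()

# Then add a line like the following and restart R:
# R_MAX_VSIZE=100Gb
```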

Hope this helps.

AbbeyFigliomeni commented 5 months ago

Hi all, thanks for your feedback. I am working on the hypothesis that this is simply a consequence of my desktop having insufficient RAM (16 GB, of which only 15 GB is available to RStudio) to complete the task...
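
A quick back-of-the-envelope check seems to support that: 59 million cells by 14 markers stored as doubles is already around 6 GiB for a single copy of the matrix, before FlowSOM or data.table make any working copies.

```r
cells   <- 59e6
markers <- 14
bytes   <- cells * markers * 8   # double precision = 8 bytes per value
bytes / 1024^3                   # ~6.2 GiB for one copy of the expression matrix
```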

ghar1821 commented 5 months ago

I just tested running FlowSOM on around 50 million cells on my Mac with 24 GB of RAM, and I ran into the same issue. I'll look into the run.flowsom function and see if I can reduce its memory usage.

In the meantime, if you don't have access to a computer with more RAM, an alternative is to subsample the cells, cluster the subsample, and then map the remaining cells onto those clusters (see the sketch below). This is not ideal, as the subsample may miss some cell types, causing them to be merged into other cell types that were captured in the subsample.
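
A rough sketch of that subsample-then-map idea (not a built-in Spectre workflow; it assumes run.flowsom() adds a FlowSOM_metacluster column as in the usual Spectre workflow, and uses the FNN package for a simple nearest-centroid assignment; marker names and the subsample size are placeholders):

```r
library(data.table)
library(FNN)

markers <- c("CD3", "CD4", "CD8")   # placeholder clustering markers

# 1. Cluster a manageable subsample of the full data.table 'dat'
sub <- Spectre::run.flowsom(dat[sample(.N, 2e6)], use.cols = markers, meta.k = 30)

# 2. Summarise each metacluster as a per-marker median centroid
centroids <- sub[, lapply(.SD, median), by = FlowSOM_metacluster, .SDcols = markers]

# 3. Assign every cell in the full dataset to its nearest centroid
nn <- get.knnx(as.matrix(centroids[, ..markers]), as.matrix(dat[, ..markers]), k = 1)
dat[, FlowSOM_metacluster := centroids$FlowSOM_metacluster[nn$nn.index[, 1]]]
```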

Or you can compress your data into supercells using SuperCellCyto (https://github.com/phipsonlab/SuperCellCyto) and run FlowSOM on those supercells. Afterwards, you can expand the supercells back and assign each cell the cluster that its supercell belongs to. Disclaimer: I'm the author of SuperCellCyto.
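
And a very rough sketch of the SuperCellCyto route; the runSuperCellCyto() arguments and the output column names below (supercell_expression_matrix, supercell_cell_map, SuperCellId, CellId) follow the package vignette from memory and may differ, so please check the documentation before using this:

```r
library(SuperCellCyto)
library(data.table)

markers <- c("CD3", "CD4", "CD8")        # placeholder clustering markers
dat[, cell_id := paste0("cell_", .I)]    # a unique id per cell; a 'sample_id' column is assumed

# 1. Compress the full dataset into supercells
sc <- runSuperCellCyto(dt = dat, markers = markers,
                       sample_colname = "sample_id", cell_id_colname = "cell_id")

# 2. Cluster the much smaller supercell expression matrix (e.g. with Spectre's FlowSOM wrapper)
sc_dat <- Spectre::run.flowsom(sc$supercell_expression_matrix, use.cols = markers, meta.k = 30)

# 3. Expand back: map each cell to its supercell, then carry over the supercell's cluster
dat <- merge(dat, sc$supercell_cell_map, by.x = "cell_id", by.y = "CellId")
dat <- merge(dat, sc_dat[, .(SuperCellId, FlowSOM_metacluster)], by = "SuperCellId")
```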

AbbeyFigliomeni commented 5 months ago

Hi all, thanks to everyone who weighed in re: the RAM issue, and to @ghar1821 for the helpful feedback.

Just an update for anyone who is interested or facing the same issue: I managed to substantially reduce the size of my data table by deleting all phenodata columns except the patient identifiers, and by removing all other irrelevant objects from my workspace prior to clustering. The clustering then completed successfully (although it took 4 hours on my trusty 16 GB machine!), and I re-added my phenodata columns afterwards. The rest of the workflow ran as normal.
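
For anyone wanting to do the same, roughly what this looked like as a data.table sketch (the phenodata column names and the cell_id key used for re-merging are placeholders, and the clustering call assumes Spectre's run.flowsom()):

```r
library(data.table)

markers <- c("CD3", "CD4", "CD8")   # placeholder clustering markers

# Set aside the phenodata (placeholder column names), keeping a key for re-merging later
dat[, cell_id := .I]
pheno <- dat[, .(cell_id, patient_id, batch, timepoint)]
dat   <- dat[, c("cell_id", "patient_id", markers), with = FALSE]

# Drop every other object from the workspace and free the memory before clustering
rm(list = setdiff(ls(), c("dat", "pheno", "markers"))); gc()

# Cluster the slimmed-down table, then re-attach the phenodata columns
dat <- Spectre::run.flowsom(dat, use.cols = markers, meta.k = 30)
dat <- merge(dat, pheno, by = c("cell_id", "patient_id"))
```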

tomashhurst commented 3 months ago

@AbbeyFigliomeni nice solution! We'll keep this in mind for when this comes up in future.