
RStudio crashing when clustering a large dataset #20

Open · Stein-ErikG opened this issue 1 year ago

Stein-ErikG commented 1 year ago

Hi,

Thank you so much for the excellent package! I have managed to get your code working in RStudio (using reticulate) on my new MacBook Pro (M1 Max, 64 GB RAM). But when I cluster my mass cytometry dataset of 15 million cells, RStudio crashes after some time with a fatal error and no further information. When I watch the cores during the clustering, all of them are working.

I was hoping you might have some suggestions on how to cluster this dataset, and other larger ones that I have (30 million cells)? I would like to use all cells. Or is it simply that my Mac cannot handle such big datasets?

I use CATALYST to read the data, and then I export the single-cell data into a large matrix (661,330,692 elements, 5.3 GB). This is then passed to PARC. I am clustering on 20 parameters/columns. I have raised small_pop to make the clustering easier to handle. Are there any other parameters I could change?

```r
scdf <- data.frame(
  t(assay(sce, "exprs")[c(type_markers(sce), state_markers(sce)), ]),
  sample_id = sce$sample_id,
  check.names = FALSE
)
scdf <- as.matrix(scdf[, 1:(ncol(scdf) - 1)])

tic()
Sys.setenv("RETICULATE_PYTHON" = "path")
library(reticulate)
reticulate::py_config()
markers <- c(1:length(ToUse))
parc <- import("parc")
parc1 <- parc$PARC(
  data = scdf[, markers],
  num_threads = 20L,
  resolution_parameter = 1,  # defaults to 1; exposed from leidenalg
  small_pop = 2000L,         # smallest cluster population to be considered a community
  knn = 30L,
  hnsw_param_ef_construction = 150L
)
parc1$run_PARC()

clusters <- factor(unlist(parc1$labels))
levels(clusters) <- as.character(as.numeric(levels(clusters)) + 1)  # shift 0-based PARC labels to 1-based

sce@colData@listData[["cluster_id"]] <- as.factor(clusters)
sce@metadata[["cluster_codes"]] <- data.frame("PARC" = as.factor(levels(factor(clusters))))

rm(scdf)
toc()
```

ShobiStassen commented 1 year ago

Hi, so sorry I didn't see this sooner. I'm not sure if you have already found a solution, but were you able to examine the memory consumption during the clustering? And have you tried running it directly in Python?
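
One way to examine memory while the clustering runs, as a minimal sketch in Python (assuming `psutil` is installed; this is not code from the thread):

```python
# Sketch: sample this process's resident memory from a background thread
# while PARC runs, printing usage every 30 seconds.
import threading
import time

import psutil


def log_memory(interval_s=30):
    proc = psutil.Process()  # the current process
    while True:
        rss_gb = proc.memory_info().rss / 1e9
        print(f"RSS: {rss_gb:.1f} GB")
        time.sleep(interval_s)


threading.Thread(target=log_memory, daemon=True).start()
# ... run PARC here; the daemon thread keeps printing until the script exits
```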

ShobiStassen commented 1 year ago

@Stein-ErikG hi again, I'm wondering whether you tried just using a few million cells to see if it is in fact the sample size that is causing the crash.
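
A quick way to test that, sketched in Python with numpy (the file name and subsample size are placeholder assumptions, not from this thread):

```python
# Sketch: cluster a random subsample first to see whether size alone
# is what triggers the crash.
import numpy as np
import parc

data = np.load("exprs_matrix.npy")  # placeholder: cells x markers matrix
rng = np.random.default_rng(0)
idx = rng.choice(data.shape[0], size=2_000_000, replace=False)  # 2M-cell subsample

p = parc.PARC(data[idx], num_threads=20, knn=30)
p.run_PARC()
```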

Stein-ErikG commented 1 year ago

Hi, when I have fewer cells to cluster it is much more stable and does not crash. I have not tried to use the code directly in Python, as I only know R. I now have even more cells to cluster, so I have to revisit this. Do any parameters come to mind that I could change to help me out? I would like to do a fast and simple clustering as a first step, then do higher-quality clusterings later on individual clusters. I will have a look at memory usage during a big clustering.

ShobiStassen commented 1 year ago

Hi Erik, how many cells are you clustering, and what are the dimensions of your input matrix? I typically find Python easier to use for large datasets. Maybe keep knn at 30+, small_pop at 10-20, and jac_std_global at 0.15 or so?
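
In PARC's Python API, those settings would look roughly like the sketch below (illustrative only; `data` stands in for the cells x markers matrix):

```python
# Sketch of the suggested settings in a direct Python call to PARC.
import parc

p = parc.PARC(
    data,                 # cells x markers numpy array (placeholder)
    knn=30,               # keep at 30 or higher for large datasets
    small_pop=15,         # smallest population kept as its own cluster (10-20 suggested)
    jac_std_global=0.15,  # global edge-pruning threshold for the kNN graph
    num_threads=20,
)
p.run_PARC()
labels = p.labels
```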

I'm happy to quickly test your data for speed/memory usage if you want to send me a "sanitized" version, as I don't have a good sense of what your data is like.

Stein-ErikG commented 1 year ago

I have previously been using 15 million cells, clustering on around 20 channels/columns. Should I email you a file for you to test, or send it a different way?

I will definitely test your suggestions!

ShobiStassen commented 1 year ago

Hi, you can email me, but can I ask what is the upper limit you can successfully cluster at the moment?

ShobiStassen commented 1 year ago

I mean: how many millions of cells are you able to cluster without crashing?

Stein-ErikG commented 1 year ago

I have been able to cluster 10 million cells with my setup, but there I left some samples out so that I don't get these crashes.

ShobiStassen commented 1 year ago

Cool, I'm curious how long that takes? I'm happy to try out the 15 million directly in Python to see if that helps and let you know: shobana.venkat88@gmail.com

Stein-ErikG commented 1 year ago

Damn, I don't remember exactly; a few hours, I think (I will test again tomorrow). I will try to share a Dropbox link with a single file containing non-patient data. I can't share the whole dataset with you, as it's patient data, but maybe you could test on this file?

Maybe you could share the Python code you use with me, so I can test a big clustering on my Mac with Python? If you are willing :)

ShobiStassen commented 1 year ago

Hi, sure, no problem sharing the code I use. But just to make sure I understand your setup: you are currently using PARC from R, and have not yet tried PARC in a pure Python script? Dropbox sounds good, and I understand that you cannot share all the data :) Don't worry about re-running on the ten million; I was just curious, as I have not tested on such large datasets.
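
The shared script itself is not posted in the thread; a pure-Python run might look roughly like this sketch (file names and parameter values are placeholder assumptions):

```python
# Sketch: end-to-end pure-Python PARC run on a large exported matrix.
# Assumes the expression matrix was exported from R, e.g. via
# data.table::fwrite(as.data.frame(scdf), "exprs_matrix.csv").
import time

import numpy as np
import pandas as pd
import parc

# Load as float32 to roughly halve memory relative to float64.
data = pd.read_csv("exprs_matrix.csv").to_numpy(dtype=np.float32)  # cells x markers

t0 = time.time()
p = parc.PARC(data, num_threads=20, knn=30, small_pop=15, jac_std_global=0.15)
p.run_PARC()
print(f"clustered {data.shape[0]:,} cells in {time.time() - t0:.0f} s")

# Write labels out so they can be read back into R.
np.savetxt("parc_labels.csv", np.asarray(p.labels), fmt="%d")
```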

Stein-ErikG commented 1 year ago

Yes, you are correct. I have used R and not yet a pure Python script.
