RStudio crashing when clustering a large dataset (Issue #20)
Stein-ErikG opened this issue 1 year ago
Hi, so sorry I didn't see this sooner. I'm not sure if you have already found a solution, but were you able to examine the memory consumption during the clustering? And have you tried running it directly in Python?
@Stein-ErikG Hi again, I'm wondering whether you have tried using just a few million cells to see whether it is in fact the sample size that is causing the crash.
Hi, when I have fewer cells to cluster it is much more stable and does not crash. I have not tried to use the code directly in parc, as I only know R. I now have even more cells to cluster, so I have to revisit this. Are there any parameters that come to mind that I could change to help me out? I would like to do a fast and simple clustering as a first step, then do more high-quality clusterings later on individual clusters. I will have a look at memory usage during a big clustering.
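One way the memory consumption could be watched while the clustering runs is a small background logger on the Python side. This is only a minimal sketch, assuming the psutil package is installed; the 5-second interval and the placement of the PARC call are placeholders, not part of the original setup:

import os
import threading

import psutil

def log_memory(stop_event, interval=5):
    # print the resident memory of the current process every `interval` seconds
    proc = psutil.Process(os.getpid())
    while not stop_event.is_set():
        print(f"resident memory: {proc.memory_info().rss / 1e9:.1f} GB")
        stop_event.wait(interval)

stop = threading.Event()
watcher = threading.Thread(target=log_memory, args=(stop,), daemon=True)
watcher.start()

# ... run the PARC clustering here ...

stop.set()
watcher.join()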
Hi Erik, how many cells are you clustering, and what are the dimensions of your input matrix? I typically find Python easier to use for large datasets. Maybe keep knn at 30+, small_pop at 10-20, and jac_std_global at 0.15 or so?
I'm happy to quickly test your data for speed/memory usage if you want to send me a "sanitized" version, as I don't have a good sense of what your data is like.
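For reference, a minimal sketch of a pure-Python PARC call using the suggested parameter values; the input array X below is a random placeholder standing in for the exported expression matrix, and the exact values are only the ballpark suggestions from above:

import numpy as np
import parc

# placeholder: a (cells x markers) matrix; in practice this would be the
# exported expression matrix (~15 million rows x ~20 columns)
X = np.random.rand(100_000, 20).astype(np.float32)

p = parc.PARC(
    data=X,
    knn=30,               # suggested: 30 or higher
    small_pop=15,         # suggested: 10-20
    jac_std_global=0.15,  # suggested: ~0.15
    num_threads=-1,       # use all available cores
)
p.run_PARC()
labels = p.labels  # one cluster label per cell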
I have previously been using 15 million cells, clustering on around 20 channels/columns. Should I email you a file for you to test, or send it a different way?
I will definitely test your suggestions!
Hi, you can email me, but can I ask what the upper limit is that you can successfully cluster at the moment?
I mean, how many millions of cells are you able to cluster without crashing?
I have been able to cluster 10 million cells with my setup, but there I have left some samples out so that I don't get these crashes.
Cool, I'm curious how long that takes? I'm happy to try out the 15 million directly in Python to see if that helps and let you know. shobana.venkat88@gmail.com
Damn, I don't remember exactly; a few hours, I think (I will test again tomorrow). I will try to share a Dropbox link to a single file containing non-patient data. I can't share the whole dataset with you, since it's patient data, but maybe you could test on this file?
Maybe you could share the Python code you use with me, and then I can test a big clustering on my Mac with Python, if you are willing? :)
Hi, sure, no problem to share the code I use. But just to make sure I understand your setup: you are currently using PARC from R, and have not yet tried PARC in a pure Python script? Dropbox sounds good, and I understand that you cannot share all the data :) Don't worry about re-running on the ten million; I was just curious, as I have not tested on such large datasets.
Yes, you are correct. I have used R and not yet a pure Python script.
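As a rough idea of what a pure-Python run could look like, here is a minimal sketch assuming the expression matrix has first been exported from R to a CSV file; the file names, the export step, and the parameter values (copied from the R call further down) are placeholders, not the code the maintainer will share:

import numpy as np
import pandas as pd
import parc

# placeholder file: a cells x markers table exported from R,
# e.g. with write.csv(scdf, "exprs_matrix.csv", row.names = FALSE)
df = pd.read_csv("exprs_matrix.csv")
X = df.to_numpy(dtype=np.float32)  # float32 halves memory vs float64

p = parc.PARC(
    data=X,
    knn=30,
    small_pop=2000,
    resolution_parameter=1.0,
    hnsw_param_ef_construction=150,
    num_threads=20,
)
p.run_PARC()

# save labels so they can be read back into R and attached to the SCE object
pd.Series(p.labels, name="cluster_id").to_csv("parc_labels.csv", index=False)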
Hi,
Thank you so much for the excellent package! I have managed to get your code working in RStudio (using reticulate) on my new MacBook Pro M1 Max 64 GB RAM laptop. But when I am clustering my mass cytometry dataset of 15 million cells, RStudio crashes after some time with a fatal error and no further info. When I look at the cores during the clustering, all cores are working.
I was hoping you might have some suggestions on how to cluster this dataset, and other, larger ones that I have (30 million cells)? I would like to use all cells. Or is it just my Mac that cannot handle such big datasets?
I use CATALYST to read the data, and then I export the single-cell data into a large matrix (661,330,692 elements, 5.3 GB). This is then passed to PARC. I am clustering on 20 parameters/columns. I have changed small_pop to make the clustering easier to handle. Are there any other parameters that I could change? My code is below.
scdf <- data.frame(t(assay(sce, "exprs")[c(type_markers(sce), state_markers(sce)), ]),
                   sample_id = sce$sample_id, check.names = FALSE)
scdf <- as.matrix(scdf[, 1:(ncol(scdf) - 1)])

tic()
Sys.setenv("RETICULATE_PYTHON" = "path")
library(reticulate)
reticulate::py_config()
markers <- c(1:length(ToUse))
parc <- import("parc")
parc1 <- parc$PARC(
  data = scdf[, markers],
  num_threads = 20L,
  resolution_parameter = 1,  # defaults to 1. expose this parameter in leidenalg
  small_pop = 2000L,         # smallest cluster population to be considered a community
  knn = 30L,
  hnsw_param_ef_construction = 150L)
parc1$run_PARC()
clusters <- as.factor(unlist(parc1$labels))
levels(clusters) <- as.character(as.numeric(levels(clusters)) + 1)
sce@colData@listData[["cluster_id"]] <- as.factor(clusters)
sce@metadata[["cluster_codes"]] <- data.frame("PARC" = as.factor(levels(factor(clusters))))

rm(scdf)
toc()