Closed by redbybeing 8 months ago
Update: 200 GB of memory still failed. It's strange. I'm currently running again with 500 GB; I can let you know whether it completes and how long it takes.
Job ID: 24787044
Cluster: longleaf
User/Group: jiseokl/users
State: OUT_OF_MEMORY (exit code 0)
Cores: 1
CPU Utilized: 01:01:55
CPU Efficiency: 98.46% of 01:02:53 core-walltime
Job Wall-clock time: 01:02:53
Memory Utilized: 119.01 GB
Memory Efficiency: 119.01% of 100.00 GB

Job ID: 24825030
Cluster: longleaf
User/Group: jiseokl/users
State: OUT_OF_MEMORY (exit code 0)
Cores: 1
CPU Utilized: 02:20:54
CPU Efficiency: 98.37% of 02:23:14 core-walltime
Job Wall-clock time: 02:23:14
Memory Utilized: 268.44 GB
Memory Efficiency: 134.22% of 200.00 GB
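For reference, these summaries look like `seff` output, and the memory request itself is set in the SLURM job script. A minimal sketch of such a script follows; the job name, script filename, and time limit are assumptions, not taken from the thread:

```shell
#!/bin/bash
#SBATCH --job-name=sceptre_analysis
#SBATCH --mem=200G           # memory request; the job is killed OUT_OF_MEMORY if exceeded
#SBATCH --cpus-per-task=1
#SBATCH --time=24:00:00

# Run the analysis single-threaded (parallel = FALSE inside the R script)
Rscript sceptre_analysis.R
```

After the job finishes, `seff <jobid>` reports utilization figures in the format quoted above.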
Hi Jiseok, thanks for the encouraging words! 🙂
No, it should not take 100 GB to analyze your data. Have you tried analyzing your data locally (e.g., on a laptop or desktop)? I recall having analyzed your data on my laptop a few months ago; it took about an hour and used only a few GB of memory. Are you setting `parallel` to `TRUE` when you run on SLURM? Unfortunately, sceptre is not currently configured to run in parallel on clusters, and this could be causing problems.
Yes, `import_data()` looks for gene names starting with "MT" as opposed to "mt". Good catch. I just released an update (0.9.2) that, among other things, fixes this bug. Now, any gene starting with "MT-" or "mt-" is considered a mitochondrial gene. Maybe you could try again and let me know if the fix worked?
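The fixed matching behavior can be sketched in base R (illustrative only; the gene names are made up and the actual implementation inside `import_data()` may differ):

```r
# Hypothetical gene names mixing human ("MT-") and mouse ("mt-") conventions
gene_names <- c("MT-CO1", "mt-Co1", "Actb", "GAPDH", "mt-Nd1")

# A case-insensitive match on the "MT-" prefix flags both conventions
is_mito <- grepl("^MT-", gene_names, ignore.case = TRUE)

gene_names[is_mito]  # "MT-CO1" "mt-Co1" "mt-Nd1"
```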
Yes, `response_n_umis`, `response_n_nonzero`, and `response_p_mito` are the covariates that sceptre by default uses to do cellwise QC. However, you can pass `additional_cells_to_remove`, which allows you to specify any additional cells to remove (by index). So if you wanted to use the `percent.mt` column, you could determine the indices of the cells whose `percent.mt` value exceeds some threshold, and then you could pass these indices to `run_qc()`.
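As a sketch of that workflow (the `covariate_df` data frame, the `percent.mt` column name, and the 10% cutoff are assumptions; `sceptre_object` stands in for your imported object):

```r
# Per-cell covariates, e.g. extracted from Seurat metadata, with a
# custom percent.mt column (one row per cell, in the same cell order
# as the sceptre object)
cells_to_remove <- which(covariate_df$percent.mt > 10)

# Pass those indices to run_qc() on top of the default cellwise QC
sceptre_object <- run_qc(
  sceptre_object,
  additional_cells_to_remove = cells_to_remove
)
```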
I've not seen this behavior before. All 17,627 pairs should be tested. Are you seeing this on the cluster? If so, are you setting `parallel = TRUE`? If you have access to a Mac laptop or desktop, it might be a good idea to run your analysis on that machine (setting `parallel = TRUE`) before moving onto the cluster.
Could you please remind me whether you are in low- or high-MOI? And what the sceptre-estimated MOI of your dataset is? I had not realized that you are using CRISPRko. Understanding a bit more about your data will help me answer this question.
Thanks for the feedback. This kind of feedback is extremely useful as we try to bring the package to a more stable, mature state.
Cheers, Tim
Hi Tim! 👋
1) So it looks like running the job on the cluster could be causing multiple problems. I will try running both on my personal Mac laptop (setting `parallel = TRUE`) and on the cluster (setting `parallel = FALSE` in all functions that have that parameter). But eventually we would like to run everything on the cluster since, you know, that's where all the large data files are stored, and other members need access too, not just me.
2) My data is definitely high-MOI. This is what sceptre computed after `assign_grnas()` with `threshold = 3`, but either way, I consider my dataset high-MOI. And yes, it's CRISPR KO, not CRISPRi, where the effect can be more inconsistent than CRISPRi.
gRNA-to-cell assignment information:
• Assignment method: thresholding
• Mean N cells per gRNA: 901.75
• Mean N gRNAs per cell (MOI): 0.87
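The summary above would come from a call along these lines (a sketch; the object name is assumed, and the exact argument plumbing may differ slightly between sceptre versions):

```r
# Assign gRNAs to cells by thresholding: a cell is considered to
# contain a gRNA if that gRNA's UMI count in the cell is >= 3
sceptre_object <- assign_grnas(
  sceptre_object,
  method = "thresholding",
  threshold = 3
)
print(sceptre_object)  # prints the gRNA-to-cell assignment summary
```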
I attached a slide showing custom-generated histograms of the gRNA distribution in my data.
Will keep you posted, thanks! jiseok.pdf
Jiseok
`parallel = FALSE` on the cluster. There are somewhat hacky ways to do this within the framework of the sceptre package, but I am not sure it is worthwhile to go down this route on your data. Let me try to convince you that what you're currently doing is reasonable. Consider a given gRNA. Suppose this gRNA has a UMI count of >= 3 in 100 cells, a UMI count of 1-2 in 100 cells, and a UMI count of zero in ~80,000 cells. The cells with a UMI count of 1-2 are going to exert a negligible impact because they are "swamped" in number by the 80,000 cells with a UMI count of zero. Thus, removing the cells with a UMI count of 1-2 from the control group will have essentially zero impact on the p-value. (I've spent some time looking into this kind of phenomenon on other datasets.)

The hacky way to do this within sceptre would involve looping over gRNAs, rerunning QC separately for each gRNA. We could discuss this approach in more detail if you like, but to me this seems pretty low priority in comparison to some of the other analysis tasks that remain. Just my two cents!
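To make the argument concrete, here is a small base-R sketch (the toy `grna_matrix` and the threshold of 3 are assumptions) that counts, for one gRNA, how many cells fall into each of the three groups described above:

```r
# Toy gRNA-by-cell UMI count matrix; rows are gRNAs, columns are cells
set.seed(1)
grna_matrix <- matrix(rpois(5 * 1000, lambda = 0.05),
                      nrow = 5,
                      dimnames = list(paste0("grna_", 1:5), NULL))

counts <- grna_matrix["grna_1", ]
n_assigned     <- sum(counts >= 3)          # treated cells (UMI >= 3)
n_intermediate <- sum(counts %in% c(1, 2))  # ambiguous cells (UMI 1-2)
n_zero         <- sum(counts == 0)          # control cells (UMI 0)

# The intermediate cells are typically a tiny fraction of the control
# group, which is why excluding them barely moves the p-value
c(n_assigned, n_intermediate, n_zero)
```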
Hi Tim, before other things, quick questions:

2. I tried running sceptre on my local Mac, but it gave me this error when trying to run `import_data()`. My Mac is a 2023 M2 Pro with 16 GB of memory 😭 Do you think the variable sizes are too big? (screenshot attached)

Sorry, never mind; I removed that big 7.8 GB Seurat object and then it ran. 😅

`run_qc()`. Then my computer will fail at `run_calibration_check()`. It will totally die out of memory, and I can't even move the mouse cursor, so I have to force-restart the computer. 😅 I set `parallel = TRUE` when there is an option to do that. Am I doing something wrong?
Hi Jiseok,
I think I forgot to update the version number in the DESCRIPTION file of the package. Could you please try to install the package again? I think it should work now. You might want to uninstall the package before reinstalling (see here).
Nice!
Hm, sorry to hear that's happening. It looks like issues are appearing both locally and on the cluster, suggesting there might be an issue with the code (either the package or the analysis script).
When I analyzed your data a few months ago, I don't recall having run into any of these issues. 🤔
If possible, would you be able to send me (tbarry2@andrew.cmu.edu) your analysis script and data so that I can try to reproduce the error on my machine? Alternately, we might consider hopping on a brief Zoom call to debug. Thanks for your patience.
Hi Tim I sent you an email. My data has increased in size since your last test (30k->80k cells). Thanks for your prompt response I really appreciate it!
Hi Jiseok,
I was able to reproduce your error when running the discovery analysis. I think that the Apple silicon Macs sometimes run out of memory when too many processors are in use. Anyway, I set `n_processors = 3` in all functions in which I set `parallel = TRUE`, and this seemed to fix the problem. Would you mind giving that a try? There are other aspects of the analysis that might be good to discuss, but one step at a time. :)
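Concretely, that would look something like the following (a sketch; the object name is assumed, and it's worth checking your version's documentation for the exact argument names):

```r
# Cap the number of worker processes to limit peak memory on Apple silicon
sceptre_object <- run_calibration_check(
  sceptre_object,
  parallel = TRUE,
  n_processors = 3
)
sceptre_object <- run_discovery_analysis(
  sceptre_object,
  parallel = TRUE,
  n_processors = 3
)
```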
Hi Tim, quick update:

- `response_p_mito` is now computed from my data; I can see the covariate in the object summary after running `import_data()`.
- `n_processors = 3` is letting my Mac run `run_calibration_check()` so far. It's been running for an hour and is about half done (17,627 pairs total). It's using ~15 GB of my 16 GB of memory though, so it's thrilling 😅
- On the cluster I'm running with `parallel = FALSE`. It's been running for 10 hrs... According to the timestamps it took ~5 hrs to finish `run_calibration_check()`, so I don't think setting `parallel = FALSE` on the cluster is making things faster. However, it did compute all 17,627 of 17,627 negative pairs!

`parallel = FALSE` fixed the bug, which is helpful information.

Hi Jiseok,
I made a small commit that should make using the parallel functionality a bit smoother based on our discussion. Thanks again for the feedback.
Did the script finish running on your local?
Hi Tim, good news!
Yes, it finished on my local Mac; outputs attached: sceptre_outputs_localmac. I think until a cluster-compatible sceptre is released, I will stick to running locally.

One question about plot_run_discovery_analysis.png: why isn't it showing red dots for the neg control pairs in the QQ plot?

If you use `construct_trans_pairs()`, then all pairs (including the PC pairs) are included in the discovery set.

Just a quick addition, FYI: running on the cluster with `parallel = FALSE` produced results identical to the local run (same number of pairs computed, same number of significant discovery pairs, etc.). It just took longer (6 hr locally vs. 1 day 10 hrs on the cluster with 500 GB of RAM).
Hi Jiseok, that's good to hear, and thanks for this update. I'll just note that you very likely do not need 500 GB of RAM on the cluster. I am guessing that ~16 GB would suffice (when `parallel` is set to `FALSE`).
Closing.
Hi,
First of all, thank you SO much for releasing the updated sceptre 0.9.1. I love all the added functionality that makes it easier to track all the details. And the tutorial book is awesome. It's a tremendous amount of work that will benefit many people analyzing CRISPR screens.
I'm trying to analyze my dataset of ~80k cells and 85 gRNAs (including, unfortunately, just ONE negative control gRNA), with ~752,900 discovery pairs after pairwise QC.
I ran into several questions/issues:
1) I am submitting my sceptre R script as a SLURM job with a 100 GB memory request, but the job keeps failing around the calibration check or discovery analysis due to an out-of-memory error. I am re-trying with 200 GB. Is this expected? Is this much memory and running time required?
2) `import_data()` automatically computes `response_p_mito`, but after importing my data, I don't see the `response_p_mito` covariate. I use the mouse genome, and mitochondrial gene names should start with "mt-" instead of "MT-" (human). Maybe this is why?

3) In `run_qc()`, are these three covariates the only ones I can use to QC cells: `response_n_umis_range`, `response_n_nonzero_range`, `p_mito_threshold`? What if I have a covariate column named 'percent.mt' instead of 'response_p_mito'? I added a custom 'percent.mt' column using Seurat because `import_data()` didn't automatically compute `response_p_mito`.

4) I found that the number of negative pairs tested in `run_calibration_check()` varies a lot between runs (of the same data with the same code). Message I got:
And then after that I got "N negative control pairs called as significant: XX/~900" one time, and "XX/~2700" another time, etc. Why would this happen? Why not test all 17,627?
5) Finally, I wonder whether you could add an option to define control cells as those having 0 gRNAs for the target. When doing `assign_grnas()`, I use the thresholding method with `threshold = 3`. But I want the control cells to be cells with absolutely 0 gRNAs of the target, not just fewer than 3, because I'm worried that even 1 or 2 gRNA reads might still do something mild in the cell (especially since it's CRISPR KO, not CRISPRi).

Again, thank you so much for this wonderful package, and thanks for your time going through my lengthy post.
Best, Jiseok