Katsevich-Lab / sceptre

An R package for single-cell CRISPR screen data analysis emphasizing statistical rigor, massive scalability, and ease of use.
https://katsevich-lab.github.io/sceptre/
GNU General Public License v3.0

Running sceptre 0.9.1 several questions #68

Closed · redbybeing closed this issue 8 months ago

redbybeing commented 8 months ago

Hi,

First of all, thank you SO much for releasing the updated sceptre 0.9.1. I love all the added functionality that makes it easier to track the details, and the tutorial book is awesome. It's a tremendous amount of work that will benefit many people analyzing CRISPR screens.

I'm trying to analyze my dataset of ~80k cells and 85 gRNAs (including, unfortunately, just ONE negative control gRNA), with ~752,900 discovery pairs after pairwise QC.

I ran into several questions/issues:

1) I am submitting my sceptre R script as a SLURM job with a 100 GB memory request, but the job keeps failing around the calibration check or discovery analysis due to an out-of-memory error. I am retrying with 200 GB. Is this expected? Does the analysis require this much memory and running time?

2) import_data() automatically computes response_p_mito, but after importing my data I don't see the response_p_mito covariate. I'm using the mouse genome, where mitochondrial gene names start with mt- instead of the human MT-. Could this be why?

3) In run_qc(), are these three arguments the only ones I can use to QC cells on covariates: response_n_umis_range, response_n_nonzero_range, and p_mito_threshold? What if I have a covariate column named 'percent.mt' instead of 'response_p_mito'? I added a custom 'percent.mt' column using Seurat because import_data() didn't automatically compute response_p_mito.

4) I found that the number of negative control pairs tested in run_calibration_check() varies a lot between runs (of the same data with the same code). Here is the message I got:

Constructing negative control pairs. ✓ Note: Unable to generate the number of negative control pairs (752900) requested. Generating as many negative control pairs (17627) as possible.

And then after that I got "N negative control pairs called as significant: XX/~900" one time, "XX/~2700" another time, etc. Why would this happen? Why not test all 17,627?

5) Finally, I wonder whether you could add an option to define control cells as those having zero gRNA UMIs for the target. In assign_grnas(), I use the thresholding method with threshold = 3 (sketched below). But I want the control cells to be cells with absolutely zero UMIs of the target gRNA, not just fewer than 3, because I'm worried that even 1 or 2 gRNA reads might still mildly affect the cell (especially with CRISPR KO, as opposed to CRISPRi).
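
For context, the assignment call in question looks roughly like this (a sketch; the method and threshold arguments follow the thresholding approach described above):

```r
# Assign gRNAs to cells, calling a gRNA "present" in a cell
# when its UMI count in that cell is >= 3.
sceptre_object <- assign_grnas(
  sceptre_object,
  method = "thresholding",
  threshold = 3
)
```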

Again, thank you so much for this wonderful package and thanks for your time going through my lengthy post.

Best, Jiseok

redbybeing commented 8 months ago

Update: 200 GB memory still failed. It's strange. I'm currently running again with 500 GB; I can let you know whether it completes and how long it takes. The job reports:

Job ID: 24787044
Cluster: longleaf
User/Group: jiseokl/users
State: OUT_OF_MEMORY (exit code 0)
Cores: 1
CPU Utilized: 01:01:55
CPU Efficiency: 98.46% of 01:02:53 core-walltime
Job Wall-clock time: 01:02:53
Memory Utilized: 119.01 GB
Memory Efficiency: 119.01% of 100.00 GB

Job ID: 24825030
Cluster: longleaf
User/Group: jiseokl/users
State: OUT_OF_MEMORY (exit code 0)
Cores: 1
CPU Utilized: 02:20:54
CPU Efficiency: 98.37% of 02:23:14 core-walltime
Job Wall-clock time: 02:23:14
Memory Utilized: 268.44 GB
Memory Efficiency: 134.22% of 200.00 GB

timothy-barry commented 8 months ago

Hi Jiseok, thanks for the encouraging words! 🙂

  1. No, it should not take 100 GB to analyze your data. Have you tried analyzing your data locally (e.g., on a laptop or desktop)? I recall analyzing your data on my laptop a few months ago; it took about an hour and used only a few GB of memory. Are you setting parallel to TRUE when you run on SLURM? Unfortunately, sceptre is not currently configured to run in parallel on clusters, and this could be causing problems.

  2. Yes, import_data() looks for gene names starting with “MT” as opposed to “mt”. Good catch. I just released an update (0.9.2) that, among other things, fixes this bug: now any gene name starting with “MT-” or “mt-” is considered mitochondrial. Could you try again and let me know if the fix worked?

  3. Yes, response_n_umis, response_n_nonzero, and response_p_mito are the covariates that sceptre uses by default for cellwise QC. However, run_qc() also accepts additional_cells_to_remove, which lets you specify any additional cells to remove (by index). So if you wanted to use the percent.mt column, you could determine the indices of the cells whose percent.mt value exceeds some threshold and pass those indices to run_qc(); see the sketch after this list.

  4. I’ve not seen this behavior before; all 17,627 pairs should be tested. Are you seeing this on the cluster? If so, are you setting parallel = TRUE? If you have access to a Mac laptop or desktop, it might be a good idea to run your analysis on that machine (setting parallel = TRUE) before moving on to the cluster.

  5. Could you please remind me whether you are in low- or high-MOI, and what the sceptre-estimated MOI of your dataset is? I had not realized that you are using CRISPRko. Understanding a bit more about your data will help me answer this question.
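
To make point 3 concrete, here is a minimal sketch of that workflow. It assumes a Seurat object named seurat_obj carrying your custom percent.mt column and a sceptre object named sceptre_object (both names are placeholders):

```r
# Identify cells whose mitochondrial percentage exceeds a chosen cutoff,
# then exclude them via run_qc()'s additional_cells_to_remove argument.
# Note: the indices must refer to the cell ordering used by the sceptre object,
# so this assumes the Seurat and sceptre objects store cells in the same order.
p_mito_cutoff <- 20  # placeholder threshold; choose a value appropriate for your data
cells_to_remove <- which(seurat_obj@meta.data$percent.mt > p_mito_cutoff)
sceptre_object <- run_qc(
  sceptre_object,
  additional_cells_to_remove = cells_to_remove
)
```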

Thanks for the feedback. This kind of feedback is extremely useful as we try to bring the package to a more stable, mature state.

Cheers, Tim

redbybeing commented 8 months ago

Hi Tim! 👋

1) So it looks like running the job on the cluster could be causing multiple problems. I will try running both on my personal Mac laptop (setting parallel = TRUE) and on the cluster (setting parallel = FALSE in all functions that have that parameter). But eventually we would like to run everything on the cluster, since that's where all the large data files are stored and other lab members need access too, not just me.

2) My data is definitely high-MOI. Below is what sceptre computed after assign_grnas() with threshold = 3, but no matter what, I consider my dataset high-MOI. And yes, it's CRISPR KO, not CRISPRi, where the effect can be more inconsistent than in CRISPRi.

gRNA-to-cell assignment information:
• Assignment method: thresholding
• Mean N cells per gRNA: 901.75
• Mean N gRNAs per cell (MOI): 0.87

I attached a slide with histograms of the gRNA distribution in my data that I generated myself; a sketch of how such histograms might be generated follows.
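
For reference, a minimal way such histograms might be generated from a gRNA-by-cell UMI count matrix (grna_matrix is a placeholder name; gRNAs as rows, cells as columns):

```r
# Histogram of gRNAs detected per cell (MOI) and of cells per gRNA,
# calling a gRNA present in a cell when its UMI count is >= 3.
grnas_per_cell <- colSums(grna_matrix >= 3)
cells_per_grna <- rowSums(grna_matrix >= 3)
hist(grnas_per_cell, breaks = 20, main = "gRNAs per cell (MOI)")
hist(cells_per_grna, breaks = 20, main = "Cells per gRNA")
```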

Will keep you posted, thanks!

Attachment: jiseok.pdf

Jiseok

timothy-barry commented 8 months ago
  1. OK, that makes sense. We will be releasing a Nextflow pipeline in the near future that will enable users to run sceptre in parallel on a cluster. I'll let you know when that becomes available. Please do let me know what happens when you set parallel = FALSE on the cluster.
  2. Thanks for the information. So just to clarify, for a given gRNA, you'd prefer the control group to consist of all cells containing a UMI count of exactly zero for that gRNA?

There are somewhat hacky ways to do this within the framework of the sceptre package, but I am not sure it is worthwhile to go down this route on your data. Let me try to convince you that what you're currently doing is reasonable. Consider a given gRNA. Suppose this gRNA has a UMI count of >= 3 in 100 cells, a UMI count of 1-2 in 100 cells, and a UMI count of zero in ~80,000 cells. The cells with a UMI count of 1-2 are going to exert a negligible impact because they are "swamped" in number by the 80,000 cells with a UMI count of zero. Thus, removing the cells with a UMI count of 1-2 from the control group will have essentially zero impact on the p-value. (I've spent some time looking into this kind of phenomenon on other datasets.)
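
A toy simulation (illustrative only, not sceptre internals) makes the point concrete; the counts mirror the hypothetical numbers above:

```r
# 100 strongly perturbed cells vs. a control group of 80,000 zero-UMI cells
# plus 100 mildly affected cells: dropping the 100 mild cells from the control
# group barely moves the p-value, because they are vastly outnumbered.
set.seed(1)
expr_perturbed <- rpois(100, lambda = 1.5)    # cells with gRNA UMI count >= 3
expr_zero      <- rpois(80000, lambda = 2.0)  # cells with a gRNA UMI count of zero
expr_mild      <- rpois(100, lambda = 1.8)    # cells with a gRNA UMI count of 1-2
p_with_mild    <- wilcox.test(expr_perturbed, c(expr_zero, expr_mild))$p.value
p_without_mild <- wilcox.test(expr_perturbed, expr_zero)$p.value
c(p_with_mild, p_without_mild)  # nearly identical
```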

The hacky way to do this within sceptre would involve looping over gRNAs and rerunning QC separately for each gRNA. We could discuss this approach in more detail if you like, but to me it seems low priority compared to some of the other analysis tasks that remain. Just my two cents!

redbybeing commented 8 months ago

Hi Tim, before anything else, a couple of quick questions:

  1. How do I install 0.9.2? I tried the following, and it installed 0.9.1 on my local Mac:
     install.packages("devtools")
     devtools::install_github("katsevich-lab/sceptre")

  2. I tried running sceptre on my local Mac, but it gave me this error when trying to run import_data(). My Mac is a 2023 M2 Pro with 16 GB memory 😭 Do you think the variable sizes are too big? (screenshots attached)

Sorry, never mind- I removed that big 7.8 GB Seurat object and then it ran. 😅

  3. Okay, so on my local Mac I am able to run everything up to run_qc(). Then my computer fails at run_calibration_check(): it totally dies out of memory, and I can't even move the mouse cursor, so I have to force-restart the computer. 😅 I set parallel = TRUE wherever there is an option to do so. Am I doing something wrong?
timothy-barry commented 8 months ago

Hi Jiseok,

  1. I think I forgot to update the version number in the DESCRIPTION file of the package. Could you please try to install the package again? I think it should work now. You might want to uninstall the package before reinstalling (see here); a minimal sequence is sketched after this list.

  2. Nice!

  3. Hm, sorry to hear that's happening. It looks like issues are appearing both locally and on the cluster, suggesting there might be an issue with the code (either the package or the analysis script).
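
For point 1, a minimal reinstall sequence (assuming devtools is installed; restarting R first avoids a stale loaded namespace):

```r
# Remove the existing installation, then pull the latest version from GitHub.
remove.packages("sceptre")
devtools::install_github("katsevich-lab/sceptre")
packageVersion("sceptre")  # should report 0.9.2 once the DESCRIPTION fix is in
```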

When I analyzed your data a few months ago, I don't recall having run into any of these issues. 🤔

If possible, would you be able to send me (tbarry2@andrew.cmu.edu) your analysis script and data so that I can try to reproduce the error on my machine? Alternatively, we might consider hopping on a brief Zoom call to debug. Thanks for your patience.

redbybeing commented 8 months ago

Hi Tim, I sent you an email. My data has increased in size since your last test (30k → 80k cells). Thanks for your prompt response; I really appreciate it!

timothy-barry commented 8 months ago

Hi Jiseok,

I was able to reproduce your error when running the discovery analysis. I think the Apple silicon Macs sometimes run out of memory when too many processors are in use. Anyway, I set n_processors = 3 in all functions in which I set parallel = TRUE, and this seemed to fix the problem; a sketch is below. Would you mind giving that a try? There are other aspects of the analysis that might be good to discuss, but one step at a time. :)
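
Concretely, the workaround looks like this (a sketch, with run_calibration_check() as one example of the functions that accept these arguments):

```r
# Cap the number of worker processes to avoid exhausting memory on Apple silicon.
sceptre_object <- run_calibration_check(
  sceptre_object,
  parallel = TRUE,
  n_processors = 3
)
```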

redbybeing commented 8 months ago

Hi Tim, quick update:

  1. 0.9.2 installed, and it successfully computes response_p_mito from my data. I can see the covariate in the object summary after running import_data().
  2. Setting n_processors = 3 is letting my Mac run run_calibration_check() so far. It's been running for an hour and is about half done (17,627 pairs total). It's using ~15 GB of my 16 GB memory, though, so it's thrilling 😅
  3. I am also running on the cluster with parallel = FALSE. It's been running for 10 hrs... According to the timestamps it took ~5 hrs to finish run_calibration_check(), so I don't think setting parallel = FALSE on the cluster is making things faster. However, it did compute all 17,627 of 17,627 negative pairs!
timothy-barry commented 8 months ago
  1. Great!
  2. Great! Hopefully this run completed.
  3. OK, well it sounds like setting parallel = FALSE fixed the bug, which is helpful information.
timothy-barry commented 8 months ago

Hi Jiseok,

I made a small commit that should make using the parallel functionality a bit smoother based on our discussion. Thanks again for the feedback.

Did the script finish running on your local machine?

redbybeing commented 8 months ago

Hi Tim, good news!

  1. sceptre finished running successfully on my local Mac. It took ~6 hrs total (which seems reasonable) without crashing, and it computed all the negative control pairs and QC-ed discovery pairs it should have. I uploaded the outputs to the same Google Drive folder under sceptre_outputs_localmac. Until a cluster-compatible sceptre is released, I think I will stick to running locally.
  2. Is the small commit you mentioned for local running? If so, should I reinstall 0.9.2?
  3. In my plot_run_discovery_analysis.png, why isn't the QQ plot showing red dots for the negative control pairs?
  4. Are positive control pairs (e.g., Fxr1-Fxr1) part of the discovery result table if I didn't explicitly exclude them from the discovery pairs?
  5. Now I have room to discuss defining control cells. Let me compose another reply just for that.
timothy-barry commented 8 months ago
  1. Nice! Yes, 6 hours and <16 GB memory using 3 processors sounds about right for this dataset/analysis.
  2. Yes, you should feel free to reinstall 0.9.2, although you won't notice much of a difference.
  3. The reason is that the number of negative control pairs is less than the number of discovery pairs, so it does not make sense to superimpose the negative control p-values on top of the discovery p-values. One thing you might consider is downsampling the discovery pairs so that their number does not exceed the number of negative control pairs; see the sketch after this list.
  4. Yes, if you used construct_trans_pairs(), then all pairs (including the PC pairs) are included in the discovery set.
  5. OK, this sounds good.
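
A sketch of the downsampling idea from point 3, assuming the discovery results live in a data frame named discovery_results (a placeholder) and that 17,627 negative control pairs were tested:

```r
# Randomly downsample the discovery pairs so the two sets are comparable
# in size before overlaying them on a QQ plot.
set.seed(1)
n_neg <- 17627  # number of negative control pairs actually tested
downsampled <- discovery_results[sample(nrow(discovery_results), n_neg), ]
```
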
redbybeing commented 8 months ago

Just a quick addition, FYI: running on the cluster with parallel = FALSE produced results identical to the local run (the number of pairs computed is the same, the number of significant discovery pairs is the same, etc.). It just took longer (6 hrs locally vs. 1 day 10 hrs on the cluster with 500 GB RAM).

timothy-barry commented 8 months ago

Hi Jiseok, that's good to hear, and thanks for this update. I'll just note that you very likely do not need 500 GB RAM on the cluster. I am guessing that ~16 GB would suffice (when parallel is set to FALSE).

timothy-barry commented 8 months ago

Closing.