Katsevich-Lab / sceptre

An R package for single-cell CRISPR screen data analysis emphasizing statistical rigor, massive scalability, and ease of use.
https://katsevich-lab.github.io/sceptre/
GNU General Public License v3.0

fit_parametric_curve = FALSE and failure (issue #71)

Closed · redbybeing closed this issue 8 months ago

redbybeing commented 8 months ago

Hello! I'm back again :)

After previous successful runs of sceptre both on my local mac and on the cluster (with parallel = FALSE), I tried changing a parameter to improve my negative control calibration. The only thing I changed was fit_parametric_curve from TRUE to FALSE; resampling_mechanism was set to "permutations" and I didn't change that. After this change, RStudio on my local mac keeps failing at the run_calibration_check step. So I tried on the cluster (with parallel = FALSE and 100 GB of memory) and it still fails. Does setting fit_parametric_curve = FALSE make a big difference in the run? I kept all other conditions the same, and it had worked well a few weeks ago.

Best, Jiseok

timothy-barry commented 8 months ago

Hi Jiseok,

It is good to hear from you, and I hope you had a good break. :)

This is something that I have noticed myself recently. How many negative control / discovery pairs are you trying to test? Unfortunately, fit_parametric_curve = FALSE starts to break down when the number of pairs exceeds ~50,000. You might consider reducing the number of pairs that you are testing while keeping fit_parametric_curve = FALSE.
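
For instance, a rough sketch of what I mean, reusing the object names from your script (adjust as needed):

# Downsample the discovery pairs before switching off the parametric curve fit.
discovery_pairs_small <- discovery_pairs |> dplyr::sample_n(10000)
sceptre_object <- set_analysis_parameters(
  sceptre_object = sceptre_object,
  positive_control_pairs = positive_control_pairs,
  discovery_pairs = discovery_pairs_small,
  fit_parametric_curve = FALSE,
  resampling_mechanism = "permutations"
)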

redbybeing commented 8 months ago

Hi Tim,

Hope you had a good break too! Best wishes for the new year.

I have 16,064 negative control pairs and 685,353 discovery pairs (after pairwise QC). I tried running only up to run_calibration_check to see whether fit_parametric_curve = FALSE would improve the negative control calibration, but it fails there.

Jiseok

timothy-barry commented 8 months ago

Hm, I see. That is a bit surprising. What happens when you reduce the number of negative control pairs to (say) 100? Note that you can set this parameter in run_calibration_check.
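
For example (reusing the sceptre_object from your script):

sceptre_object <- run_calibration_check(
  sceptre_object = sceptre_object,
  n_calibration_pairs = 100
)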

redbybeing commented 8 months ago

I just tried with n_calibration_pairs = 100 (on my local mac) and it still failed 🤔 I think it fails precisely at this step:

(Two screenshots attached, 2024-01-09, showing the failing step and the error message.)

timothy-barry commented 8 months ago

Got it. Thanks for pointing this out. It seems to be a bug. I still have your data and analysis script; I'll try to reproduce the bug and get back to you!

redbybeing commented 8 months ago

Thanks, Tim. Please note that the data I gave you before has fewer than half the cells (~30k) of my current data (~80k).

timothy-barry commented 8 months ago

Hi Jiseok,

I believe that you sent me an updated version of your data. This is what the sceptre_object looks like:

Attributes of the data:
    • 88183 cells
    • 32285 responses
    • High multiplicity-of-infection 
    • 84 targeting gRNAs (distributed across 42 targets) 
    • 1 non-targeting gRNAs 
    • 6 covariates (batch, grna_n_nonzero, grna_n_umis, response_n_nonzero, response_n_umis, response_p_mito)

Is this correct?

redbybeing commented 8 months ago

Oh, that's great! Yes. One note: 'batch' in the covariates could be 'batch_num' instead, but I'm not sure (it's just a name carried over from my original Seurat object).

timothy-barry commented 8 months ago

OK, nice.

I tried running the code that you previously sent to me, adjusting the parallel settings in run_calibration_check(). (Hopefully you don't mind my pasting your code here.) Are you able to run the following without issue? It worked for me. Note that data_dir will have to be updated on your machine.

options(stringsAsFactors = FALSE)
library(sceptre)
packageVersion('sceptre') # Latest version 0.9.2

# References:  
# https://katsevich-lab.github.io/sceptre/articles/sceptre.html
# https://timothy-barry.github.io/sceptre-book/
# https://timothy-barry.github.io/sceptre-book/import-data.html#import-data-from-a-collection-of-r-objects

# Directory to save files
data_dir <- "/Users/timbarry/research_offsite/external/user-data/"

# 1) Import data R matrices
gex <- readRDS(paste0(data_dir, "A04_gex_SingleR_anno_50k.rds"))
# 1. response_matrix
response_matrix <- gex@assays$RNA$counts
dim(response_matrix)
# 2. grna_matrix
grna_matrix <- gex@assays$crispr$counts
dim(grna_matrix)
# 3. extra_covariates
extra_covariates <- gex@meta.data[,"batch_num", drop = FALSE] |> dplyr::rename("batch" = "batch_num")
head(extra_covariates)
# 4. grna_target_data_frame
gRNAs <- rownames(grna_matrix)
gRNAtargets <- gsub('.{2}$', '', gRNAs) # strip the 2-character gRNA suffix (e.g., "-1") to get target names
gRNAtargets[gRNAtargets == "Setd8"] <- "Kmt5a" # Target name should match the name in the response matrix
gRNAtargets[gRNAtargets == "Whsc1l1"] <- "Nsd3" # Target name should match the name in the response matrix
grna_target_data_frame <- data.frame(grna_id = gRNAs, grna_target = gRNAtargets)
grna_target_data_frame[grna_target_data_frame$grna_id=="GFP-1", "grna_target"] <- "non-targeting" 
head(grna_target_data_frame)
# 5. moi
moi <- "high"
# 6. clear stuff
rm(gex,gRNAs,gRNAtargets); gc()

# Setup sceptre object
sceptre_object <- import_data(
  response_matrix = response_matrix,
  grna_matrix = grna_matrix,
  grna_target_data_frame = grna_target_data_frame,
  moi = moi,
  extra_covariates = extra_covariates
)
gc()
print(sceptre_object)

# 2) Set analysis parameters 
# 1. Positive control pairs
positive_control_pairs <- construct_positive_control_pairs(
  sceptre_object = sceptre_object
)
head(positive_control_pairs)
# 2. Discovery pairs
discovery_pairs <- construct_trans_pairs(
  sceptre_object = sceptre_object,
  pairs_to_exclude = "none"
) |> dplyr::sample_n(5000) # sample 5000 pairs for now
head(discovery_pairs)
# 3. set analysis parameters
sceptre_object <- set_analysis_parameters(
  sceptre_object = sceptre_object,
  positive_control_pairs = positive_control_pairs,
  discovery_pairs = discovery_pairs,
  fit_parametric_curve = FALSE,
  resampling_mechanism = "permutations"
)
print(sceptre_object)

# 3) Assign gRNAs
sceptre_object <- assign_grnas(
  sceptre_object = sceptre_object,
  method = "thresholding",  
  threshold = 3
)
print(sceptre_object)
plot(sceptre_object)

# 4) Run quality control
plot_covariates(sceptre_object) # p-mito appears to already be clipped at 0.1; it would be best to do QC within sceptre
sceptre_object <- run_qc(
  sceptre_object = sceptre_object
)
print(sceptre_object)
plot(sceptre_object)

# 5) Calibration check
sceptre_object <- run_calibration_check(
  sceptre_object = sceptre_object,
  n_calibration_pairs = 100, parallel = TRUE, n_processors = 2
)
print(sceptre_object)

timothy-barry commented 8 months ago

(This is a slightly modified version of your code.)

redbybeing commented 8 months ago

Oh wow, yes, it ran quickly without an error on my mac! Then will it run well with the original ~16k negative control pairs and ~685k discovery pairs?

(Screenshot attached, 2024-01-10, showing the successful run.)

timothy-barry commented 8 months ago

That's great to hear.

But do we have a sense as to why you were encountering an error originally? Can you tell whether the code that you were using differs in some important way from the code that I pasted?

I would not expect the fit_parametric_curve = FALSE method to work beyond ~10,000–50,000 pairs. Is there a way to prioritize your discovery pairs such that you can analyze the ~10,000 most important pairs?
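
For instance, one possibility (just a sketch, reusing response_matrix and discovery_pairs from your script) would be to restrict the discovery pairs to the most highly expressed responses:

# Keep only the pairs whose response is among the most highly expressed genes.
# With ~42 targets, ~250 responses comes out to roughly 10,000 pairs.
top_responses <- names(sort(Matrix::rowSums(response_matrix), decreasing = TRUE))[1:250]
discovery_pairs_top <- discovery_pairs |> dplyr::filter(response_id %in% top_responses)
nrow(discovery_pairs_top)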

redbybeing commented 8 months ago

Ah, I think I found it. Only after I added |> dplyr::sample_n(5000) # sample 5000 pairs for now did the code run all the way to the calibration check. So even if I don't actually run the discovery analysis, I should reduce the discovery pairs in order for the calibration check to run successfully?

discovery_pairs <- construct_trans_pairs(
  sceptre_object = sceptre_object,
  positive_control_pairs = positive_control_pairs,
  pairs_to_exclude = "none"
) |> dplyr::sample_n(5000) # sample 5000 pairs for now

redbybeing commented 8 months ago

Well, as of now I am doing a sort of large-scale, blind trans-analysis, and I don't have a way to prioritize my discovery pairs down to ~10,000. My original concern was severe miscalibration of the negative control pairs:

(Calibration plot attached.)

N negative control pairs called as significant: 1017/17627. So I wanted to try setting fit_parametric_curve = FALSE to improve the calibration. If it won't work for this many pairs, I might have to just accept my data as is?

timothy-barry commented 8 months ago

> So even if I don't actually run the discovery analysis, I should reduce the discovery pairs in order for the calibration check to run successfully?

Got it. That would be ideal, yes.

> If it won't work for this many pairs, I might have to just accept my data as is?

Unfortunately, I'm not quite sure what is causing the miscalibration that we are observing here. If I recall correctly, the calibration on your smaller dataset was very good. I do think it would be a good idea to continue going through the steps for improving calibration outlined in the book. If calibration does not improve, it may be possible to empirically select the FDR threshold using the negative control p-values.
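
To sketch what I mean by empirically selecting the threshold (this is an illustration, not a built-in sceptre feature; I'm assuming the p-values can be pulled out of the results, e.g. with get_result()):

# Hypothetical sketch: for each candidate cutoff t, estimate the FDR as the
# expected number of false discoveries (calibrated on the negative controls)
# divided by the observed number of discoveries at t; then take the largest
# cutoff that stays below the target level.
neg_pvals <- get_result(sceptre_object, analysis = "run_calibration_check")$p_value
disc_pvals <- get_result(sceptre_object, analysis = "run_discovery_analysis")$p_value
cutoffs <- 10^seq(-8, -1, length.out = 50) # candidate cutoffs on a log grid
emp_fdr <- vapply(cutoffs, function(t) {
  mean(neg_pvals <= t) * length(disc_pvals) / max(1, sum(disc_pvals <= t))
}, numeric(1))
target_fdr <- 0.1
chosen_cutoff <- max(c(0, cutoffs[emp_fdr <= target_fdr])) # 0 if none qualifies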

redbybeing commented 8 months ago

Thank you very much, Tim. For now I will keep in mind that fit_parametric_curve = FALSE works for a smaller number of pairs, and I will try the other steps to improve calibration.

Best, Jiseok