abelson-lab / scATOMIC

Pan-Cancer Single Cell Classifier
MIT License
57 stars 5 forks source link

Error run_scATOMIC #5

Closed ccruizm closed 1 year ago

ccruizm commented 1 year ago

Hello!

Super nice tool! I am trying to apply it to my dataset but I am encountering the next error:

Processing sample: Brain_tumor 
[1] "Starting Layer 1"
[1] "Done Layer 1"
[1] "Starting Layer 2 Non Blood"
[1] "Done Layer 2 Non Blood"
[1] "Starting Layer 3 Non Stromal"
[1] "Done Layer 3 Non Stromal"
[1] "Starting Layer 4 Non GI"
[1] "Done Layer 4 Non GI"
[1] "Starting Layer 5 Breast Lung Prostate"
[1] "nothing to score in this layer"
[1] "Done Layer 5 Breast Lung Prostate"
[1] "Starting Layer 5 Ovarian Endometrial Kidney"
[1] "nothing to score in this layer"
[1] "Done Layer 5 Ovarian Endometrial Kidney"
[1] "Starting Layer 4 Soft Tissue Neuro"
[1] "Done Layer 4 Soft Tissue Neuro"
[1] "Starting Layer 5 Soft Tissue Neuro"
[1] "Done Layer 5 Soft Tissue Neuro"
[1] "Starting Layer 6 Brain NBM"
[1] "nothing to score in this layer"
[1] "Done Layer 6 Brain NBM"
[1] "Starting Layer 3 Normal Stromal"
[1] "Done Layer 3 Normal Stromal"
[1] "Starting Layer 2 Blood"
[1] "Done Layer 2 Blood"
[1] "Starting Layer 3 Myeloid"
[1] "Done Layer 3 Myeloid"
[1] "Starting Layer 4 Macrophage Monocyte"
[1] "nothing to score in this layer"
[1] "Done Layer 4 Macrophage Monocyte"
Warning message:
“The following features are not present in the object: MLF1IP, not searching for symbol synonyms”
Warning message:
“The following features are not present in the object: FAM64A, HN1, not searching for symbol synonyms”
Regressing out S.Score, G2M.Score

Centering and scaling data matrix

Error in if (major_cancer == "Bile Duct Cancer Cell") {: the condition has length > 1
Traceback:

1. create_summary_matrix(prediction_list = cell_predictions, use_CNVs = F, 
 .     modify_results = T, mc.cores = 12, raw_counts = sparse_matrix, 
 .     min_prop = 0.5)

The code I am using is:

 cell_predictions <- run_scATOMIC(sparse_matrix, mc.cores = 12)

  # Create summary matrix
  results <- create_summary_matrix(prediction_list = cell_predictions, use_CNVs = F, 
                                    modify_results = T,
                                    mc.cores = 12, raw_counts = sparse_matrix,
                                    min_prop = 0.5)

Setting min_prop to zero could fix the problem?

Thanks in advance!

inofechm commented 1 year ago

Hi Cristian, This is a bug when scATOMIC has predicted exactly the same number of cells as two different cancer types, it generally only happens in cases where very few cancer cells were predicted. Thank you for raising the issue I've actually just run into a similar one in a development version of scATOMIC v2 that I have patched so I think I can fix this bug with the same patch. I've updated the package to hopefully resolve this bug. Could you please re-install the package with install_github and let me know whether your code works now?

The un-converted annotations (mixed cancer types) should be found in results$layer_6 Since you already know it is a brain tumour, I recommend either post-hoc conversion of the cells that receive a different cancer label to brain cancer, or just removing them from the analysis as they might represent an unclear cell type for scATOMIC. You could use a code like: index_cancer_predicted <- which(results$scATOMIC_pred %in% c("Bile Duct Cancer Cell", "Bladder Cancer Cell", "Bone Cancer Cell", "Brain Cancer Cell", "Breast Cancer Cell", "Colon/Colorectal Cancer Cell", "Endometrial/Uterine Cancer Cell", "Esophageal Cancer Cell", "Gallbladder Cancer Cell", "Gastric Cancer Cell", "Kidney Cancer Cell", "Liver Cancer Cell", "Lung Cancer Cell", "Neuroblastoma", "Ovarian Cancer Cell", "Pancreatic Cancer Cell", "Prostate Cancer Cell", "Sarcoma", "Skin Cancer Cell", "Breast/Lung/Prostate", "Ovarian/Endometrial/Kidney", "Endometrial Cancer Cell", "Biliary/Hepatic Cancer Cell", "Brain/Neuroblastoma Cancer Cell", "Digestive Tract Cancer Cell", "Soft Tissue Cancer Cell", "Soft Tissue or Neuro Cancer Cell", "Unclassified Soft Tissue or Neuro Cancer Cell", "ER+ Breast Cancer Cell", "HER2+ Breast Cancer Cell", "Her2+ Breast Cancer Cell", "TNBC Breast Cancer Cell", "Epithelial Cell", "GI Tract Cell")) results[index_cancer_predicted, "scATOMIC_pred"] <- "Brain Cancer Cell"

Best, Ido

ccruizm commented 1 year ago

Thanks for the speedy reply Ido. I will test it and let you know how it goes. Also, I appreciate the suggestion you made. I will implement it as well.

ccruizm commented 1 year ago

So far seems it is working! However I am facing another error

[1] "Starting Layer 1"
[1] "Done Layer 1"
[1] "Starting Layer 2 Non Blood"
Error in optim(par = c(mu1 = -1.17118298150295, sigma1 = -Inf, mu2 = -0.378059875049167, : non-finite value supplied by optim
Traceback:

1. run_scATOMIC(sparse_matrix, mc.cores = 12)
2. scATOMIC::classify_layer(rna_counts = rna_counts, cells_to_use = normal_tissue_cancer_predicted, 
 .     layer = "layer_2_non_blood", imputation = imputation, genes_in_model = top_genes_unlisted_layer_2_normal_tissue_cancer, 
 .     model = model_layer_2_normal_tissue_cancer, ref_based = ref_based, 
 .     mc.cores = mc.cores, unimodal_nsd = unimodal_nsd, bimodal_nsd = bimodal_nsd, 
 .     normalized_counts = normalized_counts)
3. get_auto_threshold(layer_predictions, mc.cores = mc.cores)
4. lapply(list_scores_for_cutoffs, scATOMIC::automatic_threshold, 
 .     unimodal_nsd = unimodal_nsd, bimodal_nsd = bimodal_nsd)
5. FUN(X[[i]], ...)
6. cutoff::em(score, "normal", "normal")
7. with(start, {
 .     while (abs(lambda0 - mean(lambda)) > t) {
 .         lambda <- mean(lambda)
 .         lambda0 <- lambda
 .         distr1 <- lambda * D1b(data, mu1, sigma1)
 .         distr2 <- (1 - lambda) * D2b(data, mu2, sigma2)
 .         lambda <- distr1/(distr1 + distr2)
 .         mLL2 <- function(mu1, sigma1, mu2, sigma2) return(mLL(mu1, 
 .             sigma1, mu2, sigma2, lambda, data, D1b, D2b))
 .         start <- as.list(log(c(mu1 = mu1, sigma1 = sigma1, mu2 = mu2, 
 .             sigma2 = sigma2)))
 .         out <- bbmle::mle2(mLL2, start, "Nelder-Mead")
 .         coef <- out@coef
 .         coef_n <- names(coef)
 .         names(coef) <- NULL
 .         for (i in 1:4) assign(coef_n[i], exp(coef[i]))
 .     }
 .     out <- list(lambda = lambda, param = exp(out@coef), D1 = D1, 
 .         D2 = D2, deviance = out@min, data = data, data_name = data_name, 
 .         out = out, t = t)
 .     class(out) <- "em"
 .     return(out)
 . })
8. with.default(start, {
 .     while (abs(lambda0 - mean(lambda)) > t) {
 .         lambda <- mean(lambda)
 .         lambda0 <- lambda
 .         distr1 <- lambda * D1b(data, mu1, sigma1)
 .         distr2 <- (1 - lambda) * D2b(data, mu2, sigma2)
 .         lambda <- distr1/(distr1 + distr2)
 .         mLL2 <- function(mu1, sigma1, mu2, sigma2) return(mLL(mu1, 
 .             sigma1, mu2, sigma2, lambda, data, D1b, D2b))
 .         start <- as.list(log(c(mu1 = mu1, sigma1 = sigma1, mu2 = mu2, 
 .             sigma2 = sigma2)))
 .         out <- bbmle::mle2(mLL2, start, "Nelder-Mead")
 .         coef <- out@coef
 .         coef_n <- names(coef)
 .         names(coef) <- NULL
 .         for (i in 1:4) assign(coef_n[i], exp(coef[i]))
 .     }
 .     out <- list(lambda = lambda, param = exp(out@coef), D1 = D1, 
 .         D2 = D2, deviance = out@min, data = data, data_name = data_name, 
 .         out = out, t = t)
 .     class(out) <- "em"
 .     return(out)
 . })
9. eval(substitute(expr), data, enclos = parent.frame())
10. eval(substitute(expr), data, enclos = parent.frame())
11. bbmle::mle2(mLL2, start, "Nelder-Mead")
12. do.call("optim", c(list(par = start, fn = objectivefunction, 
  .     method = method, hessian = FALSE, gr = objectivefunctiongr, 
  .     control = call$control, lower = call$lower, upper = call$upper), 
  .     arglist))
13. optim(par = c(mu1 = -1.17118298150295, sigma1 = -Inf, mu2 = -0.378059875049167, 
  . sigma2 = -5.02293932724777), fn = function (p) 
  . {
  .     if (browse_obj) 
  .         browser()
  .     l <- relist2(p, template)
  .     names(p) <- nstart[order(oo)]
  .     l[nfix] <- fixed
  .     if (vecpar) {
  .         l <- namedrop(l[nfull])
  .         l <- unlist(l)
  .         args <- list(l)
  .         args <- c(list(l), args.in.data)
  .     }
  .     else {
  .         args <- c(l, args.in.data)
  .     }
  .     if (namedrop_args) 
  .         args <- namedrop(args)
  .     do.call("minuslogl", args)
  . }, method = "Nelder-Mead", hessian = FALSE, gr = NULL, control = list(), 
  .     lower = -Inf, upper = Inf)

Where do you think the issue is?

ccruizm commented 1 year ago

Also after the update I am getting this error in samples I did not have an issue before

[1] "Starting Layer 1"
[1] "Done Layer 1"
[1] "Starting Layer 2 Non Blood"
[1] "Done Layer 2 Non Blood"
[1] "Starting Layer 3 Non Stromal"
Warning message in xtfrm.data.frame(x):
“cannot xtfrm data frames”
Warning message in xtfrm.data.frame(x):
“cannot xtfrm data frames”
[1] "Done Layer 3 Non Stromal"
[1] "Starting Layer 4 Non GI"
Warning message in xtfrm.data.frame(x):
“cannot xtfrm data frames”
Warning message in xtfrm.data.frame(x):
“cannot xtfrm data frames”
[1] "Done Layer 4 Non GI"
[1] "Starting Layer 5 Breast Lung Prostate"
[1] "nothing to score in this layer"
Warning message in xtfrm.data.frame(x):
“cannot xtfrm data frames”
Warning message in xtfrm.data.frame(x):
“cannot xtfrm data frames”
[1] "Done Layer 5 Breast Lung Prostate"
[1] "Starting Layer 2 Blood"
[1] "Done Layer 2 Blood"
[1] "Starting Layer 3 Myeloid"
[1] "Done Layer 3 Myeloid"
[1] "Starting Layer 4 Macrophage Monocyte"
[1] "nothing to score in this layer"
[1] "Done Layer 4 Macrophage Monocyte"
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“pseudoinverse used at -1.7634”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“neighborhood radius 0.30103”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“reciprocal condition number  5.4651e-16”
Warning message:
“The following features are not present in the object: MLF1IP, not searching for symbol synonyms”
Warning message:
“The following features are not present in the object: FAM64A, HN1, not searching for symbol synonyms”
Regressing out S.Score, G2M.Score

Centering and scaling data matrix

Error in apply(count_matrix_up_reg, 1, function(x) {: dim(X) must have a positive length
Traceback:

1. create_summary_matrix(prediction_list = cell_predictions, use_CNVs = F, 
 .     modify_results = T, mc.cores = 12, raw_counts = sparse_matrix, 
 .     min_prop = 0.5)
2. apply(count_matrix_up_reg, 1, function(x) {
 .     length(which(x != 0))
 . })
3. stop("dim(X) must have a positive length")
inofechm commented 1 year ago

Hi Cristian, Sorry for the bugs! I have patched the second bug you reported, and the create_summary_matrix function should now work on that dataset, please update and try again.

Regarding the bug with optim(par = c(mu1 = -1.17118298150295, sigma1 = -Inf, mu2 = -0.378059875049167, : non-finite value supplied by optim: I am not exactly sure what is going on, but I think that there are very high confidence scores being generated at the second classification layer that is causing an issue with the distribution for setting automatic cutoffs. I think it may be possible to just re-run the code and it may not come up again. Alternatively, you can avoid this error by setting the confidence_cutoff = F in run_scATOMIC() and in create_summary_matrix() and just bypass this cutoff entirely. Is this on a publicly available dataset? If I can take a look at what exactly is going on I will be better able to diagnose the issue.

Thanks!

ccruizm commented 1 year ago

No worries Ido. I appreciate how fast you are fixing the issues. Thansk for that :)

So far it has worked and have not encountered the previous problems. As you suggested re-running the code did not throw any error.

Now I am facing another error:


[1] "Starting Layer 1"
[1] "Done Layer 1"
[1] "Starting Layer 2 Non Blood"
[1] "Done Layer 2 Non Blood"
[1] "Starting Layer 3 Non Stromal"
[1] "Done Layer 3 Non Stromal"
[1] "Starting Layer 4 Non GI"
[1] "Done Layer 4 Non GI"
[1] "Starting Layer 5 Breast Lung Prostate"
[1] "nothing to score in this layer"
[1] "Done Layer 5 Breast Lung Prostate"
[1] "Starting Layer 3 Normal Stromal"
[1] "Done Layer 3 Normal Stromal"
[1] "Starting Layer 2 Blood"
[1] "Done Layer 2 Blood"
[1] "Starting Layer 3 TNK"
[1] "Done Layer 3 TNK"
[1] "Starting Layer 4 CD4 CD8"
[1] "nothing to score in this layer"
Warning message in parallel::mclapply(row.names(predictions), scATOMIC::class_for_cutoff, :
“scheduled core 3 encountered error in user code, all values of the job will be affected”
Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 404, 405
Traceback:

1. run_scATOMIC(sparse_matrix, mc.cores = 8)
2. scATOMIC::classify_layer(rna_counts = rna_counts, cells_to_use = CD4_CD8_cells_predicted, 
 .     layer = "layer_4_CD4_CD8", imputation = F, genes_in_model = top_genes_unlisted_layer_4_CD4_CD8, 
 .     model = model_layer_4_CD4_CD8, ref_based = ref_based, mc.cores = mc.cores, 
 .     unimodal_nsd = unimodal_nsd, bimodal_nsd = bimodal_nsd, normalized_counts = normalized_counts)
3. get_auto_threshold(layer_predictions, mc.cores = mc.cores)
4. cbind(na.omit(predictions), na.omit(cutoff_class))
5. cbind(deparse.level, ...)
6. data.frame(..., check.names = FALSE)
7. stop(gettextf("arguments imply differing number of rows: %s", 
 .     paste(unique(nrows), collapse = ", ")), domain = NA)
​```

Initially thought was related to the number of cores requested and reduced it but still same error. And it is a sample that has only 1800 cells. Hope you can work your magic to fix this one too :)

Thanks again!
inofechm commented 1 year ago

Hi again, I am glad some of these fixes are working, regarding this other error it seems like one cell has zero for all genes involved in this layer and leading to an error. I think I have added a work around to omit this cell from the get_auto_threshold function now if you want to update and try again on this dataset. Let me know if this works. Thanks for bringing all these bugs, it really helps make sure the tool is usable for everyone! Ido

ccruizm commented 1 year ago

I could get the predictions for all the samples! the fixes you made all work! thanks for that. I am closing this issue.

Great support!