UcarLab / AMULET

A count based method for detecting doublets from single nucleus ATAC-seq (snATAC-seq) data.
https://ucarlab.github.io/AMULET/
GNU General Public License v3.0
29 stars 5 forks source link

Troubles with running doublet annotation part #18

Open AtaiDobrynin opened 2 years ago

AtaiDobrynin commented 2 years ago

Hi, I tried to run your doublet annotation functions but I encountered several errors. Firstly, I tried to run your getMarkerPeaks function:

marker_peaks <- getMarkerPeaks(data, doublets = unlist(multiplets), 
                               n_peaks = 100, 
                               min_cells = 200)
Error in order(meta_peaks$avg_logFC, -meta_peaks$p_val_adj, decreasing = T): argument 1 is not a vector
Traceback:

1. getMarkerPeaks(data, doublets = unlist(multiplets), n_peaks = 100, 
 .     min_cells = 200)
2. meta_peaks[order(meta_peaks$avg_logFC, -meta_peaks$p_val_adj, 
 .     decreasing = T), ]
3. `[.data.frame`(meta_peaks, order(meta_peaks$avg_logFC, -meta_peaks$p_val_adj, 
 .     decreasing = T), )
4. order(meta_peaks$avg_logFC, -meta_peaks$p_val_adj, decreasing = T)

After that I generated marker peaks by myself and used them in your annotateDoublets function:

multiplet_annotations <- annotateDoublets(obj = data, marker_peaks = df, doublets = multiplets)
Error in rep(t_read_counts$ids[j], t_read_counts[j, row.names(probs)[i]]): invalid 'times' argument
Traceback:

1. annotateDoublets(obj = data, marker_peaks = df, doublets = multiplets)
2. getCellValues(obj, cells = Cells(obj), marker_peaks_set = marker_peaks, 
 .     doublets = doublets, k = k)
3. lapply(cells, function(cell) {
 .     neighbors <- names(obj@graphs$ATAC_nn[cell, obj@graphs$ATAC_nn[cell, 
 .         ] > 0])
 .     reads <- Matrix::as.matrix(subset(obj, cells = neighbors, 
 .         features = marker_peaks_set$peaks)@assays[["ATAC"]]@counts)
 .     no_clusters <- length(unique(marker_peaks_set$cluster))
 .     results = data.frame(matrix(nrow = 0, ncol = no_clusters + 
 .         1)) %>% `colnames<-`(value = c("cell_id", as.character(unique(marker_peaks_set$cluster))))
 .     results[cell, "cell_id"] = cell
 .     reads <- data.frame(apply(reads, 1, mean)) %>% `colnames<-`(value = cell)
 .     if (colSums(reads) == 0) {
 .         results[, -1] <- 0
 .         results[cell, "a.heterotypic"] <- NA
 .         results[cell, "a.homotypic"] <- NA
 .         return(results)
 .     }
 .     doublet_probs <- reads %>% getReadCountDistributions(marker_peaks_set, 
 .         .) %>% data.frame()
 .     results[cell, colnames(doublet_probs)] <- doublet_probs
 .     results[cell, "a.heterotypic"] <- paste(names(doublet_probs)[order(doublet_probs, 
 .         decreasing = T)[1:2]], collapse = ".")
 .     results[cell, "a.homotypic"] <- names(which.max(doublet_probs))
 .     return(results)
 . }) %>% do.call(rbind, .)
4. do.call(rbind, .)
5. lapply(cells, function(cell) {
 .     neighbors <- names(obj@graphs$ATAC_nn[cell, obj@graphs$ATAC_nn[cell, 
 .         ] > 0])
 .     reads <- Matrix::as.matrix(subset(obj, cells = neighbors, 
 .         features = marker_peaks_set$peaks)@assays[["ATAC"]]@counts)
 .     no_clusters <- length(unique(marker_peaks_set$cluster))
 .     results = data.frame(matrix(nrow = 0, ncol = no_clusters + 
 .         1)) %>% `colnames<-`(value = c("cell_id", as.character(unique(marker_peaks_set$cluster))))
 .     results[cell, "cell_id"] = cell
 .     reads <- data.frame(apply(reads, 1, mean)) %>% `colnames<-`(value = cell)
 .     if (colSums(reads) == 0) {
 .         results[, -1] <- 0
 .         results[cell, "a.heterotypic"] <- NA
 .         results[cell, "a.homotypic"] <- NA
 .         return(results)
 .     }
 .     doublet_probs <- reads %>% getReadCountDistributions(marker_peaks_set, 
 .         .) %>% data.frame()
 .     results[cell, colnames(doublet_probs)] <- doublet_probs
 .     results[cell, "a.heterotypic"] <- paste(names(doublet_probs)[order(doublet_probs, 
 .         decreasing = T)[1:2]], collapse = ".")
 .     results[cell, "a.homotypic"] <- names(which.max(doublet_probs))
 .     return(results)
 . })
6. FUN(X[[i]], ...)
7. reads %>% getReadCountDistributions(marker_peaks_set, .) %>% 
 .     data.frame()
8. data.frame(.)
9. getReadCountDistributions(marker_peaks_set, .)

I used ATAC peaks from 10X genomics and successfully generated file with possible multiplet barcodes, but on annotation step code stops working. Have you encountered this behavior? Thanks in advance!

alperoglu commented 2 years ago

Hi Atai,

Thank you for using the tool and bringing these to our attention. This pipeline was written for Seurat v3.0 and I have to update it to work with the updated functions from Seurat v4.0. The first error is likely due to that. In the new FindMarkersAll function from Seurat, the fold change values resulting DA on the peaks are returned with a column names "avg_log2FC". And this function is using the old column name. For the second error, i'm not sure at the moment. I'll rerun these functions on Signac's test data and find the bugs. And I'll keep you updated.

Thank you again for letting us now.

Best, Alper

alperoglu commented 2 years ago

Hi @AtaiDobrynin ,

I'm going through the code but I wasn't able to reproduce this error with the data that they have. It is possible that this is happening because you have negative or NA values in the raw counts slot of your Seurat object. I will handle those exceptions going forward, but I want to make sure that's the case for you as well and not something else. If you have time can you check if you have any NA or negative values in your count matrix?

AtaiDobrynin commented 2 years ago

Hi @alperoglu,

I checked my count matrix and didn't find any NA or negative values.

Best, Atai

alperoglu commented 2 years ago

Hi Atai,

I've updated the scripts used by the doublet annotation pipeline to the latest versions of Seurat and Signac, you can retry running it with these updates. Let me know if you still get this error or not, because I've included a part the error was occurring to capture it and identify the reason.

Thank you, Alper

AtaiDobrynin commented 2 years ago

Hi Alper,

Thank you for your reply, I will try your new version soon.

Sincerely, Atai

AtaiDobrynin commented 2 years ago

Hi Alper,

I tried to run newer version of your functions, but I still get some errors, sorry. Could you please share R vignette with data processing? For example, for open source data like PBMC 10X (https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-no-cell-sorting-10-k-1-standard-2-0-0)?

Sincerely, Atai

alperoglu commented 2 years ago

Hi Atai,

You can follow the Signac vignette here: https://satijalab.org/signac/articles/pbmc_vignette.html.

The functions that we're using and data structures are based on this R package. What are the errors that you're currently getting?

Best, Alper

AtaiDobrynin commented 2 years ago

Hi Alper,

Thanks for answering, I followed the vignette above. During the doublet annotation step I got this error `Error in if (G[1] == 1) {: missing value where TRUE/FALSE needed Traceback:

  1. annotateDoublets(obj = data, marker_peaks = marker_peaks, doublets = multiplets)
  2. lapply(clusters, function(clus) { . t.doublets <- doublets %>% subset(a.homotypic == clus, select = clusters) . t.profile <- singlet.profile[, clus] . t.dist <- apply(t.doublets, 1, function(cell) { . return(dist(rbind(cell, t.profile))) . }) . fit <- Mclust(t.dist, verbose = F) . t.class <- fit$classification . t.doublets$dist <- t.dist[rownames(t.doublets)] . t.doublets$class <- t.class[rownames(t.doublets)] . t.doublets$max.class <- names(which.max(fit[["parameters"]][["mean"]])) . t.doublets$num.class <- length(unique(fit$classification)) . return(t.doublets[, c("dist", "class", "max.class", "num.class")]) . }) %>% bind_rows()
  3. bind_rows(.)
  4. list2(...)
  5. lapply(clusters, function(clus) { . t.doublets <- doublets %>% subset(a.homotypic == clus, select = clusters) . t.profile <- singlet.profile[, clus] . t.dist <- apply(t.doublets, 1, function(cell) { . return(dist(rbind(cell, t.profile))) . }) . fit <- Mclust(t.dist, verbose = F) . t.class <- fit$classification . t.doublets$dist <- t.dist[rownames(t.doublets)] . t.doublets$class <- t.class[rownames(t.doublets)] . t.doublets$max.class <- names(which.max(fit[["parameters"]][["mean"]])) . t.doublets$num.class <- length(unique(fit$classification)) . return(t.doublets[, c("dist", "class", "max.class", "num.class")]) . })
  6. FUN(X[[i]], ...)
  7. Mclust(t.dist, verbose = F)
  8. eval(mc, parent.frame())
  9. eval(mc, parent.frame())
  10. mclustBIC(data = structure(numeric(0), .Dim = 0:1), verbose = F)
HanneHjorthaug commented 2 years ago

Hi,

I am seeing the same error as Atai, and hope that the information below could help in finding out what’s going on. Before describing that error, I will start with another error that I have met: I have run the Amulet code implemented in scDblFinder, using R (on Windows), and have a vector with doublets that I am trying to visualize together with my Seurat object, following your Doublet Annotation Pipline vignette. The getMarkerPeaks() function works, and my output looks like the marker_peaks object in your vignette. When first trying to run annotateDoublets() I got this error:

Error in h(simpleError(msg, call)) : error in evaluating the argument 'args' in selecting a method for function 'do.call': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': argument is of length zero 13.h(simpleError(msg, call)) 12..handleSimpleError(function (cond) .Internal(C_tryCatchHelper(addr, 1L, cond)), "error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': argument is of length zero", base::quote(h(simpleError(msg, call)))) 11.h(simpleError(msg, call)) 10..handleSimpleError(function (cond) .Internal(C_tryCatchHelper(addr, 1L, cond)), "argument is of length zero", base::quote(if (is.na(round(as.numeric(t_read_counts[j, row.names(probs)[i]])))) { print(paste0("NA/NaN times argument given at peak ", ... at utilities.R#23 9.getReadCountDistributions(marker_peaks = marker_peaks_set, read_counts = reads) 8.as.data.frame(getReadCountDistributions(marker_peaks = marker_peaks_set, read_counts = reads)) at utilities.R#131 7.FUN(X[[i]], ...) 6.lapply(cells, function(cell) { neighbors <- names(obj@graphs[[paste0(assay, "_nn")]][cell, obj@graphs[[paste0(assay, "_nn")]][cell, ] > 0]) reads <- Matrix::as.matrix(subset(obj, cells = neighbors, ... 5.lapply(cells, function(cell) { neighbors <- names(obj@graphs[[paste0(assay, "_nn")]][cell, obj@graphs[[paste0(assay, "_nn")]][cell, ] > 0]) reads <- Matrix::as.matrix(subset(obj, cells = neighbors, ... 4.do.call(rbind, .) 3.lapply(cells, function(cell) { neighbors <- names(obj@graphs[[paste0(assay, "_nn")]][cell, obj@graphs[[paste0(assay, "_nn")]][cell, ] > 0]) reads <- Matrix::as.matrix(subset(obj, cells = neighbors, ... at utilities.R#102 2.getCellValues(obj, cells = Cells(obj), marker_peaks_set = marker_peaks, doublets = doublets, k = k, assay = assay) at annotateDoublets_method.R#10 1.annotateDoublets(obj = TSC3075Dbl_noZero, marker_peaks = marker_peaks3075, doublets = AmuDbl_3075)

I did a troubleshooting by running utilities.R, chunk by chunk, and found that if I changed the code in line 23, 28, 33, 38 and 40 from “..t_read_counts[j, row.names(probs)[i]]…” to “…t_read_counts[j, 3]…” , I got the code to run through these lines with no errors. For instance, “t_read_counts[5, row.names(probs)[1]]” will return “NULL”, while t_read_counts[5, 3] will return the number in row 5, column 3. As I understand the code, the probs object will only hold one row, and for each cell processed by the code, its row name will correspond to the column name for column number three in t_read_counts. Am I right? And if so, is it correct to do the changes that I’ve done in line 23, 28, 33, 38 and 40? I then ran annotateDoublets() again with the above changes in utilities.R. I then met the error that was also reported by Atai:

Error in if (G[1] == 1) { : missing value where TRUE/FALSE needed 11.mclustBIC(data = structure(numeric(0), .Dim = 0:1), verbose = F) 10.eval(mc, parent.frame()) 9.eval(mc, parent.frame()) 8.Mclust(t.dist, verbose = F) at annotateDoublets_method.R#39 7.FUN(X[[i]], ...) 6.lapply(clusters, function(clus) { t.doublets <- doublets %>% subset(a.homotypic == clus, select = clusters) t.profile <- singlet.profile[, clus] t.dist <- apply(t.doublets, 1, function(cell) { ... 5.lapply(clusters, function(clus) { t.doublets <- doublets %>% subset(a.homotypic == clus, select = clusters) t.profile <- singlet.profile[, clus] t.dist <- apply(t.doublets, 1, function(cell) { ... 4.list2(...) 3.bind_rows(.) 2.lapply(clusters, function(clus) { t.doublets <- doublets %>% subset(a.homotypic == clus, select = clusters) t.profile <- singlet.profile[, clus] t.dist <- apply(t.doublets, 1, function(cell) { ... at annotateDoublets_method.R#22 1.annotateDoublets(obj = TSC3075Dbl_noZero, marker_peaks = marker_peaks3075, doublets = AmuDbl_3075) When the code is running there are multiple lines of this warning coming up: “Warning in xtfrm.data.frame(x) : cannot xtfrm data frames” . Could this have something to do with calling order() on a data.frame (It Has Always Been Wrong to Call order on a data.frame – Win Vector LLC (win-vector.com)), which seems to be done in utilities.R line 136?

Info on my session in RStudio:

sessionInfo() R version 4.1.2 (2021-11-01) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale: [1] LC_COLLATE=Norwegian Bokmål_Norway.1252 LC_CTYPE=Norwegian Bokmål_Norway.1252 LC_MONETARY=Norwegian Bokmål_Norway.1252 [4] LC_NUMERIC=C LC_TIME=Norwegian Bokmål_Norway.1252

attached base packages: [1] stats4 grid parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] mclust_5.4.10 MASS_7.3-54 textclean_0.9.3 tidyr_1.1.3 dplyr_1.0.5
[6] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0 Biobase_2.54.0 GenomicRanges_1.42.0 GenomeInfoDb_1.30.0
[11] IRanges_2.28.0 S4Vectors_0.32.3 BiocGenerics_0.40.0 MatrixGenerics_1.6.0 matrixStats_0.61.0
[16] stringr_1.4.0 png_0.1-7 ggplot2_3.3.5 SeuratObject_4.0.2 Seurat_4.0.5
[21] Signac_1.4.0 doParallel_1.0.17 iterators_1.0.13 foreach_1.5.1

loaded via a namespace (and not attached): [1] fastmatch_1.1-0 plyr_1.8.6 igraph_1.2.6 lazyeval_0.2.2 splines_4.1.2 BiocParallel_1.24.1
[7] listenv_0.8.0 scattermore_0.7 SnowballC_0.7.0 digest_0.6.27 htmltools_0.5.2 fansi_0.4.2
[13] magrittr_2.0.1 tensor_1.5 cluster_2.1.2 ROCR_1.0-11 globals_0.14.0 Biostrings_2.58.0
[19] docopt_0.7.1 spatstat.sparse_2.0-0 colorspace_2.0-0 ggrepel_0.9.1 xfun_0.30 sparsesvd_0.2
[25] crayon_1.5.0 RCurl_1.98-1.3 jsonlite_1.7.2 spatstat.data_2.1-2 survival_3.2-13 zoo_1.8-9
[31] glue_1.4.2 polyclip_1.10-0 gtable_0.3.0 zlibbioc_1.36.0 XVector_0.30.0 leiden_0.3.9
[37] DelayedArray_0.20.0 future.apply_1.8.1 abind_1.4-5 scales_1.1.1 qdapRegex_0.7.5 DBI_1.1.2
[43] miniUI_0.1.1.1 Rcpp_1.0.7 viridisLite_0.4.0 xtable_1.8-4 reticulate_1.20 spatstat.core_2.1-2
[49] htmlwidgets_1.5.4 httr_1.4.2 RColorBrewer_1.1-2 ellipsis_0.3.2 ica_1.0-2 pkgconfig_2.0.3
[55] farver_2.1.0 ggseqlogo_0.1 uwot_0.1.10 deldir_0.2-10 utf8_1.2.1 tidyselect_1.1.2
[61] rlang_1.0.2 reshape2_1.4.4 later_1.2.0 munsell_0.5.0 tools_4.1.2 cli_3.1.0
[67] generics_0.1.2 ggridges_0.5.3 evaluate_0.15 fastmap_1.1.0 yaml_2.2.1 goftest_1.2-2
[73] knitr_1.37 fitdistrplus_1.1-8 purrr_0.3.4 RANN_2.6.1 pbapply_1.5-0 future_1.24.0
[79] nlme_3.1-152 mime_0.10 slam_0.1-48 RcppRoll_0.3.0 compiler_4.1.2 rstudioapi_0.13
[85] plotly_4.10.0 spatstat.utils_2.1-0 tibble_3.1.1 tweenr_1.0.2 stringi_1.5.3 lattice_0.20-45
[91] Matrix_1.4-1 vctrs_0.3.8 pillar_1.7.0 lifecycle_1.0.1 spatstat.geom_2.1-0 lmtest_0.9-38
[97] RcppAnnoy_0.0.18 data.table_1.14.0 cowplot_1.1.1 bitops_1.0-7 irlba_2.3.3 httpuv_1.6.1
[103] patchwork_1.1.1 R6_2.5.1 promises_1.2.0.1 KernSmooth_2.23-20 gridExtra_2.3 lsa_0.73.2
[109] parallelly_1.30.0 codetools_0.2-18 assertthat_0.2.1 withr_2.5.0 qlcMatrix_0.9.7 sctransform_0.3.2
[115] Rsamtools_2.6.0 GenomeInfoDbData_1.2.7 mgcv_1.8-38 rpart_4.1-15 rmarkdown_2.13 Rtsne_0.15
[121] aggregation_1.0.1 ggforce_0.3.3 shiny_1.7.1

Best, Hanne

alperoglu commented 2 years ago

Hi Atai and Hanne,

I was a little busy last week but know I'll try to sort this out. It seems to be an issue with mclust I'll try regenerating the error somehow. Hanne for your initial problem, can you share the head of the output data frame you got from getMarkerPeaks? You can blackout the peaks if you want, I just want to make sure it's format is correct.

Best, Alper

HanneHjorthaug commented 2 years ago

Hi Alper,

Great that you will look into the issues. Unfortunately, I am not able to send the head of the dataframe for a while, I just went on a four weeks holiday. I will get back to you when I get home. I know that I had the same columns as in the vignette, but my peaks had "-" (and not ":") after "chr", matching the rownames of my Seurat object.

Best, Hanne

HanneHjorthaug commented 2 years ago

Hi Alper,

This is my output from getMarkerPeaks:

head(marker_peaks)

image

Best, Hanne

HanneHjorthaug commented 2 years ago

Hi Alper,

Any luck in solving the above issues?

Best, Hanne