campbio / celda

Bayesian Hierarchical Modeling for Clustering Single Cell Genomic Data
http://bioconductor.org/packages/celda
MIT License
148 stars 27 forks source link

Problem Running decontX with z argument #320

Closed astarr97 closed 3 years ago

astarr97 commented 3 years ago

Hello,

I am trying to run decontX on a counts matrix (called count_mat) with dim 1819 x 16382. My z argument is a vector (called names) of length 1819 containing cluster names. However when I run celda::decontX(count_mat, z = names) I get the error "Error in .processCellLabels(z, numCells = nC) : 'z' must be of the same length as the number of cells in the 'counts' matrix.". Because of this, I tried running it on the transpose of the counts matrix: celda::decontX(t(count_mat), z = names). Doing this, I get the error: "Error in rowSums(phi) : 'x' must be an array of at least two dimensions". First, should rows be the cells or should rows be the genes? Second, any idea how I can fix these errors?

Best, Alex.

zhewa commented 3 years ago

Hi Alex,

The rows should be genes and columns should be cells in the input counts matrix. Your second error seems to be dataset dependent. Make sure that no rows or columns in the counts matrix are all zeros. Also check if input variables count_mat and t(count_mat) are classes of matrix with 2 dimensions. Let us know if the error persists. If possible, can you provide a reproducible example script of the error so it would be easier for us to pinpoint where the program went wrong?

astarr97 commented 3 years ago

I will check those things, thank you for the very quick response! By reproducible example script, should it include my data?

zhewa commented 3 years ago

Yes, that would work, if you feel open doing that. A simulated data which generates this error would also work.

astarr97 commented 3 years ago

Okay I may try to simulate it, although it looks like there are zero columns so hopefully that is it!

astarr97 commented 3 years ago

That seems to have resolved it, although now RStudio is encountering a fatal error every time I run it.

astarr97 commented 3 years ago

Does decontX require an extreme amount of RAM? I cannot figure out why R encounters a fatal error when I run the program with z, but runs fine when I don't enter z.

zhewa commented 3 years ago

Hi @astarr97,

decontX do convert input matrix object to a sparse matrix dgCMatrix before calculation. A RAM of 8GB or more should be enough to deal with a dataset of size 16382 x 1819. I recommend making sure no unnecessary objects are loaded in the memory by executing rm(list = ls()) before your analysis. This removes all defined variables in the global environment and frees up memory. Also, if you are using Rstudio, uncheck Restore .RData into workspace at startup to ensure no unwanted workspace is loaded at Rstudio startup. Also, can you paste the fatal error message so we can be sure it's a memory related error?

Another thing you can check is how you defined names for the argument z. Make sure the total number of cell clusters is as you expected by executing length(unique(names)) in R.

astarr97 commented 3 years ago

I will try those things, thank you. Although I have been sure to clear my global environment each time at run time. In addition, it terminates with a fatal error within a few seconds so it may not be a RAM issue. For the z argument, it should be the same length as the number of cells, right? And the input counts matrix should just be the counts without column or row labels? Thank you again.

zhewa commented 3 years ago

Yes, z should have the same length as the number of columns in x. You can have row names and column names for x as long as these labels are in the rownames(x) and colnames(x) slots and not an actual row or column in x.

astarr97 commented 3 years ago

Okay, it is still erroring. I attached the error message and an image of my code (it is the last line that errors). I am checking with a collaborator to see if it is okay to send you the data itself or a simulated version. Thank you so much for your help so far! code

Error_Message

joshua-d-campbell commented 3 years ago

Hi @astarr97, I'm sorry this error is occurring for you. It would be great for us to figure out what it is so we can add a check to ensure we don't crash R! A few other questions for you:

1) Which version of R and celda are you using. You can run sessionInfo() and send us the results. If you are not using the latest version, I would highly recommend updating it. 2) Do you have any NA's in your matrix or names vector? Or characters or negative numbers in your matrix? 3) What is the storage mode of you matrix. You can check this by running storage.mode(counts_mat) 4) If you can't send us the data directly, can you show us the first few entries in the matrix and names vector?

counts_mat[1:5,1:5] names_vector[1:5]

astarr97 commented 3 years ago

Hello, here is the output of sessionInfo(): R version 3.6.1 (2019-07-05) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

Random number generation: RNG: Mersenne-Twister Normal: Inversion Sample: Rounding

locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages: [1] stats graphics grDevices utils datasets [6] methods base

other attached packages: [1] MASS_7.3-53 celda_1.2.4 optparse_1.6.6

loaded via a namespace (and not attached): [1] ggrepel_0.9.1 Rcpp_1.0.6
[3] lattice_0.20-41 digest_0.6.27
[5] assertthat_0.2.1 foreach_1.5.1
[7] SingleCellExperiment_1.8.0 R6_2.5.0
[9] GenomeInfoDb_1.22.1 plyr_1.8.6
[11] stats4_3.6.1 httr_1.4.2
[13] ggplot2_3.3.3 pillar_1.4.7
[15] zlibbioc_1.32.0 rlang_0.4.10
[17] rstudioapi_0.13 data.table_1.13.6
[19] S4Vectors_0.24.4 Matrix_1.3-2
[21] combinat_0.0-8 BiocParallel_1.20.1
[23] Rtsne_0.15 stringr_1.4.0
[25] RcppEigen_0.3.3.9.1 RCurl_1.98-1.2
[27] munsell_0.5.0 uwot_0.1.10
[29] DelayedArray_0.12.3 compiler_3.6.1
[31] pkgconfig_2.0.3 BiocGenerics_0.32.0
[33] tidyselect_1.1.0 SummarizedExperiment_1.16.1 [35] tibble_3.0.5 gridExtra_2.3
[37] GenomeInfoDbData_1.2.2 enrichR_2.1
[39] IRanges_2.20.2 codetools_0.2-18
[41] matrixStats_0.57.0 dendextend_1.14.0
[43] viridisLite_0.3.0 withr_2.4.0
[45] crayon_1.3.4 dplyr_1.0.3
[47] MCMCprecision_0.4.0 bitops_1.0-6
[49] MAST_1.12.0 grid_3.6.1
[51] gtable_0.3.0 lifecycle_0.2.0
[53] DBI_1.1.1 magrittr_2.0.1
[55] pROC_1.17.0.1 scales_1.1.1
[57] stringi_1.5.3 XVector_0.26.0
[59] reshape2_1.4.4 viridis_0.5.1
[61] doParallel_1.0.16 getopt_1.20.3
[63] ggdendro_0.1.22 ellipsis_0.3.1
[65] generics_0.1.0 vctrs_0.3.6
[67] rjson_0.2.20 RColorBrewer_1.1-2
[69] iterators_1.0.13 tools_3.6.1
[71] Biobase_2.46.0 glue_1.4.2
[73] purrr_0.3.4 abind_1.4-5
[75] parallel_3.6.1 yaml_2.2.1
[77] colorspace_2.0-0 BiocManager_1.30.10
[79] GenomicRanges_1.38.0

Here is the output of running > storage.mode(count_mat) [1] "integer"

And here are the first few entries of count_mat: [,1] [,2] [,3] [,4] [,5] V1 0 0 0 0 0 V2 0 0 0 0 0 V3 0 0 0 0 0 V4 1 3 0 1 0 V5 0 0 1 2 1 And names_vector: [1] MARS_13 MARS_4 MARS_11 MARS_7 MARS_3 26 Levels: MARS_0 MARS_1 MARS_10 MARS_11 MARS_12 ... MARS_9

Thank you for including the commands, definitely saved me some time.

Best, Alex.

joshua-d-campbell commented 3 years ago

Okay, it looks like you are running an older version and chances are this bug has already been fixed. The latest version is in Bioconductor 3.12 and needs R 4.0.2 to run:

https://www.bioconductor.org/packages/release/bioc/html/celda.html

astarr97 commented 3 years ago

Great I think that did it. Is this expected output with the z argument included? It is running much faster than the previous version and says that it was "Generating UMAP" so I am concerned the z argument didn't take for some reason (see attached output).

Best, Alex. output_expected

joshua-d-campbell commented 3 years ago

Yes, this is correct. It still generates a UMAP for visualization but not for clustering. Just let us know if you have other issues and make sure you read over the new vignette as there are several new plotting utilities to help in the assessment of contamination:

https://www.bioconductor.org/packages/release/bioc/vignettes/celda/inst/doc/decontX.pdf

astarr97 commented 3 years ago

Thank you so much for developing and maintaining this software!

astarr97 commented 3 years ago

As a final question, is there any way to get decontX to deal with contamination that is present in all cell clusters? There is a "hidden" cluster (we didn't collect these cells, but they still contaminate our data) that has a few genes that are so highly expressed they contaminate our data.

joshua-d-campbell commented 3 years ago

Hi @astarr97, this is a good suggestion and something we are actively working on. We hope to push an update in another month or so that will allow you to estimate a contamination distribution directly from empty droplets.

joshua-d-campbell commented 3 years ago

Hi @astarr97, we have just pushed a new version of decontX to the master branch that allows you to give the raw matrix as a second parameter. The ambient RNA distribution is estimated from the empty drops which should contain counts from your "hidden" population.

Note that you will have to reinstall celda from GitHub and build the vignette locally to see an example:

library(devtools)
install_github("campbio/celda", build_vignettes = TRUE)
vignette("decontX")