JSB-UCLA / scDEED

Single-cell dubious embedding detector (scDED): a statistical method for detecting dubious non-linear embeddings
MIT License
30 stars 4 forks source link

Error in scDEED #2

Closed yollct closed 4 months ago

yollct commented 10 months ago

Hi scDeed authors,

Thank you for the cool tool. But it seems I couldn't get it run. I tried using the example data and data from 10x:

  1. When I run umap_example <- scDEED(input_counts , num_pc = 16, use_method = "umap",visualization = TRUE) it returned an error:

Error in identical(input_data, input_counts) && perplexity == c(seq(from = 20, : 'length = 22' in coercion to 'logical(1)' Traceback:

  1. scDEED(input_counts, num_pc = 16, use_method = "umap", visualization = TRUE)
  1. When I use data from 10x and run scDEED, it gave me this error, which is encountered after the message 'Optimization finished' :

Error in Permuted(input_data): no slot of name "scale.data" for this object of class "Assay5" Traceback:

  1. scDEED(brain_sce, num_pc = 15, use_method = "umap", visualization = TRUE)
  2. Permuted(input_data)

I was actually running this on a spatial data matrix (filtered): https://www.10xgenomics.com/resources/datasets/mouse-brain-serial-section-2-sagittal-anterior-1-standard-1-1-0

Thank you!

Best, Chit Tong

This is the R environment I am using:

R version 4.3.1 (2023-06-16) Platform: x86_64-conda-linux-gnu (64-bit) Running under: Ubuntu 22.04.1 LTS

Matrix products: default BLAS/LAPACK: /nfs/home/students/chit/.conda/envs/nease/envs/sc/lib/libopenblasp-r0.3.24.so; LAPACK version 3.11.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Berlin tzcode source: system (glibc)

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] devtools_2.4.5 usethis_2.2.2

loaded via a namespace (and not attached): [1] miniUI_0.1.1.1 compiler_4.3.1 crayon_1.5.2 promises_1.2.1
[5] Rcpp_1.0.11 stringr_1.5.0 callr_3.7.3 later_1.3.1
[9] fastmap_1.1.1 mime_0.12 R6_2.5.1 curl_5.1.0
[13] htmlwidgets_1.6.2 desc_1.4.2 profvis_0.3.8 rprojroot_2.0.4
[17] shiny_1.7.5.1 rlang_1.1.2 cachem_1.0.8 stringi_1.7.12
[21] httpuv_1.6.12 fs_1.6.3 pkgload_1.3.3 memoise_2.0.1
[25] cli_3.6.1 magrittr_2.0.3 ps_1.7.5 digest_0.6.33
[29] processx_3.8.2 xtable_1.8-4 remotes_2.4.2.1 lifecycle_1.0.4
[33] prettyunits_1.2.0 vctrs_0.6.4 glue_1.6.2 urlchecker_1.0.1 [37] sessioninfo_1.2.2 pkgbuild_1.4.2 purrr_1.0.2 tools_4.3.1
[41] ellipsis_0.3.2 htmltools_0.5.7

JSB-UCLA commented 10 months ago

Dear Chit Tong,

Thank you for your interest in scDEED. The error messages are due to the update of Seurat from V4 to V5. We are fixing this issue, and we will reply to you as soon as we fix it. Thanks.

On Tue, Nov 7, 2023 at 7:39 AM Chit Tong Lio @.***> wrote:

Hi scDeed authors,

Thank you for the cool tool. But it seems I couldn't get it run. I tried using the example data and data from 10x:

  1. When I run umap_example <- scDEED(input_counts , num_pc = 16, use_method = "umap",visualization = TRUE) it returned an error:

Error in identical(input_data, input_counts) && perplexity == c(seq(from = 20, : 'length = 22' in coercion to 'logical(1)' Traceback:

  1. scDEED(input_counts, num_pc = 16, use_method = "umap", visualization = TRUE)

  2. When I use data from 10x and run scDEED, it gave me this error, which is encountered after the message 'Optimization finished' :

Error in Permuted(input_data): no slot of name "scale.data" for this object of class "Assay5" Traceback:

  1. scDEED(brain_sce, num_pc = 15, use_method = "umap", visualization = TRUE)
  2. Permuted(input_data)

I was actually running this on a spatial data matrix (filtered): https://www.10xgenomics.com/resources/datasets/mouse-brain-serial-section-2-sagittal-anterior-1-standard-1-1-0

Thank you!

Best, Chit Tong

This is the R environment I am using:

R version 4.3.1 (2023-06-16) Platform: x86_64-conda-linux-gnu (64-bit) Running under: Ubuntu 22.04.1 LTS

Matrix products: default BLAS/LAPACK: /nfs/home/students/chit/.conda/envs/nease/envs/sc/lib/ libopenblasp-r0.3.24.so; LAPACK version 3.11.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Berlin tzcode source: system (glibc)

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] devtools_2.4.5 usethis_2.2.2

loaded via a namespace (and not attached): [1] miniUI_0.1.1.1 compiler_4.3.1 crayon_1.5.2 promises_1.2.1 [5] Rcpp_1.0.11 stringr_1.5.0 callr_3.7.3 later_1.3.1 [9] fastmap_1.1.1 mime_0.12 R6_2.5.1 curl_5.1.0 [13] htmlwidgets_1.6.2 desc_1.4.2 profvis_0.3.8 rprojroot_2.0.4 [17] shiny_1.7.5.1 rlang_1.1.2 cachem_1.0.8 stringi_1.7.12 [21] httpuv_1.6.12 fs_1.6.3 pkgload_1.3.3 memoise_2.0.1 [25] cli_3.6.1 magrittr_2.0.3 ps_1.7.5 digest_0.6.33 [29] processx_3.8.2 xtable_1.8-4 remotes_2.4.2.1 lifecycle_1.0.4 [33] prettyunits_1.2.0 vctrs_0.6.4 glue_1.6.2 urlchecker_1.0.1 [37] sessioninfo_1.2.2 pkgbuild_1.4.2 purrr_1.0.2 tools_4.3.1 [41] ellipsis_0.3.2 htmltools_0.5.7

— Reply to this email directly, view it on GitHub https://github.com/JSB-UCLA/scDEED/issues/2, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQZP7H4GU7CZ7PX3IGY43C3YDJIZXAVCNFSM6AAAAAA7BMWUVSVHI2DSMVQWIX3LMV43ASLTON2WKOZRHE4DCNRVGIYTONY . You are receiving this because you are subscribed to this thread.Message ID: @.***>

-- Best, Jessica


Jingyi Jessica Li (李婧翌), Ph.D.

Professor Department of Statistics and Data Science (Primary) Departments of Biostatistics, Computational Medicine, and Human Genetics (Secondary) University of California, Los Angeles http://jsb.ucla.edu Twitter: @jsb_ucla

clee700 commented 10 months ago

Hello, As Jessica mentioned, some changes in accessing data slots changed from Seurat V4 to V5. The attached file contains an updated R code that should fix it; we are still adding some things before we update github, but this should work to run scDEED.

Best, Christy scDEED_R.zip

kloot commented 6 months ago

Hi Christy, Thank you very much for scDEED and the updated code in scDEED.R. It does indeed run with Seurat v5 objects, however using your example data, I obtained a very different result than described in the vignette. Here is what I did:

# First we will load the data. scDEED uses Seurat objects. 
## No example data was provided in scDEED_R.zip; the Rdata loaded below was created in a separate R session by
#devtools::install_github("JSB-UCLA/scDEED")
#suppressPackageStartupMessages(library(scDEED))
#data(input_counts)
#saveRDS(input_counts, "input_counts.rds")

input_counts <- readRDS("input_counts.rds")
data = CreateSeuratObject(input_counts)
data <- NormalizeData(data)    # not in scDEED_R/tutorial.Rmd but likely required
data = ScaleData(data)
data = FindVariableFeatures(data)
data = RunPCA(data)
data
An object of class Seurat 
33391 features across 10000 samples within 1 assay 
Active assay: RNA (33391 features, 2000 variable features)
 3 layers present: counts, data, scale.data
 1 dimensional reduction calculated: pca
Version(data)
[1] ‘5.0.1’
chooseK(data)
# looks somewhat different to plot in vignette
K <-7

ElbowPlot

umap_result = scDEED(data, K=K, reduction.method='umap', visualization=TRUE, n_neighbors=c(5,6,7,8,9,10), min.dist=0.1)
head(umap_result, 6)

image

Also, the runtime using the umap defaults for n_neighbours and min.dist is around 8 hours. I realise that this explores more parameter combinations than the example above, but still wonder whether several hours is 'expected behaviour' or whether it signal something going wrong. Unfortunately, using the the umap defaults also returns more than 6000 dubious cells at best

Many thanks in advance!!

clee700 commented 6 months ago

Hello, 1) Regarding the elbow plot, I'm not sure why it looks different than our previously saved results. The chooseK function that scDEED uses is really just the ElbowPlot function from Seurat. It could be due to different versions of Seurat. Mine now looks the same as yours and I will update our tutorial.

2) Regarding the large number of dubious cells; here I did find an error in my updated version of the code. Thank you for pointing this out! I have updated this in the attached file.

The number of dubious cells still does not match the tutorial exactly; it turns out that using the default Seurat RunPCA(object, npcs = 50) and then subsetting to K dimensions is not the same as RunPCA(object, npcs = K). This seems to be an implementation error since they should be the same. We have chosen to use RunPCA(object, npcs = K), which results in a slightly different number of dubious cells compared to previously.

3) Yes, the run time is unfortunately a bit long. 8 hours does not sound like something is going wrong. Unfortunately, recalculating the umap for the permuted/original data for many hyperparameter settings can be time intensive.

I have also made several modifications to the tutorial. scDEED.zip

Thank you!

kloot commented 6 months ago

Hi Christy,

Many thanks for the revised code, I can confirm that it identifies more realistic counts of 'dubious cells': For my data comprising 14,000 cells (not the example data), scDEED found 68 dubious cells at n_neighb==150 and min.dist==0.3, marginally fewer than for Seurat's RunUMAP default parameters (n_neigh==30, min_dist==0.3) where scDEED found 76 dubious cells.

However, I ran with scDEED with visualization=T' but no UMAPlot highlighting dubious cells was produced; I can' see the code blocks to generate are not included in (revised) scDEED.zip. So wonder whether you recommend to replace the parts of scDEED.R code in the package with the code in revised scDEED.zip?

Finally, can you confirm that the values in column 'dubious cells' of the 'results' df represent the row numbers of the input data?

Meanwhile, many thanks for all your efforts!!

PS: Don't worry about run time, it's fine when it completes OK. Just makes experimenting with code a little slow...

clee700 commented 6 months ago

Hello, Sorry for my own delay in getting the tutorial published on GitHub, but yes as you saw I removed the code block to produce the visualization. This way people can just plot the dubious cells for any hyperparameter set they may be interested in.

Yes, values in the column dubious cells represents the number of dubious cells at that hyperparameter setting, in a string separated by commas. These are actually the column names (the original data and Seurat objects are gene x cell).

You can plot the dubious cells this way:

result_umap = scDEED(data, K = K, reduction.method = 'umap', n_neighbors = c(5, 20, 30, 40, 50), min.dist = c(0.2, 0.6)) result_umap$num_dubious[result_umap$num_dubious$number.of.dubious.cells==min(result_umap$num_dubious$number.of.dubious.cells), ]

minimum number of dubious cells is n_neighbors 40 and min.dist 0.2

run at the optimized umap hyperparameter settings

input_data_optimized = RunUMAP(input_data, seed.use = 100, min.dist = 0.2, n.neighbors = 40, dims = 1:K)

dubious_cells = result_umap$full_results$dubious cells[which(result_umap$num_dubious$number.of.dubious.cells==min(result_umap$num_dubious$number.of.dubious.cells))] dubious_cells = as.numeric(strsplit(dubious_cells, ',')[[1]]) p = DimPlot(input_data_optimized, cells.highlight = list(dubious_cells))

Best, Christy

kloot commented 6 months ago

Dear Christy,

Thank you again for the pointers and explanation. May I trouble you with two follow-up questions and observations?

1) columns 'trustworthy cells' and 'intermediate cells' ofresult_umap[[2]] appear to have identical content (counts of row numbers/barcodes is the same, and equal to total cells - dubious cells). Below is an example (my own data, showing result_2_UMAP[[2]] in RStudio's data viewer). If this is expected, can you please let me know the definition of 'intermediate cells'? I assumed they were the 5..95% range as per your paper. image

2) just checking - your snippet result_umap$num_dubious[result_umap$num_dubious$number.of.dubious.cells==min(result_umap$num_dubious$number.of.dubious.cells), ] appears to return the record with the alphanumerically lowest value of dubious cells, but not the record with the numerically lowest value. Perhaps a datatype/sorting issue? Example for my data: image vs image

Getting there! Thanks for your patience! K

clee700 commented 6 months ago

Hi K, Thank you for your patience as well and sorry for all these errors…

  1. Yes I did find an error in the returned object, where instead of returning the intermediate cells, I again returned the trustworthy cells.
  2. I also think the ‘number of dubious cells’ was stored as a character, and I thought I had turned it back to numeric but I guess not. Thank you again for all your patience and help debugging the package scDEED.zip , I have fixed these errors here!

Best, Christy

kloot commented 6 months ago

Hi Christy,

No worries, and thanks for attending to it so quickly

Re 1), do I need to re-run the scDEED function (all 8 hours of it...) to obtain the correct cells? Also, the last line in the new zip file seems incomplete or misplaced?

Kind regards Klaus

clee700 commented 6 months ago

HI Klaus, Yes you are right, that last line was me troubleshooting the error, apologies. No I don’t think it is necessary to rerun. Your method of making the dubious cell column numeric works so it is not alphanumeric not numeric. The dubious and trustworthy cells are right, so to obtain the intermediate cells, if needed, this would be any cell that is not in the dubious or trustworthy list. The dubious embedding are probably the most important.

Best, Christy

kloot commented 6 months ago

Hi Christy,

Sorry to trouble you with bad news. I'm using scDEED.R from the zip archive you shared most recently (13 hours ago as I write this) I have delete the last line of code (scDEED.R now ends with ###original code on line 346)

I've run my data twice, and each time it produces this error:

Error in `$<-.data.frame`(`*tmp*`, "number of dubious cells", value = numeric(0)) : 
  replacement has 0 rows, data has 3

where data has n is the count of min.dist x n_neighbors combinations passed to scDEED. No output list is produced.

Independent of the error, can you please comment on my percentage of trustworthy cells (note this refers to results obtained with the second scDEED.zip archive you shared, before the most recent) For my data, the count of trustworthy cells is very close to all cells - dubious cells. In other words, I have very few intermediate cells. For example, the UMAP parameter combination resulting in the fewest dubious cells (68 cells / 0.49%) produces 13,172 (94.8%) trustworthy cells (13,895 cells in total) My cells are sorted by FACS and are not a highly diverse population. However, they typically split into 10-14 clusters, some of the clusters have ROGUE values of 0.8, and are defined by DE genes that are well-supported by published wet-lab results - so they are not perfectly homogeneous. The 68 dubious cells appear in small 'satellite' populations (which always split into 'early' clusters) and at the edges of larger populations/clusters, very similar to Fig 8b in your paper.

Curiously, I obtain 76 dubious cells with Seurat RunUMAP default parameters (min.dist 0.3, n_neighbours 30), while the (so far) best parameters suggested by scDEED result in 68 dubious cells but at much larger n_neighbours (min.dist 0.3, n_neighbours 150). I have explored around both parameter pairs and the pair identified by scDEED is a more robust minimum (varying scDEED's n.neighbours in small increments results in substantially lower dubios cells than varying Seurat default's by the same degree).

To me, these observations suggest that the scDEED approach works even if Seurat defaults were close (for this data set, I have another where they produced much higher dubious cell count than scDEED's parameters after just one round of optimisation)

BTW, have you ever succeeded in optimising umap parameters to drive dubious cells count to zero? Are min.dist and n_neighbours rounded prior to your calculations?

Sorry for the long post!

clee700 commented 6 months ago

Hi Klaus, Ahhh I see, actually I think the error is related to your comment on the number of trustworthy cells.

The cutoffs for dubious and trustworthy cells are based on the null distribution; thus the number of dubious and trustworthy cells may be much larger than 5%. Actually, in our simulation results, we found that often there are not that many intermediate cells; the large majority of cells are trustworthy. I think this makes sense because if UMAP and t-SNE were not able to produce low dimensional embeddings that reasonably represented the pre-embedding space, then no one would use it. We are just trying to improve their low embedding space via hyperparameter optimization and identification of distortions.

Below is a copy of Supplementary Fig. S17c from our paper, showing the number of intermediate cells (I have edited it here so that only the similarity percent =0.5, the default, is shown). At many perplexities, the number of intermediate cells is very low, actually 0. And this is what is causing the problem on the scDEED code, because now it cannot make the data frame when there are 0 intermediate cells. I didn’t think about that until you mentioned the error, I am sorry about that! I have updated this here. image

Yes, we also noted that around the minimum number of dubious cells, we find that the hyperparameter settings close to each other tend to be more stable. This is particularly true for t-SNE as perplexity increases (there are a few supplementary figures that show this). For UMAP it can be more random- the number of dubious cells can change drastically across hyperparameter settings. It is harder because there are two hyperparameters.

Also yes to reaching 0 dubious cells; the marrow dataset (Supplementary Fig. S13) we obtained 0 dubious cells. I think also for the PBMC dataset, on both UMAP and t-SNE. There were some other examples as well, and actually we found that through hyper parameter optimziation, it is possible for both t-SNE and UMAP to have the same amount of dubious cells, and they may both be 0. So if you are not sure about which one to use, or maybe UMAP has more dubious cells than you would like, then you could try t-SNE and see if it’s better. Although it does sound like you have a good result from UMAP already!

The last question I am not totally sure I am understanding- the n.neighbors and min.dist parameters are input directly into the RunUMAP function for Seurat. Per their website, n.neighbors should be between 5 to 50 and min.dist should be between 0.001 to 0.5 (although other websites have recommended differently, like this: https://pair-code.github.io/understanding-umap/. Our suggested UMAP parameters are to try to span the range of min.dist and n.neighbors without being too computationally expensive (we have changed the default to min.dist = c(0.1, 0.4) and n_neighbors = c(5, 20, 30, 40, 50) so there are only 10 pairs).

Thank you very much for all your interest! This has been very helpful! scDEED.zip

kloot commented 5 months ago

Hi Christy, Sorry for my late reply, I had to attend to other 'distractions' and also explored a lot more UMAP hyperparameter combinations with scDEED.

Overall, I found 5 very different combinations with similarly minimal counts of 'dubious' cells. 2* of the 5 combinations resulted in 'moving' of cells that were 'dubious' in the Seurat default UMAP and were located far away from the majority of the cells that were in the same cluster. Satisfyingly, these optimised hyperparameters 'moved' these cells into the UMAP area occupied by their cluster. However, this was achieved for 'only' 57 of the 90 cells whose scDEED label change from 'dubious' to 'intermediate'/'trustworthy'. But the Seurat defaults produced only 243 (1.7%) 'dubious cells, this appears to be a reasonable gain.

*one of these combinations was min.dist 0.3 and (extreme) n_neighbors 430 - the combination with the overall lowest 'dubious' count of 149. The UMAP does not look very different from those produced by combinations in the range (0.1....0.3/15...61)

I'm keen to see what happens with a more diverse data set ;-)

Thanks also for the link to the umap explanation page - very nice! FYI, the scDEED.R file in the zip archive attached to your reply still produces an error:

Error in `$<-.data.frame`(`*tmp*`, "number of dubious cells", value = numeric(0)) : 
  replacement has 0 rows, data has 1

Thanks you very much for all your help and guidance - I have enjoyed it very much

Kind regards K

clee700 commented 5 months ago

Hi K, Sorry for the very long delay; but I have done a lot of debugging and believe I found the issue. This has now been posted to the main branch, please reinstall scDEED using

devtools::install_github("JSB-UCLA/scDEED")

For datasets with a complex toplogy, which may be what you have, it may not be possible to obtain 0 dubious cells, so in this case the goal is just to minimize the number of dubious cells so the overall visual is more trustworthy (like you said, the decrease from 243 dubious cell embeddings could be a reasonable gain). Similar to you, we have also found that some hyperparameter settings can result in similar visualizations. Overall, for UMAP it can be much harder to predict, due to the algorithm and combination of two hyperparameters.

Thank you for sharing your experience with scDEED and helping debug the package!