MangiolaLaboratory / sccomp

Testing differences in cell type proportions from single-cell data.
https://stemangiola.github.io/sccomp/
84 stars 7 forks source link

Inconsistent results between seeds #135

Open michaelplynch opened 1 month ago

michaelplynch commented 1 month ago

Hi Stefano, Flagging some interesting behaviour I've come across with the package.

Problem: The 1d credible intervals plot is showing different results in terms of rank, credible interval and FDR significance when run with different seeds. I see the "EXPERIMENTAL ALGORITHM" message, is this a byproduct? I'm working off sccomp v1.8 from the most recent Bioconductor release. I also notice on the Bioconductor vignette that for the boxplot all cell types are coloured as significant for c_typecancer, which looks inconsistent with the credible interval plot. Is this also a bug or have I misinterpreted the plot?

Dataset: Working with a SingleCellExperiment object, happy to share privately if you want to replicate. Code: `set.seed(3) sccomp_result = sce_pm |> sccomp_estimate( formula_composition = ~ type, .sample = sample_id, .cell_group = celltype_num, bimodal_mean_variability_association = TRUE, cores = 4, max_sampling_iterations = 200000 ) |> sccomp_remove_outliers(cores = 1) |> # Optional sccomp_test()

sccomp_result

plots = sccomp_result |> plot() plots$boxplot plots$credible_intervals_1D

set.seed(1) sccomp_result = sce_pm |> sccomp_estimate( formula_composition = ~ type, .sample = sample_id, .cell_group = celltype_num, bimodal_mean_variability_association = TRUE, cores = 4, max_sampling_iterations = 200000 ) |> sccomp_remove_outliers(cores = 1) |> # Optional sccomp_test()

sccomp_result

plots = sccomp_result |> plot() plots$boxplot plots$credible_intervals_1D` Output: image image

sessionInfo() R version 4.4.0 (2024-04-24) Platform: x86_64-pc-linux-gnu Running under: Red Hat Enterprise Linux 9.1 (Plow)

Matrix products: default BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP; LAPACK version 3.9.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Dublin tzcode source: system (glibc)

attached base packages: [1] stats4 stats graphics grDevices utils datasets [7] methods base

other attached packages: [1] sccomp_1.8.0
[2] ggplotify_0.1.2
[3] dittoSeq_1.16.0
[4] Seurat_5.1.0
[5] SeuratObject_5.0.2
[6] sp_2.1-4
[7] SCpubr_2.0.2
[8] ggpubr_0.6.0
[9] ggplot2_3.5.1
[10] dplyr_1.1.4
[11] edgeR_4.2.0
[12] limma_3.60.2
[13] SingleCellExperiment_1.26.0
[14] SummarizedExperiment_1.34.0
[15] Biobase_2.64.0
[16] GenomicRanges_1.56.0
[17] GenomeInfoDb_1.40.1
[18] IRanges_2.38.0
[19] S4Vectors_0.42.0
[20] BiocGenerics_0.50.0
[21] MatrixGenerics_1.16.0
[22] matrixStats_1.3.0
[23] ccrccprimarymetsanalysis_0.0.0.9000 [24] devtools_2.4.5
[25] usethis_2.2.3

loaded via a namespace (and not attached): [1] fs_1.6.4 spatstat.sparse_3.0-3
[3] httr_1.4.7 RColorBrewer_1.1-3
[5] profvis_0.3.8 tools_4.4.0
[7] sctransform_0.4.1 backports_1.5.0
[9] utf8_1.2.4 R6_2.5.1
[11] ResidualMatrix_1.14.0 lazyeval_0.2.2
[13] uwot_0.2.2 urlchecker_1.0.1
[15] withr_3.0.0 gridExtra_2.3
[17] progressr_0.14.0 cli_3.6.2
[19] spatstat.explore_3.2-7 fastDummies_1.7.3
[21] labeling_0.4.3 spatstat.data_3.0-4
[23] readr_2.1.5 ggridges_0.5.6
[25] pbapply_1.7-2 QuickJSR_1.2.2
[27] yulab.utils_0.1.4 StanHeaders_2.32.9
[29] parallelly_1.37.1 sessioninfo_1.2.2
[31] rstudioapi_0.16.0 generics_0.1.3
[33] gridGraphics_0.5-1 ica_1.0-3
[35] spatstat.random_3.2-3 car_3.1-2
[37] inline_0.3.19 loo_2.7.0
[39] Matrix_1.7-0 fansi_1.0.6
[41] abind_1.4-5 lifecycle_1.0.4
[43] yaml_2.3.8 carData_3.0-5
[45] SparseArray_1.4.8 Rtsne_0.17
[47] grid_4.4.0 promises_1.3.0
[49] dqrng_0.4.1 crayon_1.5.2
[51] miniUI_0.1.1.1 lattice_0.22-6
[53] beachmat_2.20.0 cowplot_1.1.3
[55] demuxmix_1.6.0 pillar_1.9.0
[57] knitr_1.47 metapod_1.12.0
[59] boot_1.3-30 future.apply_1.11.2
[61] codetools_0.2-20 leiden_0.4.3.1
[63] glue_1.7.0 data.table_1.15.4
[65] remotes_2.5.0 vctrs_0.6.5
[67] png_0.1-8 spam_2.10-0
[69] gtable_0.3.5 assertthat_0.2.1
[71] cachem_1.1.0 xfun_0.44
[73] S4Arrays_1.4.1 mime_0.12
[75] survival_3.5-8 pheatmap_1.0.12
[77] statmod_1.5.0 bluster_1.14.0
[79] ellipsis_0.3.2 fitdistrplus_1.1-11
[81] ROCR_1.0-11 nlme_3.1-164
[83] RcppAnnoy_0.0.22 rstan_2.32.6
[85] irlba_2.3.5.1 KernSmooth_2.23-22
[87] colorspace_2.1-0 tidyselect_1.2.1
[89] compiler_4.4.0 BiocNeighbors_1.22.0
[91] DelayedArray_0.30.1 plotly_4.10.4
[93] scales_1.3.0 lmtest_0.9-40
[95] stringr_1.5.1 digest_0.6.35
[97] goftest_1.2-3 spatstat.utils_3.0-4
[99] rmarkdown_2.27 XVector_0.44.0
[101] htmltools_0.5.8.1 pkgconfig_2.0.3
[103] sparseMatrixStats_1.16.0 fastmap_1.2.0
[105] rlang_1.1.4 htmlwidgets_1.6.4
[107] UCSC.utils_1.0.0 shiny_1.8.1.1
[109] DelayedMatrixStats_1.26.0 farver_2.1.2
[111] zoo_1.8-12 jsonlite_1.8.8
[113] BiocParallel_1.38.0 BiocSingular_1.20.0
[115] magrittr_2.0.3 scuttle_1.14.0
[117] GenomeInfoDbData_1.2.12 dotCall64_1.1-1
[119] patchwork_1.2.0 munsell_0.5.1
[121] Rcpp_1.0.12 viridis_0.6.5
[123] reticulate_1.37.0 stringi_1.8.4
[125] zlibbioc_1.50.0 MASS_7.3-60.2
[127] plyr_1.8.9 pkgbuild_1.4.4
[129] parallel_4.4.0 listenv_0.9.1
[131] ggrepel_0.9.5 forcats_1.0.0
[133] deldir_2.0-4 splines_4.4.0
[135] tensor_1.5 hms_1.1.3
[137] locfit_1.5-9.9 igraph_2.0.3
[139] spatstat.geom_3.2-9 ggsignif_0.6.4
[141] RcppHNSW_0.6.0 reshape2_1.4.4
[143] ScaledMatrix_1.12.0 pkgload_1.3.4
[145] rstantools_2.4.0 evaluate_0.24.0
[147] BiocManager_1.30.23 RcppParallel_5.1.7
[149] scran_1.32.0 tzdb_0.4.0
[151] httpuv_1.6.15 batchelor_1.20.0
[153] RANN_2.6.1 tidyr_1.3.1
[155] purrr_1.0.2 polyclip_1.10-6
[157] future_1.33.2 scattermore_1.2
[159] rsvd_1.0.5 broom_1.0.6
[161] xtable_1.8-4 RSpectra_0.16-1
[163] rstatix_0.7.2 later_1.3.2
[165] viridisLite_0.4.2 tibble_3.2.1
[167] memoise_2.0.1 cluster_2.1.6
[169] globals_0.16.3

Thanks!

stemangiola commented 1 month ago

Yes please share sce_pm at stefano.mangiola@adelaide.edu.au

Thanks.

stemangiola commented 1 month ago

I also notice on the Bioconductor vignette that for the boxplot all cell types are coloured as significant for c_typecancer, which looks inconsistent with the credible interval plot

I'll have a look tomorrow morning.

stemangiola commented 4 weeks ago

Hello, please try the branch https://github.com/MangiolaLaboratory/sccomp/tree/allow-vb-output-samples

I have allowed sampling iterations > 1000 for vb

Also I implemented the more modern backend with the fast and very reliable pathfinder, please try it out as well

https://github.com/MangiolaLaboratory/sccomp/tree/cmdstanr

stemangiola commented 4 weeks ago

VB with custom number of draws

image

full-bayes HMC

set.seed(3)
sccomp_result =
  sccomp_test_sce |>
  sccomp_estimate(
    formula_composition = ~ type,
    .sample = sample_id,
    .cell_group = celltype_num,
    bimodal_mean_variability_association = TRUE,
    variational_inference = F
  ) |>
  sccomp_remove_outliers(variational_inference = F) |> # Optional
  sccomp_test()

sccomp_result

plots = sccomp_result |> plot()
#plots$boxplot
#plots$credible_intervals_1D

set.seed(1)
sccomp_result =
  sccomp_test_sce |>
  sccomp_estimate(
    formula_composition = ~ type,
    .sample = sample_id,
    .cell_group = celltype_num,
    bimodal_mean_variability_association = TRUE,
    variational_inference = F
  ) |>
  sccomp_remove_outliers(variational_inference = F) |> # Optional
  sccomp_test()

sccomp_result

plots2 = sccomp_result |> plot()
#plots$boxplot

plots$credible_intervals_1D + plots2$credible_intervals_1D
image

Now the new cmdstanr backend (updated now, please reinstall), using the fast pathfinder in this case (HMC is always available)

set.seed(3)
sccomp_result =
    sccomp_test_sce |>
    sccomp_estimate(
        formula_composition = ~ type,
        .sample = sample_id,
        .cell_group = celltype_num,
        bimodal_mean_variability_association = TRUE
    ) |>
    sccomp_remove_outliers() |> # Optional
    sccomp_test()

sccomp_result

plots = sccomp_result |> plot()
#plots$boxplot
#plots$credible_intervals_1D

set.seed(1)
sccomp_result =
    sccomp_test_sce |>
    sccomp_estimate(
        formula_composition = ~ type,
        .sample = sample_id,
        .cell_group = celltype_num,
        bimodal_mean_variability_association = TRUE
    ) |>
    sccomp_remove_outliers() |> # Optional
    sccomp_test()

sccomp_result

plots2 = sccomp_result |> plot()
#plots$boxplot

plots$credible_intervals_1D + plots2$credible_intervals_1D
image
stemangiola commented 4 weeks ago

Having tested this a little, I would say, use

variational_bayes = FALSE,

it is the gold standard and gives the most consistent results.

I will investigate further the variational strategy.

michaelplynch commented 4 weeks ago

Hi Stefano, Can you confirm that it's the cmdstanr branch I should be reinstalling on? On Windows I get an install error similar to your GHA. On Linux oddly enough it will install but errors with "Further attempt with Variational Bayes: Error: Model not compiled. Try running the compile() method first."

stemangiola commented 4 weeks ago

You can

michaelplynch commented 3 weeks ago

Thanks Stefano.

Assuming you mean variational_inference=F (rather than variational_bayes=F). variational_bayes=F returns an unused argument error.

For master branch, variational_inference=F looks more consistent. For cmdstanr branch, inference_method="pathfinder" also looks more consistent.

Other feedback: I'm not sure was the pathfinder implementation much faster than HMC on the master branch in this case but I haven't formally tested this. For cmdstanr branch with HMC e.g.: sccomp_result = sccomp_test_sce |> sccomp_estimate( formula_composition = ~ type, .sample = sample_id, .cell_group = celltype_num, bimodal_mean_variability_association = TRUE, inference_method = "HMC" ) |> sccomp_remove_outliers() |> # Optional sccomp_test() and sccomp_result = sccomp_test_sce |> sccomp_estimate( formula_composition = ~ type, .sample = sample_id, .cell_group = celltype_num, bimodal_mean_variability_association = TRUE, inference_method = "HMC", variational_inference = FALSE ) |> sccomp_remove_outliers() |> # Optional sccomp_test() return errors Error in vb_iterative(mod, output_samples = output_samples, iter = 10000, : sccomp says: variational Bayes did not converge after 5 attempts. Please use variational_inference = FALSE for a HMC fitting. and Error in fit2$num_chains() : attempt to apply non-function, at least with this dataset.