andreamrau / coseq

Co-Expression Analysis of Sequencing Data
6 stars 2 forks source link

Reproducible results with seed and parallel arguments #3

Closed OceaneCsn closed 3 years ago

OceaneCsn commented 3 years ago

First, thanks for the amazing package!

After a brief exchange with Andrea (at the Netbio talks), I experimented a bit more with seeds and reproducibility in coseq. Even when passing a seed argument to the latest version of the coseq() function, I could not get identical coseq runs. Turns out the parallel option might get in the way of reproducibility, at least for some of the models implemented in coseq. Here is what I get :

(I used the correlation between the cluster assignments from two runs with identical parameters, that should be 1 for identical random states, the expected behavior)

------------------ run once -------------------------
library(coseq)
set.seed(12345)
countmat <- matrix(runif(300*4, min=0, max=500), nrow=300, ncol=4)
countmat <- countmat[which(rowSums(countmat) > 0),]
----------------------------------------------------- 

## Normal mixture model for K = 2,3,4 single thread
run_arcsin1 <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin",
                    model="Normal", seed=12345, verbose = FALSE)

run_arcsin2 <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin",
                    model="Normal", seed=12345, verbose = FALSE)
cor(clusters(run_arcsin1), clusters(run_arcsin2))
[1] 1

## Poisson mixture model for K = 2,3,4 single thread
run_pois1 <- coseq(object=round(countmat), K=2:4, iter=5, transformation="none",
                     model="Poisson", seed=12345, parallel = FALSE, verbose = FALSE)

run_pois2 <- coseq(object=round(countmat), K=2:4, iter=5, transformation="none",
                     model="Poisson", seed=12345, parallel = FALSE, verbose = FALSE)
cor(clusters(run_pois1), clusters(run_pois2))
[1] 1

## kmeans model single thread
runLogCLR1 <- coseq(countmat, K=2:25, transformation="logclr",norm="TMM", 
                    meanFilterCutoff=10, model="kmeans",
                    nstart=1, iter.max=10, parallel = FALSE, seed=12345)

runLogCLR2 <- coseq(countmat, K=2:25, transformation="logclr",norm="TMM", 
                    meanFilterCutoff=10, model="kmeans",
                    nstart=1, iter.max=10, parallel = FALSE, seed=12345)
cor(clusters(runLogCLR1), clusters(runLogCLR2))
[1] 1
## Run the Normal mixture model for K = 2,3,4 parallel
run_arcsin1 <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin",
                     model="Normal", seed=12345, parallel = TRUE)

run_arcsin2 <- coseq(object=countmat, K=2:4, iter=5, transformation="arcsin",
                     model="Normal", seed=12345, parallel = TRUE)
cor(clusters(run_arcsin1), clusters(run_arcsin2))
[1] 1

## Run the Poisson mixture model for K = 2,3,4 parallel
run_pois1 <- coseq(object=round(countmat), K=2:4, iter=5, transformation="none",
                     model="Poisson", seed=12345, parallel = TRUE)

run_pois2 <- coseq(object=round(countmat), K=2:4, iter=5, transformation="none",
                     model="Poisson", seed=12345, parallel = TRUE)
cor(clusters(run_pois1), clusters(run_pois2))
[1] 0.5751166

## Run the kmeans model parallel
runLogCLR1 <- coseq(countmat, K=2:25, transformation="logclr",norm="TMM", 
                   meanFilterCutoff=10, model="kmeans",
                   nstart=1, iter.max=10, parallel = TRUE, seed=12345)

runLogCLR2 <- coseq(countmat, K=2:25, transformation="logclr",norm="TMM", 
                    meanFilterCutoff=10, model="kmeans",
                    nstart=1, iter.max=10, parallel = TRUE, seed=12345)
cor(clusters(runLogCLR1), clusters(runLogCLR2))
[1] -0.1066347

Can you reproduce this behavior?

If so, I don't know if there's a fix, but I thought I should share this in case it had gone unnoticed. (Maybe looking into biocParallel options for multithreading procedures and seed setting could be helpful).

Bests regards


Session info :

R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] coseq_1.14.0                SummarizedExperiment_1.20.0 Biobase_2.50.0              GenomicRanges_1.42.0       
 [5] GenomeInfoDb_1.26.2         IRanges_2.24.0              S4Vectors_0.28.0            BiocGenerics_0.36.0        
 [9] MatrixGenerics_1.2.0        matrixStats_0.57.0          DIANE_0.99.1                shiny_1.5.0                

loaded via a namespace (and not attached):
  [1] tidyr_1.1.2              ggplot2_3.3.2            bit64_4.0.5              knitr_1.30              
  [5] DelayedArray_0.16.0      data.table_1.13.4        rpart_4.1-15             RCurl_1.98-1.2          
  [9] doParallel_1.0.16        generics_0.1.0           snow_0.4-3               callr_3.5.1             
 [13] cowplot_1.1.0            shinybusy_0.2.2          lambda.r_1.2.4           usethis_2.0.0           
 [17] RSQLite_2.2.1            shadowtext_0.0.7         org.At.tair.db_3.12.0    future_1.21.0           
 [21] bit_4.0.4                enrichplot_1.10.1        bayesm_3.1-4             spatstat.data_1.5-2     
 [25] xml2_1.3.2               httpuv_1.5.4             assertthat_0.2.1         viridis_0.5.1           
 [29] xfun_0.19                hms_0.5.3                evaluate_0.14            promises_1.1.1          
 [33] DEoptimR_1.0-8           fansi_0.4.1              progress_1.2.2           igraph_1.2.6            
 [37] DBI_1.1.0                geneplotter_1.68.0       htmlwidgets_1.5.2        futile.logger_1.4.3     
 [41] tensorA_0.36.2           purrr_0.3.4              ROC_1.66.0               ellipsis_0.3.1          
 [45] crosstalk_1.1.0.1        corrplot_0.84            dplyr_1.0.2              markdown_1.1            
 [49] annotate_1.68.0          compositions_2.0-0       deldir_0.2-3             vctrs_0.3.5             
 [53] remotes_2.2.0            here_1.0.0               abind_1.4-5              withr_2.3.0             
 [57] ggforce_0.3.2            loggit_2.1.0             robustbase_0.93-6        prettyunits_1.1.1       
 [61] goftest_1.2-2            DOSE_3.16.0              lazyeval_0.2.2           crayon_1.3.4            
 [65] HTSCluster_2.0.8         genefilter_1.72.0        labeling_0.4.2           units_0.6-7             
 [69] edgeR_3.32.0             pkgconfig_2.0.3          tweenr_1.0.1             nlme_3.1-150            
 [73] pkgload_1.1.0            rlang_0.4.9              globals_0.14.0           lifecycle_0.2.0         
 [77] downloader_0.4           rfPermute_2.1.81         Rmixmod_2.1.5            VennDiagram_1.6.20      
 [81] randomForest_4.6-14      rprojroot_2.0.2          polyclip_1.10-0          rngtools_1.5            
 [85] Matrix_1.2-18            ggridges_0.5.2           processx_3.4.5           pheatmap_1.0.12         
 [89] viridisLite_0.3.0        bitops_1.0-6             shinydashboard_0.7.1     KernSmooth_2.23-18      
 [93] visNetwork_2.0.9         blob_1.2.1               classInt_0.4-3           doRNG_1.8.2             
 [97] stringr_1.4.0            qvalue_2.22.0            shinydashboardPlus_0.7.5 parallelly_1.21.0       
[101] scales_1.1.1             memoise_1.1.0            magrittr_2.0.1           plyr_1.8.6              
[105] zlibbioc_1.36.0          compiler_4.0.3           scatterpie_0.1.5         RColorBrewer_1.1-2      
[109] plotrix_3.7-8            DESeq2_1.30.0            cli_2.2.0                ade4_1.7-16             
[113] XVector_0.30.0           listenv_0.8.0            ps_1.5.0                 formatR_1.7             
[117] MASS_7.3-53              mgcv_1.8-33              tidyselect_1.1.0         stringi_1.5.3           
[121] yaml_2.2.1               GOSemSim_2.16.1          dockerfiler_0.1.3        locfit_1.5-9.4          
[125] ggrepel_0.8.2            grid_4.0.3               fastmatch_1.1-0          tools_4.0.3             
[129] rstudioapi_0.13          foreach_1.5.1            gridExtra_2.3            farver_2.0.3            
[133] ggraph_2.0.4             digest_0.6.27            rvcheck_0.1.8            BiocManager_1.30.10     
[137] Rcpp_1.0.5               later_1.1.0.1            shinyWidgets_0.5.4       httr_1.4.2              
[141] AnnotationDbi_1.52.0     sf_0.9-6                 HTSFilter_1.30.0         GENIE3_1.12.0           
[145] colorspace_2.0-0         XML_3.99-0.5             fs_1.5.0                 tensor_1.5              
[149] TCC_1.30.0               splines_4.0.3            ggVennDiagram_0.3        mapdata_2.3.0           
[153] swfscMisc_1.3            spatstat.utils_1.17-0    graphlayouts_0.7.1       plotly_4.9.2.1          
[157] shinyalert_2.0.0         xtable_1.8-4             jsonlite_1.7.1           futile.options_1.0.1    
[161] baySeq_2.24.0            spatstat_1.64-1          tidygraph_1.2.0          testthat_3.0.0          
[165] R6_2.5.0                 pillar_1.4.7             htmltools_0.5.0          mime_0.9                
[169] tictoc_1.0               glue_1.4.2               fastmap_1.0.1            clusterProfiler_3.18.0  
[173] DT_0.16                  BiocParallel_1.24.1      DESeq_1.39.0             class_7.3-17            
[177] capushe_1.1.1            codetools_0.2-18         maps_3.3.0               fgsea_1.16.0            
[181] pkgbuild_1.1.0           mvtnorm_1.1-1            lattice_0.20-41          tibble_3.0.4            
[185] attempt_0.3.1            dashboardthemes_1.1.3    config_0.3               GO.db_3.12.1            
[189] golem_0.2.1              survival_3.2-7           limma_3.46.0             roxygen2_7.1.1          
[193] rmarkdown_2.5            desc_1.2.0               munsell_0.5.0            e1071_1.7-4             
[197] DO.db_2.9                GenomeInfoDbData_1.2.4   iterators_1.0.13         reshape2_1.4.4          
[201] gtable_0.3.0     
andreamrau commented 3 years ago

Thanks for the bug report @OceaneCsn! I will look into this and try to figure out what's happening with the Poisson model + K-means when running parallel = TRUE, I'll let you know once (if?) I figure out what is going on.

andreamrau commented 3 years ago

Hi @OceaneCsn, I've fixed the issue. I had forgotten to pass the seed argument to the parallel processes for the Poisson/K-means cases, and that has now been fixed in the latest version (1.15.6 -> currently available in the master branch of this GitHub repo, it will probably take at least a day or two to show up on Bioconductor devel).

Let me know if you notice any other bugs, I appreciate your feedback!

OceaneCsn commented 3 years ago

Thank you @andreamrau for fixing this so quickly! I'll let you known if I find anything else :)