DavisLaboratory / singscore

An R/Bioconductor package that implements a single-sample molecular phenotyping approach
https://davislaboratory.github.io/singscore/
40 stars 5 forks source link

generateNull unable to get reproducible results #21

Closed dbdimitrov closed 3 years ago

dbdimitrov commented 3 years ago

Dear Dharmesh,

I am a PhD student at Heidelberg University and I am interested in implementing singscore in a project of ours. However, I am experiencing some issues with generating reproducible p-values despite setting the seed to a predefined value. I also double checked and this appears to also be the case when following the code provided on the singscore Bioconductor page.

Please see the information about my R session below.

Thank you in advance for your time and response.

Kindest regards,

Daniel

R version 4.0.2 (2020-06-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.1 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] GSEABase_1.50.1 graph_1.66.0 annotate_1.66.0 XML_3.99-0.5 AnnotationDbi_1.50.3 IRanges_2.22.2 S4Vectors_0.26.1 Biobase_2.48.0
[9] BiocGenerics_0.34.0 singscore_1.8.0

loaded via a namespace (and not attached): [1] locfit_1.5-9.4 Rcpp_1.0.5 lattice_0.20-41 tidyr_1.1.2 digest_0.6.26 R6_2.4.1
[7] GenomeInfoDb_1.24.2 plyr_1.8.6 RSQLite_2.2.1 ggplot2_3.3.2 pillar_1.4.6 zlibbioc_1.34.0
[13] rlang_0.4.8 rstudioapi_0.11 blob_1.2.1 Matrix_1.2-18 stringr_1.4.0 RCurl_1.98-1.2
[19] bit_4.0.4 munsell_0.5.0 tinytex_0.26 DelayedArray_0.14.1 compiler_4.0.2 xfun_0.18
[25] pkgconfig_2.0.3 tidyselect_1.1.0 SummarizedExperiment_1.18.2 tibble_3.0.4 GenomeInfoDbData_1.2.3 edgeR_3.30.3
[31] matrixStats_0.57.0 crayon_1.3.4 dplyr_1.0.2 bitops_1.0-6 grid_4.0.2 xtable_1.8-4
[37] gtable_0.3.0 lifecycle_0.2.0 DBI_1.1.0 magrittr_1.5 scales_1.1.1 stringi_1.5.3
[43] XVector_0.28.0 reshape2_1.4.4 limma_3.44.3 ellipsis_0.3.1 generics_0.0.2 vctrs_0.3.4
[49] tools_4.0.2 bit64_4.0.5 glue_1.4.2 purrr_0.3.4 colorspace_1.4-1 GenomicRanges_1.40.0
[55] memoise_1.1.0

bhuvad commented 3 years ago

Hi Daniel,

Thank you for using our method and apologies for the delay in responding. Would you mind sending my the code snippet from the example that you used to assess reproducibility? This will help speed-up assessment of the issue on my end and help me in fixing the bug quickly if there is one. I will aim to fix any bugs within a week.

Cheers, Dharmesh

dbdimitrov commented 3 years ago

Dear Dharmesh,

Thank you for your response! I have used the code provided in BioConductor with the the default data set and gene set for my example. What you will notice is that even though the seed is set to 1, each run of generateNull returns a different output. We assumed that it might be an OS-specific issue, but we observed it on both Linux and Mac.

Thank you in advance for your help.

Kindest regards,

Daniel

Singscore Example

Load prerequisites

library(singscore) library(GSEABase)

rankGenes and get singscores with provided data set

rankData <- rankGenes(tgfb_expr_10_se) scoredf <- simpleScore(rankData, upSet = tgfb_gs_up, downSet = tgfb_gs_dn)

1. First Run ----

run generateNull() with singscore provided gene set and data

permuteResult <- generateNull( upSet = tgfb_gs_up, downSet = tgfb_gs_dn, rankData = rankData, subSamples = 1:5, centerScore = TRUE, knownDirection = TRUE, B = 1000, ncores = 1, seed = 1, useBPPARAM = NULL ) head(permuteResult)

D_Ctrl_R1 D_TGFb_R1 D_Ctrl_R2 D_TGFb_R2 Hes_Ctrl_R1 1 0.03048780 0.04755790 0.02718137 0.04174301 0.03514975 2 -0.03317024 -0.02376684 -0.03265227 -0.01915243 -0.02831882 3 -0.02090430 -0.01900356 -0.01587744 -0.01892114 -0.01765831 4 -0.08828924 -0.08244022 -0.08791860 -0.08569179 -0.08839945 5 -0.02600196 -0.02770218 -0.02372295 -0.02441631 -0.02159512 6 -0.02008674 -0.02889647 -0.01170482 -0.02817693 -0.01562662

Get p-vals

pvals <- getPvals(permuteResult, scoredf, subSamples = 1:5)

D_Ctrl_R1 D_TGFb_R1 D_Ctrl_R2 D_TGFb_R2 Hes_Ctrl_R1 0.989 0.001 0.998 0.001 0.515

2. Second Run ----

run generateNull() with singscore provided gene set and data

permuteResult <- generateNull( upSet = tgfb_gs_up, downSet = tgfb_gs_dn, rankData = rankData, subSamples = 1:5, centerScore = TRUE, knownDirection = TRUE, B = 1000, ncores = 1, seed = 1, useBPPARAM = NULL ) head(permuteResult)

D_Ctrl_R1 D_TGFb_R1 D_Ctrl_R2 D_TGFb_R2 Hes_Ctrl_R1 1 -0.04238276 -0.03649935 -0.04094168 -0.03601640 -0.04223780 2 -0.03469320 -0.03136902 -0.03249239 -0.02857177 -0.02905655 3 0.02005870 0.02497933 0.01955014 0.01952887 0.02176913 4 0.03745370 0.03807967 0.03780389 0.03778464 0.04037655 5 -0.03255184 -0.03266313 -0.03384257 -0.03740144 -0.02979525 6 -0.01969287 -0.01979627 -0.02150899 -0.01823189 -0.01998047

Get p-vals

pvals <- getPvals(permuteResult, scoredf, subSamples = 1:5) pvals

D_Ctrl_R1 D_TGFb_R1 D_Ctrl_R2 D_TGFb_R2 Hes_Ctrl_R1 0.995 0.001 0.997 0.001 0.539

SessionInfo

R version 4.0.2 (2020-06-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.1 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=de_DE.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] GSEABase_1.50.1 graph_1.66.0 annotate_1.66.0 XML_3.99-0.5 AnnotationDbi_1.50.3 IRanges_2.22.2 S4Vectors_0.26.1
[8] Biobase_2.48.0 BiocGenerics_0.34.0 singscore_1.8.0

loaded via a namespace (and not attached): [1] SummarizedExperiment_1.18.2 locfit_1.5-9.4 tidyselect_1.1.0 purrr_0.3.4 reshape2_1.4.4
[6] lattice_0.20-41 colorspace_1.4-1 vctrs_0.3.4 generics_0.0.2 blob_1.2.1
[11] rlang_0.4.8 pillar_1.4.6 glue_1.4.2 DBI_1.1.0 BiocParallel_1.22.0
[16] bit64_4.0.5 matrixStats_0.57.0 GenomeInfoDbData_1.2.3 lifecycle_0.2.0 plyr_1.8.6
[21] stringr_1.4.0 zlibbioc_1.34.0 munsell_0.5.0 gtable_0.3.0 memoise_1.1.0
[26] GenomeInfoDb_1.24.2 Rcpp_1.0.5 xtable_1.8-4 edgeR_3.30.3 scales_1.1.1
[31] limma_3.44.3 DelayedArray_0.14.1 XVector_0.28.0 bit_4.0.4 ggplot2_3.3.2
[36] digest_0.6.26 stringi_1.5.3 dplyr_1.0.2 GenomicRanges_1.40.0 grid_4.0.2
[41] tools_4.0.2 bitops_1.0-6 magrittr_1.5 RCurl_1.98-1.2 RSQLite_2.2.1
[46] tibble_3.0.4 crayon_1.3.4 tidyr_1.1.2 pkgconfig_2.0.3 ellipsis_0.3.1
[51] Matrix_1.2-18 rstudioapi_0.11 R6_2.4.1 compiler_4.0.2

bhuvad commented 3 years ago

Hi Daniel,

I have dug into the issue and figured out what's going on. This is one of the things to do with BiocParallel. Basically, if you run multicore jobs, you would mostly use the SnowParam, DoparParam or the MulticoreParam backends. When using either of these and specifying ncores within the generateNull() to a number greater than 2, the function produces the same null distributions. This should produce reproducible distributions across different OS and backends.

However, when you run a single core job (such as by setting ncores = 1 or by using SerialParam), random numbers are not generated using the seed. In this case, one option is to use set.seed() within the function, however, Bioconductor strongly discourages calling that function within our functions. Instead, the recommended behaviour is to call set.seed() at the user-end, that is, outside the generateNull() function.

set.seed(10)
generateNull(...)

In summary, if you were to run single core jobs, you should set the seed and then call generateNull(). For multicore jobs (where ncores > 1), the random number generator works as is and no further modifications need to be performed from the user end. This behaviour is known and has been discussed quite a bit in the Bioconductor community. Following this observation, we will be modifying the documentation of the package to demonstrate how you can generate reproducible distributions when using single core jobs. Additionally, we will be throwing a warning for single core jobs to ensure users are aware that they need to set the seed externally to get reproducible distributions. Thank you for pointing out this problem and assisting us in enhancing the functionality of our package. We really appreciate it!

Cheers, Dharmesh

dbdimitrov commented 3 years ago

Hi Dharmesh,

Thank you for the informative response.

I have indeed tried what you suggested, but unfortunately I still get different results from each run. FYI, I tried setting the seed both within the scope of the function and the global env, as well as changing ncores and also different combinations of these.

Would it be possible for you to provide your code and your sessionInfo()? So that I double check if there is anything different in my environment?

Thank you in advance.

Cheers,

Daniel

bhuvad commented 3 years ago

Hi Daniel,

Here is the code you can use to produce reproducible null distributions. The first option is for single core jobs, whereby, you need to set the seed externally. For multicore jobs, the existing code works just as it is. Please try this code and let me know if this works.

Cheers, Dharmesh

library(singscore)

rankData <- rankGenes(tgfb_expr_10_se)

#singlecore jobs
set.seed(123)
x <-
  generateNull(
    upSet = tgfb_gs_up,
    downSet = tgfb_gs_dn,
    rankData = rankData,
    subSamples = 1:5,
    centerScore = TRUE,
    knownDirection = TRUE,
    B = 100,
    ncores = 1,
    seed = 10
  )
set.seed(123)
y <-
  generateNull(
    upSet = tgfb_gs_up,
    downSet = tgfb_gs_dn,
    rankData = rankData,
    subSamples = 1:5,
    centerScore = TRUE,
    knownDirection = TRUE,
    B = 100,
    ncores = 1,
    seed = 10
  )
all(x == y)

#multicore jobs
x <-
  generateNull(
    upSet = tgfb_gs_up,
    downSet = tgfb_gs_dn,
    rankData = rankData,
    subSamples = 1:5,
    centerScore = TRUE,
    knownDirection = TRUE,
    B = 100,
    ncores = 2,
    seed = 10
  )
y <-
  generateNull(
    upSet = tgfb_gs_up,
    downSet = tgfb_gs_dn,
    rankData = rankData,
    subSamples = 1:5,
    centerScore = TRUE,
    knownDirection = TRUE,
    B = 100,
    ncores = 2,
    seed = 10
  )
all(x == y)

Below is my session Info.

R version 4.0.3 (2020-10-10) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 18363)

Matrix products: default

Random number generation: RNG: L'Ecuyer-CMRG Normal: Inversion Sample: Rejection

locale: [1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252 [4] LC_NUMERIC=C LC_TIME=English_Australia.1252

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] singscore_1.11.0

loaded via a namespace (and not attached): [1] SummarizedExperiment_1.20.0 locfit_1.5-9.4 tidyselect_1.1.0 purrr_0.3.4
[5] reshape2_1.4.4 lattice_0.20-41 colorspace_1.4-1 vctrs_0.3.4
[9] generics_0.1.0 snow_0.4-3 stats4_4.0.3 blob_1.2.1
[13] XML_3.99-0.5 rlang_0.4.8 pillar_1.4.6 glue_1.4.2
[17] DBI_1.1.0 BiocParallel_1.24.1 BiocGenerics_0.36.0 bit64_4.0.5
[21] matrixStats_0.57.0 GenomeInfoDbData_1.2.4 lifecycle_0.2.0 plyr_1.8.6
[25] stringr_1.4.0 zlibbioc_1.36.0 MatrixGenerics_1.2.0 munsell_0.5.0
[29] gtable_0.3.0 memoise_1.1.0 Biobase_2.50.0 IRanges_2.24.0
[33] GenomeInfoDb_1.26.0 parallel_4.0.3 AnnotationDbi_1.52.0 GSEABase_1.52.0
[37] Rcpp_1.0.5 edgeR_3.32.0 xtable_1.8-4 scales_1.1.1
[41] limma_3.46.0 DelayedArray_0.16.0 S4Vectors_0.28.0 graph_1.68.0
[45] annotate_1.68.0 XVector_0.30.0 bit_4.0.4 ggplot2_3.3.2
[49] digest_0.6.27 stringi_1.5.3 dplyr_1.0.2 GenomicRanges_1.42.0
[53] grid_4.0.3 tools_4.0.3 bitops_1.0-6 magrittr_1.5
[57] RCurl_1.98-1.2 RSQLite_2.2.1 tibble_3.0.4 crayon_1.3.4
[61] tidyr_1.1.2 pkgconfig_2.0.3 ellipsis_0.3.1 Matrix_1.2-18
[65] httr_1.4.2 rstudioapi_0.12 R6_2.5.0 compiler_4.0.3

dbdimitrov commented 3 years ago

Hi Dharmesh,

I tried the code that you provided on Linux and Windows OS. On Windows, I got reproducible results for both use cases. On Linux, I only got reproducible results when I set the Global seed prior to each execution (as intended), but not when using ncores >=2.

I assume that there is little that we can do from here on, given that this depends on an external package.

I truly appreciate your time and I will use your input to attempt to include singscore appropriately in our project. Given your GitHub information (if up-to-date), it might be a project that could be of interest to you, so I could contact again you once we make it public.

Kindest regards,

Daniel