adw96 / DivNet

diversity estimation under ecological networks

divnet fails silently if one of the parallelization packages is not installed. #68

Closed mmp3 closed 3 years ago

mmp3 commented 3 years ago

divnet fails silently if one of the parallelization packages is not installed.

> res <-  DivNet::divnet(psobj, X = "condition" , ncores = 8)
  |======================================================================| 100%
  |======================================================================| 100%
  |======================================================================| 100%
  |======================================================================| 100%
  |======================================================================| 100%
  |======================================================================| 100%
There were 36 warnings (use warnings() to see them)

Checking the warnings:

> warnings()
Warning messages:
1: In MCmat(Y = Y_p, W = W, eY = eY, N = N, Q = Q, base = base,  ... :
  Running in series; one of the packages doParallel, foreach or doSNOW is missing
2: In MCmat(Y = Y_p, W = W, eY = eY, N = N, Q = Q, base = base,  ... :
  Running in series; one of the packages doParallel, foreach or doSNOW is missing
3: In MCmat(Y = Y_p, W = W, eY = eY, N = N, Q = Q, base = base,  ... :
  Running in series; one of the packages doParallel, foreach or doSNOW is missing
...

and many of the entries of the res$"bray-curtis" matrix are ~1e-16.

It turns out that the doSNOW package was not installed on my system. After installing doSNOW, those warnings no longer appear and the res$"bray-curtis" matrix has entries >> 1e-16.

mooreryan commented 3 years ago

Here is a pretty long-winded explanation of what is most likely going on with your issue. Hopefully you will find it helpful!

The "Running in series" warning

First, the warning "Running in series; one of the packages doParallel, foreach or doSNOW is missing" is not actually an error. It comes from this part of the code:

https://github.com/adw96/DivNet/blob/bea62385d3dfa5e9423f894b2049f4a4138ae54a/R/MCmat.R#L66

Here is the context:

    if (ncores > 1) {
      warning("Running in series; one of the packages doParallel, foreach or doSNOW is missing")
    }
    Y.MH <-  foreach(i = 1:N, .combine = 'acomb3', .multicombine = TRUE, .packages = "foreach") %do% {
      MH_path(i)
    }

So if you don't have all of those packages installed but request more than one core, you trigger the warning and the serial ("series") version of the algorithm runs instead. Nothing failed: you still get valid results. Installing those packages silences the warning because the parallel path can now run, but that isn't what actually changed your numbers.
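The fallback pattern itself is simple. Here is a minimal sketch of the same idea in Python (hypothetical names, not DivNet's code): if the parallel machinery is unavailable, warn and run the tasks serially, producing the same results either way.

```python
import warnings

def run_tasks(tasks, ncores=1, have_parallel_backend=False):
    """Run tasks, falling back to a serial loop when no parallel backend exists.

    have_parallel_backend stands in for "doParallel, foreach and doSNOW
    are all installed" in DivNet's MCmat.
    """
    if ncores > 1 and not have_parallel_backend:
        warnings.warn("Running in series; parallel backend is missing")
    # Serial fallback, analogous to foreach(...) %do% { ... } in R:
    return [task() for task in tasks]
```

The point is that the warning branch only changes how the work is scheduled, not what is computed.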

How R stores numbers

I think what is going on in your case is that you have run into a couple of interesting things about R. First is the way R stores numbers. When you say you get numbers ~1e-16, I'm betting they were something like 1.110223e-16, right? Check this out:

> (0.1 * 6) - (6 / 10)
[1] 1.110223e-16
> 0.6 - (6/10)
[1] 0

You would expect from a math standpoint those two to be the same (i.e., 0), but numerically they are not (see https://stat.ethz.ch/pipermail/r-help/2012-March/307139.html for more).

If you look at the .Machine variable in R, you will get a lot of info about the numerical characteristics of the machine R is running on. As far as I know, R is required to use IEEE 754 floating point numbers (see https://stat.ethz.ch/R-manual/R-devel/library/base/html/double.html). These have a 53-bit significand, so you get 15 or 16 significant decimal digits (see https://en.wikipedia.org/wiki/IEEE_754#Basic_and_interchange_formats).

Given that, check out some numbers from .Machine (on my machine at least).

> .Machine$double.base
[1] 2
> .Machine$double.neg.ulp.digits
[1] -53
> .Machine$double.base ^ .Machine$double.neg.ulp.digits
[1] 1.110223e-16
> .Machine$double.neg.eps
[1] 1.110223e-16

So basically, depending on how things shake out, you can get a value of 1.110223e-16 when logically you would expect zero. It's similar to how you often see p-values reported as < 2.2e-16 by R (see https://stats.stackexchange.com/a/78840).
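This isn't specific to R, either: any language using IEEE 754 doubles shows the same residue, and at this magnitude it is exactly 2^-53 (the .Machine$double.neg.eps value above). A quick check in Python:

```python
# The classic IEEE 754 double residue: 0.1 is not exactly representable,
# so "six tenths" computed two different ways can differ by one ulp near 0.6.
residue = (0.1 * 6) - (6 / 10)
print(residue)         # 1.1102230246251565e-16
print(2.0 ** -53)      # the same value as .Machine$double.neg.eps in R
print(0.6 - (6 / 10))  # 0.0 -- both literals round to the same double
```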

So why did the numbers change after you reran with doSNOW?

Okay, finally: why did the numbers change to zeros when you reran after installing doSNOW? Here is the second interesting thing you've run into: how R deals with random number generation. Here is an example from my computer, which does have doSNOW installed.

library(doSNOW)
library(DivNet)
data(Lee)

seed <- 12345

set.seed(seed)
divnet_phylum <-  divnet(tax_glom(Lee, taxrank="Phylum"),
                         X = "char",
                         ncores = 1,
                         B = 2)
a <- divnet_phylum[["bray-curtis"]][c("BW1", "R10", "R11BF", "R2"), "R6"]

set.seed(seed)
divnet_phylum <-  divnet(tax_glom(Lee, taxrank="Phylum"),
                         X = "char",
                         ncores = 1,
                         B = 2)
b <- divnet_phylum[["bray-curtis"]][c("BW1", "R10", "R11BF", "R2"), "R6"]

set.seed(seed)
divnet_phylum <-  divnet(tax_glom(Lee, taxrank="Phylum"),
                         X = "char",
                         ncores = 2,
                         B = 2)
c <- divnet_phylum[["bray-curtis"]][c("BW1", "R10", "R11BF", "R2"), "R6"]

set.seed(seed)
divnet_phylum <-  divnet(tax_glom(Lee, taxrank="Phylum"),
                         X = "char",
                         ncores = 2,
                         B = 2)
d <- divnet_phylum[["bray-curtis"]][c("BW1", "R10", "R11BF", "R2"), "R6"]

a
b
c
d

And here is the output:

> a
         BW1          R10        R11BF           R2 
2.212139e-01 1.655468e-01 3.063938e-01 1.110223e-16 
> b
         BW1          R10        R11BF           R2 
2.212139e-01 1.655468e-01 3.063938e-01 1.110223e-16 
> c
      BW1       R10     R11BF        R2 
0.2200620 0.1649343 0.3072189 0.0000000 
> d
      BW1       R10     R11BF        R2 
0.2156653 0.1688249 0.3038798 0.0000000 

So what's going on? Notice how before each run I use set.seed? For a and b I run divnet with 1 core. This makes the foreach code run serially, so the random seed is used, and a and b are the same since I reset it before each call. For the next two, I use 2 cores, which uses the parallel logic instead. It is a quirk of how that works that the seed set at the top level is not the one used by the individual workers in the cluster. (This is the reason a package like doRNG was created.) So even though I use set.seed before each call to divnet, it only takes effect in the first two cases. See how the c and d vectors differ from each other and from a and b.

Also notice that in the first run, the R2 vs R6 value is 1.110223e-16 rather than 0. That's the numerical precision stuff again.

Summary

The random seed inside the parallel version of foreach isn't being applied properly. Fixing it isn't exactly straightforward, but one way is to use a package like doRNG to manage random seeds inside the cluster. (pinging @adw96 & @bryandmartin)
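For anyone curious, the idea behind doRNG can be sketched in Python (just an illustration of the concept, not DivNet or doRNG code): derive one independent, reproducible RNG stream per task from a single master seed, so results no longer depend on which worker runs which task.

```python
import random

def reproducible_streams(master_seed, n_tasks):
    """One pre-seeded RNG per task, all derived from a single master seed."""
    master = random.Random(master_seed)
    # Each task's seed is a pure function of (master_seed, task index), so
    # handing these streams to parallel workers gives scheduling-independent,
    # reproducible output -- unlike relying on each worker's own RNG state.
    return [random.Random(master.getrandbits(64)) for _ in range(n_tasks)]

a = [rng.random() for rng in reproducible_streams(12345, 4)]
b = [rng.random() for rng in reproducible_streams(12345, 4)]
assert a == b  # identical across runs, however the tasks are scheduled
```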

adw96 commented 3 years ago

Great question @mmp3 - and a huge thank you for the comprehensive and perfect response @mooreryan! I have very little to add, except that the actual impact of the nasty irreproducible multiple-core random-seed issue can be somewhat mitigated by increasing the accuracy of the maximum likelihood search, including by setting tuning to "fast", or better yet, increasing EMiter and MCiter. To increase the accuracy of the estimated variances, you can increase B (for the parametric bootstrap option) or nsub (for the nonparametric bootstrap option).

The trade-off for retaining more accuracy (and more robustness to Monte Carlo variation) is time - both of the above will increase the runtime of divnet.

Thanks for the ping @mooreryan - at this time it's not a priority for us to fix this random seed issue (and we are a little stretched for resources - as you can tell by the date stamps), but we welcome help from anyone on this!

Thank you both for your great questions and answers.