fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0
614 stars 42 forks source link

Loading the fst packages initializes the RNG even though it is not used #251

Closed mlell closed 3 years ago

mlell commented 3 years ago

To make sure that all my scripts are reproducible, I ensure that they either receive a command line argument that sets the RNG seed via set.seed() or that they do not use the RNG. As the R variable .Random.seed is defined on the first RNG run in a session, I throw an error if .Random.seed exists but no SEED was provided to the script.

However, loading the fst packages creates .Random.seed even though the RNG is not used. I show this in the script below: Running sample() before and after loading the packages does not give a different result, but .Random.seed is created.

To provide a means of writing verifiably deterministic R scripts, fst should not initialize the RNG if it is not needed.

I installed the development version of fst using remotes::install_github("fstpackage/fst", ref = "59a18110"). This is the script I use:


RNG_status <- function(id){
  message(sprintf("%10s: .Random.seed exists: ", id), exists(".Random.seed"))
}

cmd <- commandArgs(TRUE)[1]
if(cmd == ".Random.seed"){

  RNG_status("begin")
  message("FST RNG tester")
  RNG_status("message")
  loadNamespace("dplyr")
  RNG_status("loadNamespace(dplyr)")
  loadNamespace("fst")
  RNG_status("loadNamespace(fst)")

}else if(cmd == "samplertest_1"){

  set.seed(123)
  message("Random number: ", sample(1:100, 1))

}else if(cmd == "samplertest_2"){

  set.seed(123)
  loadNamespace("fst")
  message("Random number: ", sample(1:100, 1))

}

And this is the script output in my shell ($ = my input)

$ Rscript test.R .Random.seed
     begin: .Random.seed exists: FALSE
FST RNG tester
   message: .Random.seed exists: FALSE
loadNamespace(dplyr): .Random.seed exists: FALSE
loadNamespace(fst): .Random.seed exists: TRUE
$ Rscript test.R samplertest_1
Random number: 31
$ Rscript test.R samplertest_2
Random number: 31

you can see that .Random.seed is set by the loadNamespace("fst") call but not for (example) loadNamespace("dplyr"). Moreover, a random number between 1 and 100 is chosen identically regardless of whether it is before or after the loadNamespace("fst") call.

MarcusKlik commented 3 years ago

Hi @mlell, thanks for submitting your issue!

Parameter .Random.seed is not set directly in fst, but perhaps it's set in some of it's dependencies? Also, I found that when I run your script in a fresh R session after starting rstudio, parameter .Random.seed is already initialized. But starting a new session from the R GUI apparently does not initialize .Random.seed, so something is going on with rstudio as well.

I'm not sure I understand your samplertest_1 and samplertest_2 tests, doesn't that show that you can set the random seed and it's not changed by loading the fst namespace?

(so loading the namespace only initializes the seed when it's not initialized yet?)

mlell commented 3 years ago

@MarcusKlik , samplertest_1 and samplertest_2 show that the RNG is not used by fst, no random number is actually generated during package init. This was meant to show that fst (or one of the dependencies) does not need the RNG for anything.

But the example above (".Random.seed") shows that it (or a dep) does initialize the RNG, in contrast to other packages like dplyr (i chose that because dplyr makes heavy use of C code. I suspected that the RNG is initialized by Rcpp or so, but apparently it isn't)

Yes I also suspected first that RStudio initializes the RNG, that's why I included the "negative controls" and ran the scripts in base R to eliminate this source of errors.

I hope I managed to be clearer this time...

riccardoporreca commented 3 years ago

@MarcusKlik, @mlell, this is due to the functions called .onLoad(): https://github.com/fstpackage/fstcore/blob/c96028e320e8d723f991ef6134dc2245b6e5f1b0/R/package.R#L209 Those are C++ functions exported to R using // [[Rcpp::export]], which are wrapped in RcppExports.cpp with code handling the RNG scope, which I believe implicitly initialize the internal state .Random.seed.

I did a quick check on a local checkout of branch master, replacing all // [[Rcpp::export]] // [[Rcpp::export(rng = false)]]: .Random.seed is then not created upon package loading. It might be a good idea to switch to [[Rcpp::export(rng = false)]], provided the Rcpp-exported functions never use the R random number generator. See the Rcpp-attributes vignette (https://github.com/RcppCore/Rcpp/blob/d6594bae57d796f0385bf9552532422e59a6e56d/vignettes/rmd/Rcpp-attributes.Rmd#L351).

A nice example of this is package qs, with rng = false set / not set e.g. in https://github.com/traversc/qs/blob/953af431514481a25cfe74093fa3a74220e653d8/src/extra_functions.h

NOTE: I am referring to the master branch of fst, develop is now based on fstcore, where the same considerations apply.

mlell commented 3 years ago

@riccardoporreca, thank you for this research! So is it as simple as adding this attribute, or should we include tests that make sure that the RNG is not used? But I don't know how to do that. First I thought of something similar like in the script above:

set.seed(123) # we know that the first random number in {1..100} with this seed will be 31
C_function_to_test()
expect_equal(sample(1:100, 1), 31)

But would that work? If a C function uses the RNG without initializing, maybe the RNG state will not be forwarded to R so the test cannot detect that? Did I understand that correctly?

Or am I overthinking this?

riccardoporreca commented 3 years ago

@mlell, I think one should just follow what the vignettes recommends

If you are certain that no C++ code will make use of random number generation and the 2ms of execution time is meaningful in your context, you can disable the automatic injection of RNGScope using the rng parameter of the Rcpp::export attribute

I doubt package maintainers would have to rely on unit tests to know if their C++ code uses RNG, and if that is the case (unless completely expected, like in C++ functions doing random things) this should be investigated and at least documented.

Overall I think we can summarize the matter as follows, which makes it ultimately a matter of choice for @MarcusKlik.

MarcusKlik commented 3 years ago

Hi @riccardoporreca, great, thanks a lot for clearing that up, I didn't know about Rcpp's automatic establishing of the RNGScope!

In that case, I think we can safely use the Rcpp::export(rng = false) option and claim those 2 ms of startup time :-)

MarcusKlik commented 3 years ago

Hi @mlell and @riccardoporreca, with the latest commit (in fstcore), the random seed is not initialized anymore, thanks a lot for the find and solution!

exists(".Random.seed")
#> [1] FALSE

require(fstcore)
#> Loading required package: fstcore

exists(".Random.seed")
#> [1] FALSE