Closed joethorley closed 5 years ago
I like the idea.
Here is another one if we want more flexibility: we could use an engine= argument that defaults to R, and the user could say engine="jags" when they want to simulate with JAGS. That would give us more flexibility in case we ever want to simulate with STAN or anything else.
Then if an error ensues because the user didn't specify engine="jags", the code could be checked for a ~ and a warning printed to say that JAGS was used?
It would be cool to use STAN etc. It may be possible to identify all the engines we need from the code structure, so to keep things as elegant as possible I think we should hold off on the engine argument until we hit a case where we need it.
It's not going to work with R code, or even STAN code, because we need the <- and ~ to determine stochasticity, unless we save the names of all possible functions that generate random numbers.
Instead of focusing on stochasticity, we could focus on "data". For example, in R we would define data as anything that is on the left-hand side of <- or = but that is never used on the right-hand side.
There would be no point in a user defining a variable on the left-hand side if it is never later used in the data-generation process, unless it is data or some sort of constant that they are interested in.
So basically, whatever the engine, we can ensure that we output all the data and derived constants. Anything else (something defined on the left and reused on the right) would be considered a latent variable. (Technically this could be a constant, so we might want to rethink the choice of the term "latent"; perhaps "intermediate".)
Am I missing something?
In R, the operator %<>% might complicate things a bit, as it implies that something that only appears on the left was implicitly reused on the right, but that could be handled as well.
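As a rough base-R sketch of the left/right-hand-side idea above (`classify_vars` is a hypothetical name, not part of sims, and it ignores loops and %<>% for simplicity):

```r
# Hypothetical sketch: variables assigned with <- or = but never reused on a
# right-hand side are "data"; ones that are reused are latent/intermediate.
classify_vars <- function(code) {
  exprs <- as.list(parse(text = code))
  lhs <- character(0)
  rhs <- character(0)
  for (e in exprs) {
    if (is.call(e) && as.character(e[[1]]) %in% c("<-", "=")) {
      lhs <- c(lhs, all.vars(e[[2]]))  # names assigned to
      rhs <- c(rhs, all.vars(e[[3]]))  # names used in the definition
    }
  }
  list(data = setdiff(lhs, rhs), latent = intersect(lhs, rhs))
}

classify_vars("mu <- 0
a <- rnorm(1, mu, 1)")
# mu is reused on a right-hand side so it is latent; a is data
```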
I am now seeing that you've included a stochastic= argument; we could probably keep it, but it would only apply to JAGS code.
That is correct. However, in principle the issue runs deeper, since it's possible to do things like assign() in R code, which would flummox the latent argument as well. Plus, returned objects may not even be nlist objects (which is guaranteed by JAGS).
Another approach would be to only accept models defined in JAGS code (which was clearly designed to be easily parsed) but then allow the user to have the JAGS code automatically converted to R code. In fact this is quite an interesting concept, because it could mean that rjags could be moved to Suggests. Thoughts?
Right. I think that if we were to allow R code we shouldn't put any constraint on what type of object the output should be. If it's not an nlists object it can't be analysed by simanalyse_analyse_bayesian, but there could be another function, simanalyse_analyse_custom, where the user provides R code for analysing the objects produced by sims.
Sounds like STAN might be possible, though? I've never used STAN, so you would know better, but it seems like what I suggested above could work, and data could be simulated following the example in http://modernstatisticalworkflow.blogspot.com/2017/04/an-easy-way-to-simulate-fake-data-from.html
Regarding the second suggestion, it sounds cumbersome to implement, and any additional function/feature in new JAGS releases would have to be handled manually. Also, I would feel more confident using JAGS, as there is no possibility for a typo/error on our part in the conversion of code. Another problem is that JAGS code does not necessarily have to be ordered, i.e.

```
a ~ dnorm(mu, 1)
mu ~ dnorm(0, 1)
```

would not work if translated to R, because mu needs to be generated before a.
I agree it would be cumbersome and error prone to convert JAGS code to R. Ignore that idea!
Some thoughts
With regard to R code not returning an nlist object: the lack of constraints may make the code so general and abstract as to not be very useful...
There may be some way in which using R code to generate nlist objects ties in with mcmc_derive...
I've only used STAN for fitting models, not simulating data. STAN models are fast, but the syntax is way less elegant than BUGS/JAGS.
I've implemented sims now working with R code.
Take it for a spin and see what you think.
```r
sims_simulate("a <- 1
b <- a", nsims = 1, stochastic = NA)

set.seed(101)
sims_simulate("a <- runif(1, 0, 1)", nsims = 1, stochastic = NA)
```
Amazing! And so elegantly coded!
This worked wonderfully:

```r
library(magrittr)
set.seed(101)
sims_simulate("for(i in 1:10){
  a[i] = rnorm(1,0,1)}
  b <- a %>% sum", nsims = 1, stochastic = NA, latent = NA)
```
I noticed that generate_r does not take seed as an argument but generate_jags does; is that going to be a problem when parallelizing?
I get errors when I try this:
```r
doParallel::registerDoParallel(3)
library(magrittr)
set.seed(101)
sims_simulate("for(i in 1:10){
  a[i] = rnorm(1,0,1)}
  b <- a %>% sum", nsims = 3, stochastic = NA, latent = NA, parallel = TRUE)
#> Error in do.ply(i) : task 1 failed - "object 'a' not found"
#> In addition: Warning messages:
#> 1:
#> 2:
```

This one, however, returns output (with warnings):

```r
sims_simulate("a ~ dnorm(0,1)", nsims = 3, stochastic = NA, latent = NA, parallel = TRUE)
#> $a
#> [1] -0.6998554
#>
#> an nlists object of 3 nlist objects each with 1 natomic element
#> Warning messages:
#> 1:
#> 2:
```
The problem is with your R code (it also errors out in the console):

```r
> library(magrittr)
> set.seed(101)
> for(i in 1:10){
+   a[i] = rnorm(1,0,1)}
Error: object 'a' not found
> b <- a %>% sum
Error in eval(lhs, parent, parent) : object 'a' not found
Error: object 'a' not found
```
sims is replicable with R code (it doesn't need to formally pass a seed argument, as the seed is passed implicitly through the .Random.seed global variable, which is updated by rinteger(sim)).
```r
> set.seed(101)
> sims_simulate("a = rnorm(1,0,1)", nsims = 1, stochastic = NA)
$a
[1] -1.707928

an nlists object of an nlist object with 1 natomic element

> set.seed(101)
> sims_simulate("a = rnorm(1,0,1)", nsims = 1, stochastic = NA)
$a
[1] -1.707928

an nlists object of an nlist object with 1 natomic element
```
It should be fine in parallel because the sim number is used to update the seed, although I'll write a test to confirm.
Finally, I've realized that we could do stochastic for R code if we have a lookup table of all the random functions listed in ?Distributions and treat those as stochastic.
Thoughts?
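For example (a sketch rather than the package's implementation; the vector below is an assumed subset of the functions in ?Distributions, and `is_stochastic` is a hypothetical name):

```r
# Sketch: flag code as stochastic if it calls any of the random-generation
# functions documented in ?Distributions (subset shown for brevity).
rdists <- c("rnorm", "runif", "rbinom", "rpois", "rgamma", "rbeta",
            "rexp", "rchisq", "rt", "rcauchy", "rlnorm", "rlogis",
            "rnbinom", "rgeom", "rhyper", "rweibull", "rmultinom")

is_stochastic <- function(code) {
  any(all.names(parse(text = code)) %in% rdists)
}

is_stochastic("a <- rnorm(1, 0, 1)")  # TRUE
is_stochastic("b <- a + 1")           # FALSE
```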
Actually, this works in regular mode but not in parallel mode.
```r
library(magrittr)
set.seed(101)
sims_simulate("a <- vector()
  for(i in 1:10){
  a[i] = rnorm(1,0,1)}
  b <- a %>% sum", nsims = 1, stochastic = NA, latent = NA, parallel = TRUE)
```

I think the problem is that in parallel mode one must pass libraries to the workers, so %>% is not recognized.
I like the ?Distributions idea!
OK cool - I'll implement the distributions idea.
You are correct about packages not being passed: we can add a paropts argument to sims_simulate since we are using llply, i.e. a list of additional options passed to the foreach function when parallel computation is enabled. This is important if (for example) your code relies on external data or packages: use the .export and .packages arguments to supply them so that all cluster nodes have the correct environment set up for computing.
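A hypothetical call with such a paropts argument might look like this (the argument name and behaviour are assumed here, following plyr::llply's .paropts):

```r
# Assumed usage of the proposed paropts argument: forward .packages to the
# workers so %>% (from magrittr) is available on every cluster node.
library(magrittr)
doParallel::registerDoParallel(3)
sims_simulate("a <- vector()
               for(i in 1:10){ a[i] = rnorm(1, 0, 1) }
               b <- a %>% sum",
              nsims = 3, stochastic = NA, latent = NA, parallel = TRUE,
              paropts = list(.packages = "magrittr"))
```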
Added the paropts argument, and stochastic nodes are now identified based on rdists.
R code could be identified by the absence of a ~.
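A minimal sketch of that check (`is_jags_code` is a hypothetical name; note it would misfire on R code containing formulas, so it is only a heuristic):

```r
# Sketch: classify model code as JAGS if it contains a ~, otherwise as R.
is_jags_code <- function(code) {
  grepl("~", code, fixed = TRUE)
}

is_jags_code("a ~ dnorm(0, 1)")  # TRUE  -> treat as JAGS
is_jags_code("a <- rnorm(1)")    # FALSE -> treat as R
```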