audrey-b / sims

An R package to simulate data from JAGS or R Code
https://poissonconsulting.github.io/sims/
Other
1 stars 2 forks source link

add ability to use string of R code instead of JAGS code #11

Closed joethorley closed 5 years ago

joethorley commented 5 years ago

R code could be identified by absence of ~

audrey-b commented 5 years ago

I like the idea.

Here is another one if we want more flexibility: we can use an engine= argument instead that would default to R but the user can say engine="jags" when they want to simulate with jags. That would give us more flexibility in case we ever want to simulate with STAN or anything else.

Then if an error ensues because the user didn't specify engine=JAGS, the code is checked to see if there is a ~, and warning is printed to say that jags was used?

joethorley commented 5 years ago

I would be cool to use STAN etc. It may be possible to identify all the engines we need from the code structures so to keep things as elegant as possible I think we should hold off on the engine argument until we hit a case where we need it.

joethorley commented 5 years ago

it's not going to work with R code or even STAN code because we need the <- and ~ to determine stochasticity unless we save the names of all possible functions that generate random numbers.

audrey-b commented 5 years ago

Instead on focusing on stochasticity, we could focus on "data". For example, in R we would define data as anything that is on the left hand side of <- or = but that is never used on the right hand side.

There would be no point for a user to define a variable on the left hand side if it is not later going to be used in the data generation process, unless it is data or some sort of constant that they are interested in.

So basically it seems to me that whatever the engine, we can ensure that we output all the data and derived constants. Anything else (something defined on the left and reused on the right) would be considered a latent variable. (Technically this could be a constant so we might want to rethink the choice of the term "latent", perhaps "intermediate")

Am I missing something?

In R, the operator %<>% might complicate things a bit as it implies that something that only appears on the left was implicitely reused on the right but that could be handled as well.

audrey-b commented 5 years ago

I am now seing that you've included an argument stochastic= and we could probably keep it but it would only apply to jags code

joethorley commented 5 years ago

That is correct. However in principle the issue runs deeper since its possible to do things like assign() in R code that would flummox the latent argument as well. Plus returned objects may not even be nlist objects (which is guaranteed by jags).

Another approach would be to only accept models defined in the JAGS code (which was clearly designed to be easily parsed) but then allow the user to have the JAGS code automatically converted to R code. In fact this is quite an interesting concept because it could mean that rjags could just be in the suggests. Thoughts?

audrey-b commented 5 years ago

Right. I think that if we were to allow R code we shouldn't put any constraint on what type of object the output should be. If its not an nlists then it can't be analysed by simanalyse_analyse_bayesian but there could be another function simanalyse_analyse_custom where R code can be provided on how to analyse the objects produced by sims.

Sounds like STAN might be possible though? I've never used STAN so you would know better, but it seems like what I suggested above could work, and data could be simulated following the example in http://modernstatisticalworkflow.blogspot.com/2017/04/an-easy-way-to-simulate-fake-data-from.html

Regarding the second suggestion it sounds cumbersome to implement and any additional function/feature in new jags releases would have to be handled manually. Also I would feel more confident using jags as there is no possibility for a typo/error on our part in the conversion of code. Another problem is that jags code does not necessarily have to be ordered i.e. a ~ dnorm(mu, 1) mu ~ dnorm(0,1) would not work if translated to R because mu needs to be generated before a

joethorley commented 5 years ago

I agree it would be cumbersome and error prone to convert JAGS code to R. Ignore that idea!

joethorley commented 5 years ago

Some thoughts

With regard R code not returning an nlist object - the lack of constraints may make the code so general and abstract as to not be very useful....

There may be someway in which using R code to generate nlist objects ties in with mcmc_derive...

I've only used STAN for fitting models not simulating data - STAN models are fast but the syntax is way less elegant than BUGS/JAGS

joethorley commented 5 years ago

I've implemented sims now working with R code.

Take it for a spin and see what you think.

sims_simulate("a <- 1
                           b <- a", nsims = 1, stochastic = NA) 
set.seed(101)
sims_simulate("a <- runif(1, 0, 1)", nsims = 1, stochastic = NA)
audrey-b commented 5 years ago

Amazing! And so elegantly coded!

This worked wonderfully: library(magrittr) set.seed(101) sims_simulate("for(i in 1:10){ a[i] = rnorm(1,0,1)} b <- a %>% sum", nsims = 1, stochastic = NA, latent=NA)

audrey-b commented 5 years ago

I noticed that generate_r does not take seed as argument but generate_jags does, is that going to be a problem when parallelizing?

I get errors when I try this:

doParallel::registerDoParallel(3) library(magrittr) set.seed(101) sims_simulate("for(i in 1:10){ a[i] = rnorm(1,0,1)} b <- a %>% sum", nsims = 3, stochastic = NA, latent=NA, parallel=TRUE) Error in do.ply(i) : task 1 failed - "object 'a' not found" In addition: Warning messages: 1: : ... may be used in an incorrect context: ‘.fun(piece, ...)’

2: : ... may be used in an incorrect context: ‘.fun(piece, ...)’

sims_simulate("a ~ dnorm(0,1)", nsims = 3, stochastic = NA, latent=NA, parallel=TRUE) $a [1] -0.6998554

an nlists object of 3 nlist objects each with 1 natomic elementWarning messages: 1: : ... may be used in an incorrect context: ‘.fun(piece, ...)’

2: : ... may be used in an incorrect context: ‘.fun(piece, ...)’

joethorley commented 5 years ago

The problem is with your R code (it also errors out in a console)

> library(magrittr)
> set.seed(101)
> for(i in 1:10){
+ a[i] = rnorm(1,0,1)}
Error: object 'a' not found
> b <- a %>% sum
Error in eval(lhs, parent, parent) : object 'a' not found

Error: object 'a' not found

sims is replicable with R code (it doesn't need to formally pass a seed argument as it is passed implicitly through the .Random.seed global parameter which is updated by rinteger(sim).

> set.seed(101)
> sims_simulate("a = rnorm(1,0,1)", nsims = 1, stochastic = NA)
$a
[1] -1.707928

an nlists object of an nlist object with 1 natomic element
> 
> set.seed(101)
> sims_simulate("a = rnorm(1,0,1)", nsims = 1, stochastic = NA)
$a
[1] -1.707928

an nlists object of an nlist object with 1 natomic element
joethorley commented 5 years ago

It should be fine when in parallel because the sim number is used to update the seed although I'll write a test to confirm.

joethorley commented 5 years ago

Finally I've realized that we could do stochastic for R code if we have a look up table with all the random functions listed in ?Distributions and treat these as stochastic.

Thoughts?

audrey-b commented 5 years ago

Actually, this works in regular mode but not in parallel mode.

library(magrittr) set.seed(101) sims_simulate("a <- vector() for(i in 1:10){ a[i] = rnorm(1,0,1)} b <- a %>% sum", nsims = 1, stochastic = NA, latent=NA, parallel=TRUE)

I think the problem is that in parallel one must pass libraries to the workers. So in parallel mode, %>% is not recognized.

audrey-b commented 5 years ago

I like the ?Distributions idea!

joethorley commented 5 years ago

OK cool - I'll implement the distributions idea.

joethorley commented 5 years ago

You are correct about packages not being passed: we can add a paropts argument to sims_simulate since we are using llply ie a list of additional options passed into the foreach function when parallel computation is enabled. This is important if (for example) your code relies on external data or packages: use the .export and .packages arguments to supply them so that all cluster nodes have the correct environment set up for computing.

joethorley commented 5 years ago

added parapets argument and stochastic nodes now identified based on rdists