Hmm, actually, in principle all `ts_*()` statistics functions could operate on those replicates, calculate CIs, etc., so maybe they could be made "replicate-aware". But in the interest of simplicity, let's take things one simple step at a time.
That would be very cool. It might be enough to store the results in a list to which `ts_*()` statistics could be easily `lapply()`'d.
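For instance, a rough sketch of what that could look like from the user's side, using nothing but plain `sapply()` over the existing `ts_*()` functions (here `ts_list` and `sample_sets` are stand-ins for a list of replicate tree sequences and a list of sample names, and I'm assuming `ts_diversity()` returns its estimates in a `diversity` column):

```r
# one diversity estimate per replicate (assuming a single sample set)
pi_per_rep <- sapply(ts_list, function(ts)
  ts_diversity(ts, sample_sets = sample_sets)$diversity
)

# a crude 95% interval across the replicate estimates
quantile(pi_per_rep, probs = c(0.025, 0.975))
```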
The way I've been getting around this is with an R skeleton. The first step is to create a grid of parameters (it could just as easily be one set of parameters, but with multiple replicates), for example:
```r
# setting up parameters
n <- 2000
Ns <- n
mat_dists <- c(0.1)        # mating distance
comp_dists <- c(0.2)       # competition distance
disp_dists <- c(1, 4)      # mean dispersal distance
disp_funs <- c("brownian")
reps <- 1                  # this could just as well be a vector of more replicates
ngens <- 50

# let's make a parameter table
pars <- expand.grid(N = Ns,
                    mat_dist = mat_dists,
                    comp_dist = comp_dists,
                    disp_fun = disp_funs,
                    disp_dist = disp_dists,
                    rep = reps,
                    ngen = ngens)
```
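For reference, with the values above the grid contains just two parameter combinations, differing only in the dispersal distance:

```r
pars
#>      N mat_dist comp_dist disp_fun disp_dist rep ngen
#> 1 2000      0.1       0.2 brownian         1   1   50
#> 2 2000      0.1       0.2 brownian         4   1   50
```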
Now, I will set up the simulation and iterate through the grid:
```r
# defining a region
map <- world(
  xrange = c(-20, 20), # min-max longitude
  yrange = c(-20, 20), # min-max latitude
  landscape = "blank"
)

# these lists will hold the tree sequences (simplified and unsimplified)
trees_big <- list()
trees_big_us <- list()

for (row in seq_len(nrow(pars))) { # iterate through the parameter table
  # define a population
  pop <- population("POP",
                    time = 1,
                    N = pars[row, "N"],
                    center = c(0, 0),
                    radius = 20,
                    map = map,
                    mate_dist = pars[row, "mat_dist"],
                    competition_dist = pars[row, "comp_dist"],
                    dispersal_fun = as.character(pars[row, "disp_fun"]),
                    dispersal_dist = pars[row, "disp_dist"])

  # compile the model
  model <- compile(
    populations = pop,
    generation_time = 1,
    sim_length = ngens,
    resolution = 1,
    path = "~/Desktop/test-model",
    overwrite = TRUE
  )

  # sampling schedule
  samples <- sampling(model, times = ngens, list(pop, n / 2))

  # simulate
  slim(
    model, sampling = samples,
    sequence_length = 1,
    recombination_rate = 0, # simulate only a single locus
    method = "batch",       # change to "gui" to execute the model in SLiMgui
    random_seed = row,      # the parameter-table row controls the seed
    verbose = FALSE,
    retainall = TRUE,
    burnin = 0
  )

  # extract the tree sequence (recapitated and simplified)
  ts <- ts_load(model) %>%
    ts_recapitate(Ne = pars[row, "N"], recombination_rate = 0) %>%
    ts_simplify()

  # extract the unsimplified tree sequence
  tsu <- ts_load(model, simplify = FALSE)

  # collect the tree sequences
  trees_big[[row]] <- ts
  trees_big_us[[row]] <- tsu

  print(paste("Simulation number:", row))
}
```
In order to control the seed, I use the row of the parameter grid. Some data frame fiddling now lets me pool a bunch of results into one data frame, each annotated with the parameter combination it was run with.
```r
# merge all the spatial tree data into one data frame to plot
tree_data_big <- data.frame()
for (row in seq_len(nrow(pars))) {
  tree <- trees_big[[row]]
  tree_data_big_i <- tree %>%
    ts_data() %>%
    rowwise() %>%
    mutate(N = pars[row, "N"],
           fun = pars[row, "disp_fun"],
           rep = as.factor(pars[row, "rep"]),
           mat_dist = pars[row, "mat_dist"],
           comp_dist = pars[row, "comp_dist"],
           disp_dist = pars[row, "disp_dist"],
           sim = as.factor(row),
           x = unlist(location)[1],
           y = unlist(location)[2]) %>%
    ungroup()
  tree_data_big <- rbind(tree_data_big, tree_data_big_i)
  print(paste("Converting tree no", row))
}
```
```r
# the same merge for the unsimplified tree sequences
tree_data_bigu <- data.frame()
for (row in seq_len(nrow(pars))) {
  tree <- trees_big_us[[row]]
  tree_data_bigu_i <- tree %>%
    ts_data() %>%
    rowwise() %>%
    mutate(N = pars[row, "N"],
           fun = pars[row, "disp_fun"],
           rep = as.factor(pars[row, "rep"]),
           mat_dist = pars[row, "mat_dist"],
           comp_dist = pars[row, "comp_dist"],
           disp_dist = pars[row, "disp_dist"],
           sim = as.factor(row),
           x = unlist(location)[1],
           y = unlist(location)[2]) %>%
    ungroup()
  tree_data_bigu <- rbind(tree_data_bigu, tree_data_bigu_i)
  print(paste("Converting tree no", row))
}
```
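(For what it's worth, each of the two loops above could be collapsed into a single `lapply()` + `do.call(rbind, ...)`, which avoids growing the data frame row by row. A sketch for the simplified trees, assuming the same objects as above; the remaining parameter columns are added the same way as in the loop:)

```r
tree_data_big <- do.call(rbind, lapply(seq_len(nrow(pars)), function(row) {
  trees_big[[row]] %>%
    ts_data() %>%
    rowwise() %>%
    mutate(x = unlist(location)[1],
           y = unlist(location)[2]) %>%
    ungroup() %>%
    mutate(sim = as.factor(row),
           disp_dist = pars[row, "disp_dist"]) # ... plus the other parameters
}))
```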
This is not an elegant solution, but it does the job! Further down the line it might be nice to be able to feed slendr not just replicates, but also vectors of parameters which it could run on a grid. But that might be annoying to implement for now! Simply being able to run replicates would already go a long way towards addressing this issue.
In the end I have decided that adding an additional layer of parallelization across the `ts_*()` functions (which would be needed to make parallel runs of `msprime()` and `slim()` worthwhile) is probably not worth the effort. In particular, the tradeoff of complicating the package internals to make this happen would not bring as many benefits as I'd like.
This is especially true given how beautifully elegant and simple the future family of R packages makes implicit parallel computation.
If anyone encounters this issue at a later point, consider taking a look at the link above. I'm personally a huge fan of the packages furrr and future.apply. In fact, I've been using this approach heavily in my work on the new R package demografr (which is also what made me appreciate how much work it is to implement easy parallelization in a satisfying and robust way).
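As a rough illustration only (this is not slendr functionality; `model` and `samples` are assumed to be a compiled slendr model and a sampling schedule created earlier in the session, and the exact `msprime()` argument names may differ between slendr versions), running replicates of an already compiled model in parallel could look something like this:

```r
library(slendr)
library(future)
library(future.apply)

plan(multisession, workers = 4)  # or plan(sequential) while debugging

# each replicate gets its own seed; future.seed = TRUE keeps R's RNG use
# in the parallel workers statistically sound
ts_reps <- future_lapply(1:100, function(rep_i) {
  msprime(model, samples = samples,
          sequence_length = 1e6, recombination_rate = 1e-8,
          random_seed = rep_i)
}, future.seed = TRUE)
```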
It came up a couple of times already, so it's about time to implement it in the slendr interface.
It would be useful to have a way to have slendr automatically generate multiple simulation replicates from a single compiled model.
I'm thinking of something like `slim(model_object, ..., nreps = <number of replicates>)` and `msprime(model_object, ..., nreps = <number of replicates>)`. Maybe even with an optional `parallel = <TRUE|FALSE>`?

What this would do is create a set of outputs such as `output_slim_rep1.trees`, etc. (in the case of automatically named output tree-sequence files, when no output prefix is specified), or `<prefix>_slim_rep1.trees` when a user-defined output name has been specified.

Also, at the moment, if `ts_load(model)` is run and multiple tree-sequence files are present in the default model output directory, slendr complains about not knowing which tree sequence to load. There is a parameter `ts_load(model, file = <path to a tree-sequence file>, ...)` which takes care of this. So I wouldn't introduce a replicate number there, but would let the user specify the full path.

Thoughts, @mkiravn?
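(For illustration, loading hypothetical per-replicate outputs with the existing `ts_load(model, file = ...)` interface could then look something like the sketch below. The `output_slim_rep<i>.trees` naming pattern is just the one proposed above, not something slendr produces today, and I'm assuming the compiled model stores its output directory in `model$path`.)

```r
# hypothetical replicate files following the naming scheme proposed above
rep_files <- file.path(model$path, sprintf("output_slim_rep%d.trees", 1:10))

# load each replicate explicitly via the `file =` argument
ts_reps <- lapply(rep_files, function(f) ts_load(model, file = f))
```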