Hmm, actually, in principle all `ts_*()` statistics functions could operate on those replicates, calculate CIs, etc., so maybe they could be made "replicate-aware". But in the interest of simplicity, let's take things one simple step at a time.
That would be very cool. It might be enough to store the results in a list to which `ts_*()` statistics could be easily `lapply()`'d.
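For instance, a rough sketch of what that could look like from the user's side, using nothing but plain `sapply()` over the existing `ts_*()` functions (here `ts_list` and `sample_sets` are stand-ins for a list of replicate tree sequences and a list of sample names, and I'm assuming `ts_diversity()` returns its estimates in a `diversity` column):

```r
# one diversity estimate per replicate (assuming a single sample set)
pi_per_rep <- sapply(ts_list, function(ts)
  ts_diversity(ts, sample_sets = sample_sets)$diversity
)

# a crude 95% interval across the replicate estimates
quantile(pi_per_rep, probs = c(0.025, 0.975))
```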
The way I've been getting around this is with an R skeleton. The first step is to create a grid of parameters (it could just as easily be one set of parameters, but with multiple replicates), for example:
```r
# setting up parameters
n <- 2000
Ns <- n
mat_dists <- c(0.1)        # mating distance
comp_dists <- c(0.2)       # competition distance
disp_dists <- c(1, 4)      # mean dispersal distance
disp_funs <- c("brownian")
reps <- 1                  # this could just as well be a vector of more replicates
ngens <- 50

# let's make a parameter table
pars <- expand.grid(N = Ns,
                    mat_dist = mat_dists,
                    comp_dist = comp_dists,
                    disp_fun = disp_funs,
                    disp_dist = disp_dists,
                    rep = reps,
                    ngen = ngens)
```
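For reference, with the values above the grid contains just two parameter combinations, differing only in the dispersal distance:

```r
pars
#>      N mat_dist comp_dist disp_fun disp_dist rep ngen
#> 1 2000      0.1       0.2 brownian         1   1   50
#> 2 2000      0.1       0.2 brownian         4   1   50
```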
Now, I will set up the simulation and iterate through the grid:
```r
# defining a region
map <- world(
  xrange = c(-20, 20), # min-max longitude
  yrange = c(-20, 20), # min-max latitude
  landscape = "blank"
)

# these lists will hold the tree sequences (simplified and unsimplified)
trees_big <- list()
trees_big_us <- list()

for (row in seq_len(nrow(pars))) { # iterate through the parameter table
  # define a population
  pop <- population("POP",
                    time = 1,
                    N = pars[row, "N"],
                    center = c(0, 0),
                    radius = 20,
                    map = map,
                    mate_dist = pars[row, "mat_dist"],
                    competition_dist = pars[row, "comp_dist"],
                    dispersal_fun = as.character(pars[row, "disp_fun"]),
                    dispersal_dist = pars[row, "disp_dist"])

  # compile the model
  model <- compile(
    populations = pop,
    generation_time = 1,
    sim_length = ngens,
    resolution = 1,
    path = "~/Desktop/test-model",
    overwrite = TRUE
  )

  # sampling schedule
  samples <- sampling(model, times = ngens, list(pop, n / 2))

  # simulate
  slim(
    model, sampling = samples,
    sequence_length = 1,
    recombination_rate = 0, # simulate only a single locus
    method = "batch",       # change to "gui" to execute the model in SLiMgui
    random_seed = row,      # the parameter-table row controls the seed
    verbose = FALSE,
    retainall = TRUE,
    burnin = 0
  )

  # extract the tree sequence (recapitated and simplified)
  ts <- ts_load(model) %>%
    ts_recapitate(Ne = pars[row, "N"], recombination_rate = 0) %>%
    ts_simplify()

  # extract the unsimplified tree sequence
  tsu <- ts_load(model, simplify = FALSE)

  # collect the tree sequences
  trees_big[[row]] <- ts
  trees_big_us[[row]] <- tsu

  print(paste("Simulation number:", row))
}
```
In order to control the seed, I use the row of the parameter grid. Some data frame fiddling now lets me pool a bunch of results into one data frame, each annotated with the parameter combination it was run with.
```r
# merge all the spatial tree data into one data frame to plot
tree_data_big <- data.frame()
for (row in seq_len(nrow(pars))) {
  tree <- trees_big[[row]]
  tree_data_big_i <- tree %>%
    ts_data() %>%
    rowwise() %>%
    mutate(N = pars[row, "N"],
           fun = pars[row, "disp_fun"],
           rep = as.factor(pars[row, "rep"]),
           mat_dist = pars[row, "mat_dist"],
           comp_dist = pars[row, "comp_dist"],
           disp_dist = pars[row, "disp_dist"],
           sim = as.factor(row),
           x = unlist(location)[1],
           y = unlist(location)[2]) %>%
    ungroup()
  tree_data_big <- rbind(tree_data_big, tree_data_big_i)
  print(paste("Converting tree no", row))
}
```
```r
# the same merge for the unsimplified tree sequences
tree_data_bigu <- data.frame()
for (row in seq_len(nrow(pars))) {
  tree <- trees_big_us[[row]]
  tree_data_bigu_i <- tree %>%
    ts_data() %>%
    rowwise() %>%
    mutate(N = pars[row, "N"],
           fun = pars[row, "disp_fun"],
           rep = as.factor(pars[row, "rep"]),
           mat_dist = pars[row, "mat_dist"],
           comp_dist = pars[row, "comp_dist"],
           disp_dist = pars[row, "disp_dist"],
           sim = as.factor(row),
           x = unlist(location)[1],
           y = unlist(location)[2]) %>%
    ungroup()
  tree_data_bigu <- rbind(tree_data_bigu, tree_data_bigu_i)
  print(paste("Converting tree no", row))
}
```
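(For what it's worth, each of the two loops above could be collapsed into a single `lapply()` + `do.call(rbind, ...)`, which avoids growing the data frame row by row. A sketch for the simplified trees, assuming the same objects as above; the remaining parameter columns are added the same way as in the loop:)

```r
tree_data_big <- do.call(rbind, lapply(seq_len(nrow(pars)), function(row) {
  trees_big[[row]] %>%
    ts_data() %>%
    rowwise() %>%
    mutate(x = unlist(location)[1],
           y = unlist(location)[2]) %>%
    ungroup() %>%
    mutate(sim = as.factor(row),
           disp_dist = pars[row, "disp_dist"]) # ... plus the other parameters
}))
```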
This is not an elegant solution, but it does the job! Further down the line it might be nice to be able to feed slendr not just replicates, but also vectors of parameters which it could run on a grid. But that might be annoying to implement for now! Simply being able to run replicates would already go a long way towards addressing this issue.
In the end I have decided that adding an additional layer of parallelization across the `ts_*()` functions (which would be needed to make parallel runs of `msprime()` and `slim()` worthwhile) is probably not worth the effort. In particular, the tradeoff of complicating the package internals to make this happen would not bring as many benefits as I'd like.
This is especially true given how beautifully elegant and simple the future family of R packages makes implicit parallel computation.
If anyone encounters this issue at a later point, consider taking a look at the link above. I'm personally a huge fan of the packages furrr and future.apply. In fact, I've been using this approach heavily in my work on the new R package demografr (which is also what made me appreciate how much work it is to implement easy parallelization in a satisfying and robust way).
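As a rough illustration only (this is not slendr functionality; `model` and `samples` are assumed to be a compiled slendr model and a sampling schedule created earlier in the session, and the exact `msprime()` argument names may differ between slendr versions), running replicates of an already compiled model in parallel could look something like this:

```r
library(slendr)
library(future)
library(future.apply)

plan(multisession, workers = 4)  # or plan(sequential) while debugging

# each replicate gets its own seed; future.seed = TRUE keeps R's RNG use
# in the parallel workers statistically sound
ts_reps <- future_lapply(1:100, function(rep_i) {
  msprime(model, samples = samples,
          sequence_length = 1e6, recombination_rate = 1e-8,
          random_seed = rep_i)
}, future.seed = TRUE)
```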
It came up a couple of times already, so it's about time to implement it in the slendr interface.
It would be useful to have a way to have slendr automatically generate multiple simulation replicates from a single compiled model.
I'm thinking of something like `slim(model_object, ..., nreps = <number of replicates>)` and `msprime(model_object, ..., nreps = <number of replicates>)`. Maybe even with an optional `parallel = <TRUE|FALSE>`?

What this would do is create a set of outputs such as `output_slim_rep1.trees`, etc. (in the case of automatically named output tree-sequence files, when no output prefix is specified), or `<prefix>_slim_rep1.trees` when a user-defined output name has been specified.

Also, at the moment, if `ts_load(model)` is run and multiple tree-sequence files are present in the default model output directory, slendr complains about not knowing which tree sequence to load. There is a parameter `ts_load(model, file = <path to a tree-sequence file>, ...)` which takes care of this. So I wouldn't introduce a replicate number there, but would let the user specify the full path.

Thoughts, @mkiravn?
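(For illustration, loading hypothetical per-replicate outputs with the existing `ts_load(model, file = ...)` interface could then look something like the sketch below. The `output_slim_rep<i>.trees` naming pattern is just the one proposed above, not something slendr produces today, and I'm assuming the compiled model stores its output directory in `model$path`.)

```r
# hypothetical replicate files following the naming scheme proposed above
rep_files <- file.path(model$path, sprintf("output_slim_rep%d.trees", 1:10))

# load each replicate explicitly via the `file =` argument
ts_reps <- lapply(rep_files, function(f) ts_load(model, file = f))
```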