Closed bodkan closed 2 years ago
Hi @bodkan! I made a Jupyter notebook trying to explain the issues I encountered with a few examples. Overall, I found two main problems with the `sampling()` function, both related to how it decides whether an individual can or can't be sampled. Hope this helps with developing slendr!
Thanks @MoiColl! I will look at your notebook carefully and use it to try to debug your issue.
A couple of immediate thoughts:
`sampling()` only schedules sampling (i.e. remembering) of individuals from populations that are present at the time of sampling. If inconsistent events are scheduled (sampling before a population is created or after it is removed), those are (well, should be) left out of the sampling — i.e. such events should never make it to the SLiM backend during the simulation. If a user wants full control over this (i.e. if they want such inconsistent events to lead to an error and be caught on the R side), there is an option `strict = TRUE` (default `FALSE`) in `sampling()`.

Is this what you're asking for? I haven't looked at your notebook yet, but I will. Maybe we implemented the same thing; if not, I'll take some inspiration from you.
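For illustration, the difference between the two modes could look roughly like this (a sketch only — `model` and `pop` are made-up names, and the split time is hypothetical):

```r
# Sketch only -- `model` and `pop` are hypothetical; assume `pop` splits
# 60000 years ago, so a sampling time of 70000 years ago is impossible.

# default (strict = FALSE): the impossible event is silently dropped and
# only the sampling at 40000 years ago makes it into the schedule
schedule <- sampling(model, times = c(70000, 40000), list(pop, 5))

# strict = TRUE: the inconsistent event triggers an error on the R side,
# so the user can fix their schedule explicitly
schedule <- sampling(model, times = c(70000, 40000), list(pop, 5),
                     strict = TRUE)
```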
Thanks again! You're the best 1st user a developer could wish for. :D
Thanks for your kind words!
I use the bare `sampling()` function in section 3.1 (without the `strict =` option), and when I simulate with the `slim()` function in section 4.1 (scheduling the samples from 3.1) I get the following error:

```
Error on script line 4, character 8 (inside runtime script block):
stop("Population " + filter(POPULATIONS, "pop_id", i).getValue("pop") +
     ^^^^
R[write to console]: Error: SLiM simulation resulted in an error -- see the output above
Generation 34: remembering 1 individuals of AFR(p0)
Error: SLiM simulation resulted in an error -- see the output above
```
The function I implemented basically samples an exact number of individuals uniformly across time. Then, to make sure I was doing things right, I passed the sampling times and populations through `sampling()`. If you feel it could be an interesting function to incorporate into slendr, we can go into a bit more detail about it, but the function itself is quite basic.
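For context, a uniform-in-time helper along those lines might look something like this (my guess at the approach, not the actual notebook code — the helper name and signature are made up):

```r
# Hypothetical sketch (not the notebook implementation): draw sampling
# times uniformly between two bounds and pass them through slendr's own
# sampling() function, which performs the consistency checks
sample_uniformly_in_time <- function(model, pop, n, t_min, t_max) {
  times <- runif(n, min = t_min, max = t_max)
  sampling(model, times = times, list(pop, 1))
}

# e.g. schedule n = 1000 single-individual samplings of `afr` over time:
# schedule <- sample_uniformly_in_time(model, afr, 1000, 2500, 52000)
```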
Hello again @MoiColl.
First of all, thank you again for such a detailed bug report.
This issue turned out to be a massive pain to debug, and your notebook was extremely helpful. Especially the full slendr script that led to errors in R or SLiM was critical to figuring out what was going on.
After a couple of days battling with this, I have now fixed two errors. Copying the description from the pull request:
The `sampling()` function is now smarter in handling cases where individuals are scheduled for sampling effectively in the same generation in which their population is created. For instance, this can happen when these two events (given as times in years) "collapse" into a single generation in SLiM. The `sampling()` function can now detect this and adds such invalid samplings to those that are implicitly ignored (unless `sampling(..., strict = TRUE)` is set, in which case the function complains to the user, helping them specify a strictly defined schedule if they wish).
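The rounding effect behind this "collapse" can be sketched with plain arithmetic (the exact year-to-generation conversion slendr uses is an internal detail, so the numbers here are only illustrative):

```r
# Two distinct times in years can land in the same SLiM generation once
# they are divided by the generation time and rounded (illustrative only;
# slendr's exact conversion is an internal detail):
generation_time <- 30
split_time    <- 50995  # population created (years ago)
sampling_time <- 50980  # sampling scheduled (years ago)

floor(split_time / generation_time)     # 1699
floor(sampling_time / generation_time)  # 1699 -- the same generation
```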
Sometimes (very rarely!) an individual was scheduled for sampling twice. I couldn't believe it when I saw this happen, which is why fixing it took much longer. This led to the final tree sequence having one fewer individual than requested by the user via the `sampling()` call, causing crashes when `ts_load`-ing the data.

Here is a small test case provided by @MoiColl, who first reported these issues. The R package can now correctly deal with those problems.
These issues are very complex, and even more so because slendr allows flexible time specification in both the forward and backward direction, which complicates the sanity checks quite a lot. However, I'll close the issue for now and wait for other bugs to pop up.
@MoiColl: If you feel like it, would you mind testing the script again to see that things work properly now? The code below was copied from your notebook.
```r
# fetch the latest slendr version with the bugfixes implemented
devtools::install_github("bodkan/slendr")

library(slendr)

set.seed(1234)

# this links to my Python setup -- you might do something else
reticulate::use_condaenv("retipy", required = TRUE)
reticulate::py_config()

map <- world(xrange = c(-15, 60), yrange = c(20, 65), crs = "EPSG:3035")

africa <- region("Africa", map, polygon = list(c(-18, 20), c(40, 20), c(30, 33), c(20, 32), c(10, 35), c(-8, 35)))
europe <- region("Europe", map, polygon = list(c(-8, 35), c(-5, 36), c(10, 38), c(20, 35), c(25, 35), c(33, 45), c(20, 58), c(-5, 60), c(-15, 50)))
anatolia <- region("Anatolia", map, polygon = list(c(28, 35), c(40, 35), c(42, 40), c(30, 43), c(27, 40), c(25, 38)))

afr <- population("AFR", parent = "ancestor", time = 52000, N = 3000, map = map, polygon = africa)
ooa <- population("OOA", parent = afr, time = 51000, N = 500, remove = 25000, center = c(33, 30), radius = 400e3) %>%
  move(trajectory = list(c(40, 30), c(50, 30), c(60, 40)), start = 50000, end = 40000, snapshots = 20)
ehg <- population("EHG", parent = ooa, time = 28000, N = 1000, remove = 6000, polygon = list(c(26, 55), c(38, 53), c(48, 53), c(60, 53), c(60, 60), c(48, 63), c(38, 63), c(26, 60)))
eur <- population(name = "EUR", parent = ehg, time = 25000, N = 2000, polygon = europe)
ana <- population(name = "ANA", time = 28000, N = 3000, parent = ooa, remove = 4000, center = c(34, 38), radius = 500e3, polygon = anatolia) %>%
  expand(by = 2500e3, start = 10000, end = 7000, polygon = join(europe, anatolia), snapshots = 20)
yam <- population(name = "YAM", time = 7000, N = 500, parent = ehg, remove = 2500, polygon = list(c(26, 50), c(38, 49), c(48, 50), c(48, 56), c(38, 59), c(26, 56))) %>%
  move(trajectory = list(c(15, 50)), start = 5000, end = 3000, snapshots = 10)

gf <- list(
  geneflow(from = ana, to = yam, rate = 0.5, start = 6500, end = 6400, overlap = FALSE),
  geneflow(from = ana, to = eur, rate = 0.5, start = 8000, end = 6000),
  geneflow(from = yam, to = eur, rate = 0.75, start = 4000, end = 3000)
)

model <- compile(
  populations = list(afr, ooa, ehg, eur, ana, yam),
  geneflow = gf,
  generation_time = 30,
  resolution = 10e3,
  competition_dist = 130e3, mate_dist = 100e3,
  dispersal_dist = 70e3,
  dir = "~/Desktop/moi-model", overwrite = TRUE
)

samples1 <- slendr::sampling(
  model, times = runif(n = 1000, min = 2500, max = 52000),
  list(afr, 1), list(ooa, 1), list(ehg, 1), list(eur, 1), list(ana, 1)
)

print(paste("Nº of samples : ", sum(samples1$n), sep = ""))
samples1 %>% summary() %>% print()

# this no longer crashes because the sampling events inconsistent with
# population split times are now handled by `sampling()` internally
slendr::slim(model,
  sequence_length = 1,
  recombination_rate = 0,
  sampling = samples1,
  verbose = TRUE,
  save_locations = TRUE,
  method = "batch",
  seed = 1234,
  save_sampling = TRUE)

# we no longer have the issue of some individuals going missing and breaking
# the loading of tables of individuals
ts <- ts_load(model, recapitate = TRUE, simplify = TRUE,
              Ne = 10000, recombination_rate = 0)

# verify we get the same number of samples from the tree sequence
ts_samples(ts) %>% nrow()
```
Reopening to allow more discussion, if needed.
Also, I haven't yet had a chance to look at the sampling function you wrote yourself as a workaround. My hope is that the smarter `sampling()` in slendr can now do something similar.

I'm currently starting to work on scheduling samplings in space (not just in time, i.e. specifying where the sampling should occur), which will require modifications to `sampling()`. Happy to chat about your solution and/or possible extensions to `sampling()` after I have a draft of the spatial sampling finished (later this week, I hope).
Hi @bodkan!
I've run my example again and it seems that the first issue has been solved! The `slim()` function now runs with no errors reported.

However, when I try to load the tree structure generated from the second simulated sampling example (`samples2`), it still reports the same error:

```
> ts_load(model)
Error: Can't recycle `..1` (size 1499) to match `..2` (size 1500).
Run `rlang::last_error()` to see where the error occurred.
```
Thanks for checking things this quickly. I'm glad to hear that `sampling()` seems to be fixed for this case, and that the weird interaction with the SLiM backend dropping individuals scheduled by `sampling()` is gone.

`samples2` is the table you generated yourself, right? I'm not completely sure how to approach these situations. My reasoning would be that not using the built-in `sampling()` function means that the user must implement all sanity checks themselves. I.e., maybe there is some consistency testing done by `sampling()` that your own function doesn't do?

For instance, in the last `if (...) else ...` condition, I don't see your `sampling_uniform_intime()` function checking for the split time of each population, the time of its removal, etc. It might not be apparent to the user because it's a hidden implementation detail, but each `slendr_pop` object internally carries its split time, removal time, and other annotation information that `sampling()` and other functions use for consistency checks.

The fact that your function makes SLiM apparently "drop" one individual suggests that `sampling_uniform_intime()` misses some edge case. I don't think I can debug your function right now, but the point where you don't use `sampling()` for scheduling and instead create your own table row seems like a likely culprit.

If you would rather use your own function, maybe you could utilize `sampling()` for every individual sampling time? I.e., let it handle the consistency checks and use it to compose your own complex sampling schedule. My hope was that users could rely on `sampling()` for everything and let it take care of the internal details of each population the way it's built to do, rather than utilizing it only partially.

To explain my reasoning a little better, let's look at the following scenario.
A user creates a single population `pop <- population("xyz", time = 1, remove = 100)` and wants to simulate 1000 generations.

If they schedule sampling like `sampling(model, times = c(50, 200), list(pop, 1))`, slendr will instruct SLiM to remember an individual at generation 50, because `sampling()` will ignore the sampling at time 200. In the final tree sequence, only the individual at time 50 will be remembered.

But what if the user instead creates a custom data frame schedule, sampling an individual at times 50 and 200? Because the population will be removed at time 100, the sampling at time 200 can never happen, and `ts_load()` will run into a conflict: slendr assumes that two samples should be present (given by the user-defined schedule), but SLiM will only remember one individual.

If I were to make sure that every custom schedule provided by the user is valid, I would have to implement another series of checks for that. I don't think that makes a lot of sense if the same could be achieved simply by writing a bit of custom code that internally utilizes the `sampling(...)` function.
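A rough sketch of that scenario (abbreviated — a real model needs a map, `compile()`, etc., and the columns of the hand-built data frame are my guess at the schedule format):

```r
# Abbreviated sketch of the scenario above (a real model needs more setup)
pop <- population("xyz", time = 1, N = 100, remove = 100)

# via sampling(): the event at time 200 -- after the removal at time 100 --
# is filtered out, so the R-side schedule matches what SLiM will record
schedule <- sampling(model, times = c(50, 200), list(pop, 1))

# via a hand-built data frame (column names are guessed here): nothing
# filters out the impossible event, so slendr expects two samples while
# SLiM can only remember one, and ts_load() later fails with the
# size-mismatch error shown above
schedule <- data.frame(time = c(50, 200), pop = "xyz", n = 1)
```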
That said, let's revisit this once spatial sampling is implemented. I have started implementing this in this branch.
Just a heads up: spatial sampling is now implemented here (documentation is here). Everything is merged into the main branch.
I included two larger unit tests. Things seem to be working alright in the base case but I'll be keeping an eye on bugs as I'm using this functionality for examples in our manuscript.
I reran my test (still using my custom `sampling_uniform_intime()`) and it seems that `ts_load()` is now able to load the data without problems. I'm guessing that when I updated slendr back on the 7th of Dec (with `devtools::install_github("bodkan/slendr")`), I might not have installed the latest version for some reason.
Answering your comment from the 7th of Dec:

> `samples2` is the table you generated yourself, right? I'm not completely sure how to approach these situations. My reasoning would be that not using the built-in `sampling()` function means that the user must implement all sanity checks themselves. I.e., maybe there is some consistency testing done by `sampling()` that your own function doesn't do?
To be clear, the sampling function I coded myself, `sampling_uniform_intime()`, uses slendr's `sampling()` function internally to check that the sampling I'll perform is correct.

Thanks for addressing the issues I found while using the `sampling()` function!
I'm glad to hear that upgrading solved the issues (hopefully not creating many others :)) and thanks for reporting back!
There are some other minor comments in my Jupyter notebook, in the header of the file, that you might want to check, such as links not working properly or slendr conflicting with some tidyverse functions.
I created a new issue about this so that I don't forget. Thanks again!
@MoiColl, the first brave soul using slendr for Real Science has encountered a puzzling issue with individuals scheduled for sampling in slendr actually not being recorded by SLiM. I suspect this is due to some issue with the sampling: maybe an individual was scheduled at a time at which a population was not yet created in the simulation? IIRC his issue was that 1000 individuals were scheduled for recording but SLiM saved only 999, which really does suggest some annoying "off-by-one" error of sorts.
Let's try to get to the bottom of this.
@MoiColl, whenever you feel like investing a bit more time into this, it would be great if you could provide here an example of a slendr script which would reproduce the error (please include also the random seed used in the simulation). Thank you!