Remembered individuals missing despite being scheduled for sampling

bodkan commented 2 years ago

@MoiColl, the first brave soul using slendr for Real Science has encountered a puzzling issue with individuals scheduled for sampling in slendr actually not being recorded by SLiM. I suspect this is due to some issue with the sampling: maybe an individual was scheduled at a time at which a population was not yet created in the simulation? IIRC his issue was that 1000 individuals were scheduled for recording but SLiM saved only 999, which really does suggest some annoying "off-by-one" error of sorts.

Let's try to get to the bottom of this.

@MoiColl, whenever you feel like investing a bit more time into this, it would be great if you could provide here an example of a slendr script which would reproduce the error (please include also the random seed used in the simulation). Thank you!

MoiColl commented 2 years ago

Hi @bodkan! I made a jupyternotebook trying to explain with a examples the issues I encountered.

Overall, I found 2 main problems with the sampling() function:

When sampling many individuals, sometimes there are some scheduled to be sampled at the same generation when their population is created. I could solve this issue by creating a function (included in the jupyternotebook also) which only allows sampling individuals 1 generation after the population they belong to is created and 1 generation before the population gets extinct. So, it might be worth using this criteria in the sampling() function to decide if an individual can be or can't be sampled.
There are individuals scheduled to be sampled that for some reason SLiM does not seem to be sampling. I could identify one case. This individual is not close to a demographic event (geneflow, population creation/destruction...) and it seems that other individuals are successfully sampled after and before it. I don't have any good intuition for why this might be.

Hope this helps developing slendr!

bodkan commented 2 years ago

Thanks @MoiColl! I will look at your notebook carefully and use it to try to debug your issue.

A couple of immediate thoughts:

By default, sampling() only schedules sampling (i.e. remembering) of individuals from populations that are present at the time of sampling. If there are some inconsistent events scheduled (sampling before a population is created or after it will be removed), those are (well, should) be left out of the sampling. I.e. such events should never make it to the SLiM backend during the simulation. If a user wants a full control over this (i.e. if they want such inconsistent events to lead to error and be caught on the R side), there is an option strict = TRUE (default FALSE) in sampling().

Is this what you ask for? I did not take a look at your notebook but I will. Maybe we implemented the same thing, maybe not and I will take some inspiration for you in that case.

Thanks for providing a reproducible example of this error. I will run your code and see if a) I can reproduce the issue, b) can fix it.

Thanks again! You're the best 1st user a developer could wish for. :D

MoiColl commented 2 years ago

Thanks for your kind words!

I use the bare sample() function in 3.1. section (without the scrict = option) and when I simulate with slim() function in 4.1. (scheduling to sample the samples in 3.1.) I get the following error:

Error on script line 4, character 8 (inside runtime script block):

        stop("Population " + filter(POPULATIONS, "pop_id", i).getValue("pop") +
        ^^^^
R[write to console]: Error: SLiM simulation resulted in an error -- see the output above

Generation 34: remembering 1 individuals of AFR(p0)

Error: SLiM simulation resulted in an error -- see the output above

The function I implemented was to basically sample uniformly across time an exact number of individuals. Then, to make sure I was doing things right, I passed the sampling time and pop through sample(). If you feel it could be an interesting function to incorporate to slendr() we can go a bit more in detail about it, but the function itself is quite basic.

bodkan commented 2 years ago

Hello again @MoiColl.

First of all, thank you again for such a detailed bug report.

This issue turn out to be a massive pain to debug and your notebook was extremely helpful. Especially the full slendr script which leads to errors in R or SLiM was critical to figuring out what's going on.

After a couple of days battling with this, I have now fixed two errors. Copying the description from the pull request:

The sampling() function is now smarter in handling cases where individuals are being scheduled for sampling at the effectively in the same generation as their population is being created. For instance, this can arise when these two events (given as time in years) "collapse" to a single generation in SLiM. The sampling() function can now detect this, and add those invalid samplings to those that are implicitly ignored (unless sampling(..., strict = TRUE) in which the function complains to the user, helping them specify a strictly defined schedule if they wish so).
Sometimes (very rarely!) an individual was scheduled for sampling twice. I couldn't believe this when I saw this happen, which is why fixing this took much longer. This lead to the final tree sequence having one less individual then requested by the user via the sampling() call, leading to crashes when ts_load-ing the data.

Here is a small test case provided by @MoiColl which first reported these issues. The R package can now correctly deal with those problems.

These issues are very complex, and even more so because slendr allows flexible time specification in forward and in backward direction which complicates the sanity checks quite a lot. However, I'll close the issue for now and wait for other bugs to pop up.

@MoiColl: If you feel like it, would you mind testing the script again to see that things work properly now? The code below was copied from your notebook.

# fetch the latest slendr version with the bugfixes implemented
devtools::install_github("bodkan/slendr")

library(slendr)

set.seed(1234)

# this links to my Python setup -- you might do something else
reticulate::use_condaenv("retipy", required = TRUE)
reticulate::py_config()

map <- world(xrange = c(-15, 60), yrange = c(20, 65),crs = "EPSG:3035")

africa   <- region("Africa",   map, polygon = list(c(-18, 20), c(40, 20), c(30, 33),c(20, 32), c(10, 35), c(-8, 35)))
europe   <- region("Europe",   map, polygon = list(c(-8, 35), c(-5, 36), c(10, 38), c(20, 35), c(25, 35),c(33, 45), c(20, 58), c(-5, 60), c(-15, 50)))
anatolia <- region("Anatolia", map, polygon = list(c(28, 35), c(40, 35), c(42, 40), c(30, 43), c(27, 40), c(25, 38)))

afr <- population("AFR", parent = "ancestor", time = 52000, N = 3000, map = map, polygon = africa)
ooa <- population("OOA", parent = afr, time = 51000, N = 500, remove = 25000, center = c(33, 30), radius = 400e3) %>%
  move(trajectory = list(c(40, 30), c(50, 30), c(60, 40)), start = 50000, end = 40000, snapshots = 20)
ehg <- population("EHG", parent = ooa, time = 28000, N = 1000, remove = 6000, polygon = list(c(26, 55), c(38, 53), c(48, 53), c(60, 53), c(60, 60), c(48, 63), c(38, 63), c(26, 60)))
eur <- population( name = "EUR", parent = ehg, time = 25000, N = 2000, polygon = europe)
ana <- population( name = "ANA", time = 28000, N = 3000, parent = ooa, remove = 4000, center = c(34, 38), radius = 500e3, polygon = anatolia) %>%
  expand(by = 2500e3, start = 10000, end = 7000, polygon = join(europe, anatolia), snapshots = 20)
yam <- population( name = "YAM", time = 7000, N = 500, parent = ehg, remove = 2500, polygon = list(c(26, 50), c(38, 49), c(48, 50), c(48, 56), c(38, 59), c(26, 56))) %>%
  move(trajectory = list(c(15, 50)), start = 5000, end = 3000, snapshots = 10)

gf <- list(
  geneflow(from = ana, to = yam, rate = 0.5,  start = 6500, end = 6400, overlap = FALSE),
  geneflow(from = ana, to = eur, rate = 0.5,  start = 8000, end = 6000),
  geneflow(from = yam, to = eur, rate = 0.75, start = 4000, end = 3000))

model <- compile(
  populations      = list(afr, ooa, ehg, eur, ana, yam),
  geneflow         = gf,
  generation_time  = 30,
  resolution       = 10e3,
  competition_dist = 130e3, mate_dist = 100e3,
  dispersal_dist   = 70e3,
  dir              = "~/Desktop/moi-model", overwrite = TRUE)

samples1 <- slendr::sampling(
  model, times = runif(n = 1000, min = 2500, max = 52000),
  list(afr, 1), list(ooa, 1), list(ehg, 1), list(eur, 1), list(ana, 1)
)

print(paste("Nº of samples : ", sum(samples1$n), sep = ""))
samples1 %>% summary() %>% print()

# this no longer crashes because the sampling events inconsistent with
# population split times are now handled by `sampling()` internally
slendr::slim(model,
             sequence_length    = 1,
             recombination_rate = 0,
             sampling           = samples1,
             verbose            = TRUE,
             save_locations     = TRUE,
             method             = "batch",
             seed               = 1234,
             save_sampling      = TRUE)

# we no longer have the issue by some individuals gone missing, breaking
# the loading of tables of individuals
ts <- ts_load(model, recapitate = TRUE, simplify = TRUE,
              Ne = 10000, recombination_rate = 0)

# verify we get the same number of samples from the tree sequence
ts_samples(ts) %>% nrow()

bodkan commented 2 years ago

Reopening to allow more discussion, if needed.

bodkan commented 2 years ago

Also, I didn't have yet a chance to look at your own sampling function you wrote as a workaround.

My hope is that the smarter sampling() in slendr can now do something similar.

I'm currently starting to work on scheduling samplings in space (not just in time, so specifying where should the sampling occur) which will require modifications to sampling(). Happy to chat about your solution and/or possible expansions to sampling() after I have the draft of the spatial sampling finished (later this week I hope).

MoiColl commented 2 years ago

Hi @bodkan!

I've run again my example and it seems that the first issue has been solved! Now slim() function runs fine with no errors reported.

However, when I try to load the tree structure generated from the second simulated sampling example (samples2), it still reports the same error.

> ts_load(model)
Error: Can't recycle `..1` (size 1499) to match `..2` (size 1500).
Run `rlang::last_error()` to see where the error occurred.

bodkan commented 2 years ago

Thanks for checking things this quickly. I'm glad to hear that sampling() seems to be fixed for this case, and that the weird interaction with the SLiM backend dropping individuals scheduled by sampling() is gone.

samples2 is the table you generated yourself, right? I'm not completely sure how to approach these situations. My reasoning would be that not using the built in sampling() function means that the user must implement all sanity checks themselves. I.e., maybe there is some consistency testing done by sampling() that your own function doesn't do?

For instance, in the last if (...) else ... condition, I don't see your sampling_uniform_intime() function checking for the split time of each population, or the time of its removal, etc. It might be not apparent to the user because it's a hidden implementation detail, but each slendr_pop object carries internally its split time, removal time, and other annotation information, that sampling() and other functions use for consistency checks.

The fact that your function makes SLiM apparently "drop" one individual suggests that sampling_uniform_intime() misses some edge case. I don't think I can debug your function right now, but the point where you don't use sampling() for scheduling but create your own table row seems like a likely culprit.

If you would rather use your own function, maybe you could utilize sampling() for every time of sampling? I.e. let it handle the consistency checks, and use it to compose your own complex sampling schedule?

My hope was that sampling() could be

used for all sampling schedules in simpler cases
used as a smaller function in more complex user-defined scheduling code. My suggestion would be (because it seems you want a custom schedule) to use sampling() for everything, and let it take care of the internal details of each population they way it's built to do, rather than utilizing it only partially.

bodkan commented 2 years ago

To explain my reasoning a little better, let's look at the following scenario.

User creates a single population pop <- population("xyz", time = 1, remove = 100) and wants to simulate 1000 generations.

If they schedule sampling like so sampling(model, times = c(50, 200), list(pop, 1), slendr will instruct SLiM to remember an individual at generation 50 because sampling will ignore sampling at time 200. In the final tree sequence, only individual at time 50 will be remembered.

Now if the user instead creates a custom data frame schedule, sampling an individual at time 50 and 200? Because the population will be removed at time 100, the sampling at time 200 can never happen. Using ts_load() will create a conflict. slendr assumes that two samples should be present (given by the user-defined schedule), but SLiM will only remember one individual.

If I were to make sure that every custom schedule provided by the user is valid, I would have to implement another series of checks for that. I don't think it makes a lot of sense to do that, if the same could be achieved simply by writing a bit of custom code internally utilizing the sampling(...) function.

That said, let's revisit this once spatial sampling is implemented. I have started implementing this in this branch.

bodkan commented 2 years ago

Just a heads up: spatial sampling is now implemented here (documentation is here). Everything is merged to a main branch.

I included two larger unit tests. Things seem to be working alright in the base case but I'll be keeping an eye on bugs as I'm using this functionality for examples in our manuscript.

MoiColl commented 2 years ago

I rerun my test (still using my custom sampling_uniform_intime()) and it seems that ts_load() is able to load the data without problems. I'm guessing when I updated slendr back on the 7th of Dec (with devtools::install_github("bodkan/slendr")), I might have not installed the latest version for some reason.

Answering your comment from the 7th of Dec:

samples2 is the table you generated yourself, right? I'm not completely sure how to approach these situations. My reasoning would be that not using the built in sampling() function means that the user must implement all sanity checks themselves. I.e., maybe there is some consistency testing done by sampling() that your own function doesn't do?

To be clear, the sampling function I coded myself sampling_uniform_intime() uses slendr's sampling() function internally to check that the sampling that I'll perform is correct.

Thanks for adressing the issues I found while using the sampling() function!

bodkan commented 2 years ago

I'm glad to hear that upgrading solved the issues (hopefully not creating many others :)) and thanks for reporting back!

there are some other minor comments in my jupyternotebook, in the header of the file, that you might want to check such as links not working properly or slendr conflicting with some tidyverse functions

I created a new issue about this so that I don't forget. Thanks again!

bodkan / slendr

Remembered individuals missing despite being scheduled for sampling #69