Implement the possibility to use raster data

bodkan commented 3 years ago

I finally figured out a way to handle general raster spatial data using sf and stars in R. Brief example below:

library(ggplot2)
library(sf)
library(stars)
library(rnaturalearth)

#map <- ne_download(scale = 50, type = "NE1_50M_SR_W", category = "raster",
#                   returnclass = "sf", destdir = "~/Desktop/ne_maps")

map <- ne_load(scale = 50, type = "NE1_50M_SR_W", category = "raster",
               returnclass = "sf", destdir = "~/Desktop/ne_maps/")
map <- st_as_stars(map)

# plot rgb channels using ggplot2 - this is what makes the plot "look nice",
# similarly to the Natural Earth documentation
mapr <- st_rgb(map)
ggplot() + geom_stars(data = mapr, downsample = 10) + scale_fill_identity()

# by keeping the data in the sf/stars ecosystem, we can still add normal sf-based
# data to the raster map background
ggplot() + geom_stars(data = mapr, downsample = 10) + scale_fill_identity() +
  geom_sf(data = st_as_sf(data.frame(x = 0, y = 0), coords = c("x", "y")))

# subset to West Eurasia
subset_map <- ne_load(scale = 50, type = "NE1_50M_SR_W", category = "raster",
               returnclass = "sf", destdir = "~/Desktop/ne_maps/") %>%
  st_as_stars() %>%
  st_crop(st_bbox(c(xmin = -15, xmax = 60, ymin = 20, ymax = 65),
        crs = 4326)) %>%
  st_rgb()

# plot a coarse downsampled raster
ggplot() + geom_stars(data = subset_map, downsample = 10) +
  scale_fill_identity() +
  coord_sf()

# plot a finer  raster
ggplot() + geom_stars(data = subset_map, downsample = 5) +
  scale_fill_identity() +
  coord_sf()

# and an even finer raster
ggplot() + geom_stars(data = subset_map, downsample = 1) +
  scale_fill_identity() +
  coord_sf()

# plot data which is not downsampled (not feasible with the original world map,
# only works with zoomed in map)
ggplot() + geom_stars(data = subset_map) +
  scale_fill_identity() +
  coord_sf()

How would be go about using similar data sets (climate data, environmental data, etc.) for simulations? Loading the maps as separate image rasters on the SLiM side would be easy.

They could be refered to in fitness() callbacks? What if there would be more things affecting fitness other than the value(s) of raster(s) that an individual is standing on?

bodkan commented 2 years ago

With the slendr preprint being finished, this is now officially in the "immediate future plans" pipeline.

I will close this because I get anxious about unanswered issues. :) Will revisit this again when the time is right. In any case, this will require much more substantial changes than just "add proper raster maps" so it's not something that will be tacked as a single issue. Non-WF models and all that jazz.

bhaller commented 2 years ago

Hi @bodkan. Just a tip regarding unanswered issue anxiety. :-> I feel similarly. My coping mechanism is to use the "labels" feature of GitHub to manage my issues. I can tag some issues as "long term" if I'm not thinking about them right now, and when I'm looking for short-term work that needs to be done I can easily search for issues that do not have the "long term" tag and then those all get filtered out. You probably know this already, but just in case you don't. :->

bodkan commented 2 years ago

Hmm, yes, that's a good point. I did use the "labels" feature in the past but then dropped them over time. I should get back to it though. Especially using those "long term" tags might be helpful for issues like this. Thanks!

bodkan commented 2 years ago

This is a long post. In case you got a notification, it's not urgent to read any of this! That said, if you're bored at some point later, I would appreciate feedback. Thanks!

A couple of action points from a brainstorming session with @FerRacimo. We did not discuss anything technical, just broadly talked about the scope of what we would like to do on the short-to-midterm horizon in terms of moving over to raster-based non-WF models. None of this is set in stone (not a single line of code has been written to do this), but these seemed like reasonable aims and boundaries.

Input raster formats

We will assume that raster landscape maps will be in in the form of raster data cubes, with individual layers indexable by time encoded in standard slendr "model time units" (i.e. years before present, or generations before present, or years between 2022 and 2122, etc.). So the same time units that are used to specify population splits, gene flow, and other demographic events.
We will not provide (too much) functionality for integrating the countless number of geospatial raster data formats: it will be the user's job to pre-format their input rasters (climate, elevation, etc.) into something that can be loaded with the data cube raster R package stars, a sibling of the amazing sf package. Every bit of spatio-temporal information in slendr (tree-sequence node locations and times, population boundaries, etc.) is already sf-compatible, so we will keep the raster inputs under the same umbrella. If the user can operate sf-features of slendr, they will be able to work with rasters.
Similarly, because the landscape/climate raster data could represent infinite number of variables of different scales and measurement units (elevation, temperature, land cover type, other ecological niche parameters), we will assume that the raster input cubes for a slendr model is will be "normalized" on a [0 ... 1] scale, where:
- 0 == totally uninhabitable location
- 1 == the most amazing location possible What the distribution of "suitability" between 0 and 1 should be is currently unclear to me. Should it be linear? I guess?
If multiple different sources of environmental data would be needed (i.e. temperature and elevation and ...), it would be the users job to combine those to a single normalized cube (perhaps weighting each variable differently? up to the user).
Such "0-1 normalized" spatio-temporal raster cubes will be specified either for each population or for the whole world itself (shared by each population). I think it makes sense to specify this in a compile_model() phase. We could have a rasters = parameter in that function, which would be either a single cube for the whole model (used for all populations), or a named list of cubes (each element named as the population for which the cube will be used).

On a related topic, I will check out other simulation tools used in ecology (such as poems) to see what kinds of data formats people use for these "suitability rasters".

SLiM implementation

I'm currently thinking about using the 0-1 normalized raster cubes for the simplest possible carrying-capacity model. I'm imagining something like this (code from Bradburd and Ralph), leveraging slendr's current map-swapping callback function which is already in place (except that those implicitly built raster maps in slendr are now binary 0/1 maps). Then using this kind of fitness calculation to model the suitability of each point on the landscape.
I will need to re-read everything there's to be learned about non-WF modeling in SLiM. It's been a long time since I looked at that. I should also take a look at some recent exciting developments in SLiM, to see what the best way to do this is right now (I'm looking at you, localPopulationDensity()).

R implementation

This is mostly a question of fitting this in the current slendr interface. Which additional parameters will be needed? For instance, we have N number of individuals right now, but non-WF models will require additional parameters. Can we fit those in the same function interfaces? I'd prefer to do it this way.

The non-WF back end will need to be maintained separately. However, vast majority of the WF back end will be reusable without modification (I think). The way I see it, when populations will be created in R, their parametrization will already determine whether we're dealing with a WF or non-WF model, so compile_model() and slim() shouldn't need any user-facing changes.

Pinging @petrelharp and @bhaller in case they had suggestions or comments. Also pinging @mkiravn because a) she had been using the current WF-focused slendr for awesome spatial modeling work, b) she had to work around some WF-related spatial annoyances and might have ideas for things she wishes she had instead. 🙂 Also pinging @FerRacimo in case he wanted to add anything.

There's lots of cool stuff done in ecology that I don't know about. Any links to relevant prior work is highly appreciated!

Important note: none of this will be implemented for the slendr 1.0 that's described in the preprint and which will go under review at some point soon. So there's no hard time feasibility limit to what should we be aiming for here.

bhaller commented 2 years ago

Hi @bodkan. This all sounds very cool!

Yes, there have been recent improvements like localPopulationDensity() in SLiM that may be useful. You will have additional challenges in that respect, because you don't just want the "local interaction field integral" to depend upon the rectangular edges of the subpopulation's spatial boundary (which is what localPopulationDensity() does, courtesy of clippedIntegral()), but also, ideally, upon the actual amount of habitable area (i.e., land, typically) within the maximum interaction distance radius, I suppose. I.e., you want the local population density calculation to depend on, perhaps, the sum of the raster-based habitability values within that maximum radius, weighted by the value of the interaction function at each point, or something like that. It has occurred to me that that is something that people would want to do, but no work has gone toward enabling it yet. @petrelharp has been doing a lot of thinking about spatial modeling techniques, and intends to write a paper about it soon I think; he will probably have more to say on that topic. Perhaps one or both of you will eventually want to propose a new method in SLiM that would do the necessary calculations.

More broadly, a couple of comments:

(1) A tranche of improvements to the InteractionType methods and other things is coming down the pike, to better support multispecies models but also to better support interactions between individuals in different subpopulations. Backward compatibility is broken in several places, but not in any way that will cause you any real trouble I think – just minor renaming and such. That will be on the multispecies branch soon; harder to say when it will be merged into master.

(2) It sounds like you're thinking about environmental rasters primarily (solely?) from the perspective of habitat quality that scales fitness. You might want to consider other possible uses for them, I suppose? Some habitats are better for foraging, some are better for reproducing in, some are easier/harder to disperse across, etc. Of course you won't want to support all of that to begin with, but you might want to think about how to keep your design general/flexible so that your environmental rasters can be used for different purposes down the road; don't name all of the APIs regarding them with "habitability" in the names, etc. :->

(3) I guess you will need to resample environmental data that is at one scale/resolution/orientation to turn it into rasters for use in SLiM at another scale/resolution/orientation, including interpolating values and so forth. I don't know much about it, but I gather that is a large and complex topic; I think people called it "kriging"? https://en.wikipedia.org/wiki/Kriging. Maybe sf will handle all of those details for you, but if not, I think there's a whole literature on how to best do such interpolation/resampling.

That's all I can think of for now. Again, I'll let you know when the multispecies branch has been updated with the new InteractionType stuff. I'm excited about that, it greatly broadens the utility of those methods.

petrelharp commented 2 years ago

I am planning to write a paper on spatial modeling soon! (very soon!) And I agree, that should be a good reference.

you want the local population density calculation to depend on, perhaps, the sum of the raster-based habitability values within that maximum radius, weighted by the value of the interaction function at each point, or something like that.

Yes, probably - and, I remember thinking we needed this - however, that local-integral-of-a-raster is itself a raster, so can be pre-computed just once? We need to think through some concrete examples.

sallycylau commented 2 years ago

Hi! I just discovered this package and this is the coolest thing I have seen in a while!

Just a suggestion.... it would be really amazing to be able to include different raster over different time points (to constrain habitat fluctuation following glacial-interglacial cycles). This is, of course, just a minor suggestion from a fan of this package, as I think it would open many possibilities for evolutionary simulations!

Love the package!

bodkan commented 2 years ago

Hello @sallycylau!

Thank you for the kind words! I'm glad you're as excited about this software as I am. :)

Thank you also for the suggestion. Indeed, the plan to have the possibility to switch out rasters dynamically over time (i.e. having them inputted as data cubes) is very much in the works. 👍

When I make progress on this in hopefully not too distant future I will ping you here for comments and ideas.

bodkan / slendr