bodkan / slendr

Population genetic simulations in R šŸŒ
https://bodkan.net/slendr
Other
52 stars 5 forks source link

How to interpret location coordinates #89

Closed percyfal closed 2 years ago

percyfal commented 2 years ago

Opening a new issue as requested, cf issue 64.

I'm using the slendr example (verbatim copy-paste) model to generate a set of tree sequences in SLiM. The world coordinates are defined by the following code:

map <- world(
  xrange = c(-15, 60), # min-max longitude
  yrange = c(20, 65),  # min-max latitude
  crs = "EPSG:3035"    # coordinate reference system (CRS) for West Eurasia
)

When I load the tree sequence file generated by SLiM I see that the metadata field is populated with the following metadata:

import tskit
ts = tskit.load("output_slim.trees")
ts.metadata
{'SLiM': {'file_version': '0.7',
  'generation': 1734,
  'model_type': 'WF',
  'nucleotide_based': False,
  'separate_sexes': False,
  'spatial_dimensionality': 'xy',
  'spatial_periodicity': '',
  'stage': 'late',
  'user_metadata': {'slendr': [{'arguments': [{'BURNIN_LENGTH': [0],
       'MAX_ATTEMPTS': [1],
       'RECOMB_RATE': [1e-08],
       'SEED': [314159],
       'SEQUENCE_LENGTH': [10000000],
       'SIMULATION_LENGTH': [1733]}],
     'backend': ['SLiM'],
     'description': [''],
     'map': [{'crs': [3035],
       'extent': [1647066.10300782,
        117021.735439067,
        9388656.91382807,
        5472983.33914087],
       'resolution': [10000]}],
     'sampling': [{'n': [3000.0, 2000.0],
       'pop': ['AFR', 'EUR'],
       'time_gen': [1734, 1734],
       'time_orig': [10, 10],
       'x': [-1.0, -1.0],
       'x_orig': [-1.0, -1.0],
       'y': [-1.0, -1.0],
       'y_orig': [-1.0, -1.0]}],
     'version': ['slendr_0.1.0.9000']}]}}}

The first individual looks as follows:

ts.individuals()[0]
Individual(id=0, flags=458752, location=array([160.46683997, 108.24079614,   0.        ]), parents=array([17785, 18424], dtype=int32), nodes=array([370, 371], dtype=int32), metadata={'pedigree_id': 10760615, 'pedigree_p1': 10756318, 'pedigree_p2': 10757656, 'age': -1, 'subpopulation': 0, 'sex': -1, 'flags': 0})

My question is how to relate the location coordinates above ([160.46683997, 108.24079614, 0. ]) to long, lat (btw is 0. related to angle?)? Also what do the map:extent numbers refer to (map rectangle defined by [x0, y0, x1, y1]?)?

I mentioned using the python package geoviews for plotting. This example shows that one can use longitude, latitude coordinate pairs to define location, which is why I'm looking for a way to convert the SLiM location coordinates.

I hope this makes sense. I guess you have outputs already from your documentation example or do you want me to post the entire code sequence I used?

bodkan commented 2 years ago

Ah, yeah. I see. Thanks for the more detailed description.

Briefly, right now coordinates of slendr individuals in tree sequences are stored in a "raster-space" ("x" and "y" defined by width and height of the raster). This is actually not necessary because SLiM allows arbitrary spatial dimension units to be specified regardless of the raster dimensions.This is a historically artifact from the early days of slendr (before slendr even existed), not a technical limitation, because back then I didn't fully understand how SLiM spatiality works. It is something I want to change at some point but have not gotten to it yet, mostly because it's an internal issue that doesn't manifest in the R interface itself.

With the background out of the way, if you run

ts <- ts_load(model, recapitate = TRUE, simplify = TRUE, Ne = 10000, recombination_rate = 1e-8)

to get the tree sequence and then extract the spatio-temporal information with

ts_data(ts)

You should get something like this:

> data <- ts_data(ts)
> data

[... snip ...]

overview of the underlying sf object:

# A tibble: 18,245 Ɨ 11
   name  pop   ind_id node_id  time           location sampled remembered retained alive pedigree_id
   <chr> <fct>  <dbl>   <int> <dbl>        <POINT [m]> <lgl>   <lgl>      <lgl>    <lgl>       <dbl>
 1 AFR_1 AFR        0       0    10 (4171418 790022.5) TRUE    TRUE       TRUE     TRUE     10758709
 2 AFR_1 AFR        0       1    10 (4171418 790022.5) TRUE    TRUE       TRUE     TRUE     10758709
 3 AFR_2 AFR        1       2    10 (4320028 877509.2) TRUE    TRUE       TRUE     TRUE     10758710
 4 AFR_2 AFR        1       3    10 (4320028 877509.2) TRUE    TRUE       TRUE     TRUE     10758710
 5 AFR_3 AFR        2       4    10 (3611921 919330.8) TRUE    TRUE       TRUE     TRUE     10758711
 6 AFR_3 AFR        2       5    10 (3611921 919330.8) TRUE    TRUE       TRUE     TRUE     10758711
 7 AFR_4 AFR        3       6    10  (3004082 1020762) TRUE    TRUE       TRUE     TRUE     10758712
 8 AFR_4 AFR        3       7    10  (3004082 1020762) TRUE    TRUE       TRUE     TRUE     10758712
 9 AFR_5 AFR        4       8    10 (3690266 522249.6) TRUE    TRUE       TRUE     TRUE     10758713
10 AFR_5 AFR        4       9    10 (3690266 522249.6) TRUE    TRUE       TRUE     TRUE     10758713
# ā€¦ with 18,235 more rows

Note that the locations of nodes are in a projected coordinate system (units of meters) because this is the space in which the simulation is run (EPSG:3035 CRS that's used in the example minimizes the distortion of shapes/proportions/distances for West Eurasian continent).

What slendr does internally is that it transform the raster-based coordinates (0...width, 0...height) into the projected CRS used by the simulation. This is possible because all of that information is stored in the model slendr object, which is why it's needed in the ts_load() call. When a tree sequence is loaded by slendr, the ts_load() and other functions inspect the model object, and perform the spatial coordinate conversion automatically, under the hood.

If you want to get the coordinates back in the geographic CRS (latitude/longitude, which is encoded as EPSG:4326), you can convert it with standard sf functionality like this:

> sf::st_transform(data, crs = 4326)

[... snip ...]

overview of the underlying sf object:

# A tibble: 18,245 Ɨ 11
   name  pop   ind_id node_id  time             location sampled remembered retained alive pedigree_id
   <chr> <fct>  <dbl>   <int> <dbl>          <POINT [Ā°]> <lgl>   <lgl>      <lgl>    <lgl>       <dbl>
 1 AFR_1 AFR        0       0    10  (8.476209 30.06758) TRUE    TRUE       TRUE     TRUE     10758709
 2 AFR_1 AFR        0       1    10  (8.476209 30.06758) TRUE    TRUE       TRUE     TRUE     10758709
 3 AFR_2 AFR        1       2    10  (9.989998 30.88379) TRUE    TRUE       TRUE     TRUE     10758710
 4 AFR_2 AFR        1       3    10  (9.989998 30.88379) TRUE    TRUE       TRUE     TRUE     10758710
 5 AFR_3 AFR        2       4    10  (2.688129 30.96454) TRUE    TRUE       TRUE     TRUE     10758711
 6 AFR_3 AFR        2       5    10  (2.688129 30.96454) TRUE    TRUE       TRUE     TRUE     10758711
 7 AFR_4 AFR        3       6    10 (-3.663614 31.14246) TRUE    TRUE       TRUE     TRUE     10758712
 8 AFR_4 AFR        3       7    10 (-3.663614 31.14246) TRUE    TRUE       TRUE     TRUE     10758712
 9 AFR_5 AFR        4       8    10   (3.759709 27.3877) TRUE    TRUE       TRUE     TRUE     10758713
10 AFR_5 AFR        4       9    10   (3.759709 27.3877) TRUE    TRUE       TRUE     TRUE     10758713
# ā€¦ with 18,235 more rows

Note the location column now showing data points in degrees longitude/latitude.

Because this information is nothing but a slightly decorated data frame, you should be able to save it and load it into Python, maybe with something like geopandas?

Also, I wonder if working with the projected coordinates (here EPSG:3035, not the standard longitude/latitude EPSG:4326) wouldn't be more accurate, because this is what the 2D simulation landscape was actually defined with.


btw is 0. related to angle?

No, 0 is a potentially (currently unused) third dimension. We're simulating a 2D surface with [x,y] coordinates, but in principle, SLiM allows 3D coordinates for each individual (so [x,y,z], with z representing something like elevation). Even when a simulation occurs in 2D, SLiM still stores three coordinates, the last one being zero.


As an aside, the slendr metadata that's stored in a slendr tree sequence is not really standardized and not currently designed for public use. There's currently not a standardized way to store spatial information (locations in 2D/3D space, coordinate reference systems, cartographic data) in tskit in general, so the metadata that we are storing should be considered an internal matter and might change in the near future as the slendr internal change, more functionality is added, new simulation back ends implemented, or future standardization in tskit itself is added.

This is why using the ts_data() table result is a safe option, rather than poking into the raw metadata itself.

percyfal commented 2 years ago

Thanks for the detailed comment, now I understand much better how things relate and work. And you're probably right that it would be better to work directly with the projected coordinates. I have in mind testing the gnn plot on real data though where locations are given as lat,lon pairs, which is why I want to test it also on this data set.

I see now also that it is the pedigree_id that is the identifier of interest in the tree sequence individuals when comparing to e.g. the true locations.

I think I'll be able to figure out how to proceed, so I'll close the issue for now. Thanks!

percyfal commented 2 years ago

Rats - I actually had another question. In the test file test-ts.R you have a test where you generate a data frame joined, which looks like this for my data:

 head(joined, 2)
   name pop.x ind_id node_id time.x                 location sampled remembered
1 AFR_1   AFR      0     370     10  POINT (3249993 1198614)    TRUE       TRUE
2 AFR_2   AFR      1    2180     10 POINT (1896022 397850.5)    TRUE       TRUE
  retained alive pedigree_id gen pop_id pop.y        x        y time.y
1     TRUE  TRUE    10760615   0      0   AFR 160.4670 108.2410     10
2     TRUE  TRUE    10760616   0      0   AFR  24.9227  28.1041     10
  location_x location_y
1    3249993  1198614.2
2    1896022   397850.5

The columns x, y come from tree sequence individual data:

ts.individuals()[0], ts.individuals()[1]
(Individual(id=0, flags=458752, location=array([160.46683997, 108.24079614,   0.        ]), parents=array([17785, 18424], dtype=int32), nodes=array([370, 371], dtype=int32), metadata={'pedigree_id': 10760615, 'pedigree_p1': 10756318, 'pedigree_p2': 10757656, 'age': -1, 'subpopulation': 0, 'sex': -1, 'flags': 0}),
 Individual(id=1, flags=458752, location=array([24.92267718, 28.10405201,  0.        ]), parents=array([18675, 18059], dtype=int32), nodes=array([2180, 2181], dtype=int32), metadata={'pedigree_id': 10760616, 'pedigree_p1': 10758207, 'pedigree_p2': 10756874, 'age': -1, 'subpopulation': 0, 'sex': -1, 'flags': 0}))

Do I understand it correctly that in order to get the map coordinates the SLiM metadata will not suffice, but I also need the file output_ind_locations.tsv.gz?

bodkan commented 2 years ago

Wow, you actually crawled through my hideous unit tests -- I'm sorry! šŸ¤£ I should work on cleaning those up. šŸ˜³

Do I understand it correctly that in order to get the map coordinates the SLiM metadata will not suffice, but I also need the file output_ind_locations.tsv.gz?

No, the only reason I'm comparing the locations to what's in output_ind_locations.tsv.gz is to test that all the complicated spatio-temporal juggling I do internally in transforming coordinates back and forth really works as it should. I'm basically taking the table produced by ts_data() (generated through the pyslim/tskit interface of slendr, joining various low-level tree sequence tables) to what is saved from SLiM itself during the SLiM run (stored in output_ind_locations.tsv.gz).

The first vignette that you started with (and that I'm recommending to start with) is unfortunately a bit outdated. Or incomplete, rather. It originated in the time before slendr had any tree-sequence capabilities whatsoever.

Maybe this gives a bit more context.


Under normal circumstances, you never want to run slim(..., save_locations = TRUE) because it will be very slow. This is only useful for low-level debugging (it saves the location of everyone who ever lived) and for creating GIF animations for slides etc.


If you have a tree-sequence object obtained by something like ts <- ts_load(model), then calling ts_data(ts) on that object gets you all the spatial coordinates that you need for the nodes in your tree sequence. All the loading and processing happens internally, and all the spatial information needed for the conversions is tracked by the tree-sequence object internally (fun fact: you can run attr(ts, "model")$world and see that it's always there, hidden šŸ™‚).

bodkan commented 2 years ago

The columns x, y come from tree sequence individual data:

ts.individuals()[0], ts.individuals()[1] (Individual(id=0, flags=458752, location=array([160.46683997, 108.24079614, 0. ]), parents=array([17785, 18424], dtype=int32), nodes=array([370, 371], dtype=int32), metadata={'pedigree_id': 10760615, 'pedigree_p1': 10756318, 'pedigree_p2': 10757656, 'age': -1, 'subpopulation': 0, 'sex': -1, 'flags': 0}), Individual(id=1, flags=458752, location=array([24.92267718, 28.10405201, 0. ]), parents=array([18675, 18059], dtype=int32), nodes=array([2180, 2181], dtype=int32), metadata={'pedigree_id': 10760616, 'pedigree_p1': 10758207, 'pedigree_p2': 10756874, 'age': -1, 'subpopulation': 0, 'sex': -1, 'flags': 0})) Do I understand it correctly that in order to get the map coordinates the SLiM metadata will not suffice, but I also need the file output_ind_locations.tsv.gz?

The x and y columns are the raster-based pixel coordinates I mentioned in my first reply, locations in the coordinate space (raster matrix) which is what SLiM operates with. That's what slendr internally translates to real-world projected CRS, and what's then stored in the location column of an sf-formatted data.frame.

This is why you get the same x and y values from processing the "raw" tree sequence data from pyslim/tskit. Because this is the only thing that those Python modules know and that they have access to. The reason slendr provides richer data is because of the additional work on geospatial conversion using the world-map CRS etc that slendr does on the R side of things.

percyfal commented 2 years ago

Hi @bodkan, thanks for all your input, now I got it to work. In case you're interested an example of the final output is here: https://github.com/tskit-dev/tsviz/issues/7#.

/P