Different approaches to geographical subsampling

dermestid commented 3 years ago

At present, the script forms geographical sets of sequences according to the chosen location division scheme, then takes a subsample from each set (only from those location sets that are larger than the sample size). It then aligns each subsample, builds trees, etc.
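A minimal Python sketch of the divide-then-subsample order described above, for illustration only. It is not taken from the script: the function name, the `location_of` callback and the record layout are all hypothetical.

```python
import random
from collections import defaultdict


def subsample_by_location(sequences, location_of, sample_size, rng=None):
    """Group sequences by location, then subsample each sufficiently large group.

    location_of is a hypothetical callback mapping a sequence record to its
    location label under the chosen division scheme.
    """
    rng = rng or random.Random()
    groups = defaultdict(list)
    for seq in sequences:
        groups[location_of(seq)].append(seq)
    # Only location sets bigger than the sample size are subsampled;
    # smaller sets are left out entirely.
    return {
        loc: rng.sample(members, sample_size)
        for loc, members in groups.items()
        if len(members) > sample_size
    }
```

With ten records labelled "Sweden" and three labelled "Sahara", a sample size of 5 yields a subsample for Sweden only, which is the weakness discussed below.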

A (possibly better) approach might be to subsample from the set of all sequences, and then assign a location to the subsample. This could be done after a "coarse" geographical division, to ensure that the range of locations does not cover a whole continent.

The advantage of sampling before location-splitting is that sparsely sampled areas do not get left out of a fine-grained geographical division. E.g., we can still use the 5 located sequences within the Sahara Desert, alongside the 1000s of sequences in precise locations across Sweden.

Disadvantages:

1. Divisions overlap, so plotting the divisions as blobs is not readable. They should instead be plotted as points (which may have error bars).
2. Since locations are different in each run of the script, the tree length for a given area cannot be calculated repeatedly and averaged.
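The sample-first approach could be sketched roughly as below. How a location is "assigned" to a subsample is not decided in this issue; the sketch assumes the mean latitude/longitude of the sampled records, with the standard deviation as a possible error bar. The function name and the `(id, lat, lon)` record layout are hypothetical.

```python
import random
import statistics


def sample_then_locate(records, sample_size, rng=None):
    """Draw a subsample first, then derive a single location for it.

    records are (id, lat, lon) tuples, assumed to come from one coarse
    division so that the spread of locations stays limited.
    """
    rng = rng or random.Random()
    sample = rng.sample(records, sample_size)
    lats = [lat for _, lat, _ in sample]
    lons = [lon for _, _, lon in sample]
    return {
        "sample": sample,
        # Assumption: the location is the plain mean of the coordinates.
        # This ignores longitude wrap-around at the antimeridian.
        "lat": statistics.mean(lats),
        "lon": statistics.mean(lons),
        # Population standard deviations, usable as error bars when plotting.
        "lat_sd": statistics.pstdev(lats),
        "lon_sd": statistics.pstdev(lons),
    }
```

Because the subsample is redrawn on every run, the returned location also changes from run to run, which is the second disadvantage listed above.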

A more sophisticated approach, applicable to either the idea above or the present version, would be to vary the geographical division scheme across the globe depending on sequence density: for example, a 2x2 grid in northern Europe and a 30x30 grid in central Africa.

dermestid commented 3 years ago

A refinement of the above ideas, generalizing a variable grid:

1. Have several "layers" of geographical division, going from coarse to fine.
2. First divide by the coarsest layer, and then within those divisions, divide by the next layer.
3. For each (coarse) division, check whether there are at least MIN_SAMPLE_SIZE samples within every sub-division (excluding those that are empty or nearly empty).
   - 3.1. If not, subsample within the coarser division and build trees.
   - 3.2. If there are, use the finer division. Repeat from step 2 using this and the next finer division.
4. Continue until no division can be refined any further, and use the generated trees.
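The layered refinement above might look something like the following Python sketch. It only decides which divisions to use; subsampling and tree building are left out. Several things are assumptions rather than anything settled here: layers are square grids given as cell sizes in degrees, each size divides the previous one so that cells nest, "nearly empty" means at most `NEARLY_EMPTY` points, and nearly empty sub-divisions are dropped when a division is refined. All names apart from MIN_SAMPLE_SIZE are hypothetical.

```python
import math
from collections import defaultdict

MIN_SAMPLE_SIZE = 5  # illustrative value only
NEARLY_EMPTY = 1     # assumption: sub-divisions with this many points or fewer are ignored


def split(points, size_deg):
    """Group (lat, lon) points by the square grid cell of side size_deg they fall in."""
    cells = defaultdict(list)
    for lat, lon in points:
        cells[(math.floor(lat / size_deg), math.floor(lon / size_deg))].append((lat, lon))
    return cells


def refine_cell(cell, members, layers, depth):
    """Refine one division while every usable sub-division has enough samples."""
    if depth + 1 < len(layers):
        sub = split(members, layers[depth + 1])
        usable = {c: m for c, m in sub.items() if len(m) > NEARLY_EMPTY}
        if usable and all(len(m) >= MIN_SAMPLE_SIZE for m in usable.values()):
            # Step 3.2: use the finer division and try the next layer.
            out = []
            for sub_cell, sub_members in usable.items():
                out.extend(refine_cell(sub_cell, sub_members, layers, depth + 1))
            return out
    # Step 3.1: stay at this (coarser) division.
    return [(layers[depth], cell, members)]


def refine(points, layers):
    """Return the final divisions as (cell_size, cell_index, points) tuples."""
    divisions = []
    for cell, members in split(points, layers[0]).items():
        divisions.extend(refine_cell(cell, members, layers, 0))
    return divisions
```

Each returned division would then be subsampled and passed to alignment and tree building. A division that is itself smaller than MIN_SAMPLE_SIZE is still returned, so the caller would need to decide whether to skip it.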

dermestid commented 3 years ago

Worth further consideration, since coarse-to-fine grids are needed to speed up retrieval of species data from GBIF.

dermestid commented 3 years ago

Currently, coarse-to-fine gridding is avoided. Instead, GBIF species data retrieval is sped up by retrieving data only for squares that have sequences from which to calculate PD. These are usually a minority of squares (a decreasing fraction of the total as the grid becomes finer).
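As a rough illustration of why restricting retrieval to occupied squares helps, the Python sketch below collects the squares that contain at least one sequence and reports what fraction of a global grid they make up. The function names are hypothetical, the grid is assumed to be a plain latitude/longitude grid whose cell size divides 180 and 360, and no GBIF call is shown.

```python
import math


def occupied_cells(points, size_deg):
    """Return the set of grid cells that contain at least one (lat, lon) point."""
    return {
        (math.floor(lat / size_deg), math.floor(lon / size_deg))
        for lat, lon in points
    }


def occupied_fraction(points, size_deg):
    """Fraction of all cells in a global grid that contain at least one point."""
    total = round(180 / size_deg) * round(360 / size_deg)
    return len(occupied_cells(points, size_deg)) / total
```

Species data would then be requested only for the cells returned by `occupied_cells`. For a fixed set of points, `occupied_fraction` cannot grow faster than the cell count, so the fraction shrinks as the grid gets finer.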

See the note in #112 about MAUP (the modifiable areal unit problem). To at least give the user an indication of MAUP sensitivity, additional division scheme options are needed. At a minimum, grids should have x/y offset values in addition to grid size.
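A grid with x/y offsets could be as simple as the Python sketch below. The function name and parameters are hypothetical, and cells are assumed to be squares measured in degrees.

```python
import math


def grid_cell(lat, lon, size_deg, lat_offset=0.0, lon_offset=0.0):
    """Return the (row, col) index of the cell containing a point.

    The offsets shift the grid origin, so the same points can be re-binned
    under a displaced grid without changing the cell size.
    """
    row = math.floor((lat - lat_offset) / size_deg)
    col = math.floor((lon - lon_offset) / size_deg)
    return row, col


# Two nearby points share a cell on the unshifted 10-degree grid...
a, b = (4.0, 4.0), (6.0, 6.0)
same_unshifted = grid_cell(*a, 10) == grid_cell(*b, 10)
# ...but are separated once the grid is shifted by 5 degrees in both directions.
same_shifted = grid_cell(*a, 10, 5, 5) == grid_cell(*b, 10, 5, 5)
```

Rerunning the analysis over a few offsets and comparing the results per area would give a first impression of how sensitive they are to where the cell boundaries fall.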

dermestid / bold-phylodiv-scripts

Different approaches to geographical subsampling #49