Open dermestid opened 3 years ago
A refinement of the above ideas, generalizing a variable grid:
Worth further consideration, since coarse-to-fine grids are needed to speed up retrieval of species data from GBIF.
Currently, coarse-to-fine gridding is avoided. Instead, GBIF species data retrieval is sped up by only using squares which have sequences from which to calculate PD. These are usually a minority of squares, and a decreasing fraction of the total as the grid becomes finer.
See #112 note about MAUP (the modifiable areal unit problem). To at least give the user an indication of MAUP sensitivity, additional division scheme options are needed. At the least, grids should have x/y offset values in addition to grid size.
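A minimal sketch of what an offset grid could look like, assuming square cells measured in degrees. The names `grid_cell`, `cell_bounds`, `x_offset` and `y_offset` are hypothetical and not taken from the script:

```python
import math


def grid_cell(lat, lon, size, x_offset=0.0, y_offset=0.0):
    """Return the (column, row) index of the grid square containing a point.

    size is the side length of a square in degrees; the offsets shift the
    grid origin east (x_offset) and north (y_offset), also in degrees.
    """
    col = math.floor((lon - x_offset) / size)
    row = math.floor((lat - y_offset) / size)
    return col, row


def cell_bounds(col, row, size, x_offset=0.0, y_offset=0.0):
    """Return (west, south, east, north) of the square with the given index."""
    west = x_offset + col * size
    south = y_offset + row * size
    return west, south, west + size, south + size


# The same point falls in a different square once the grid is shifted.
print(grid_cell(52, 13, size=10))                          # (1, 5)
print(grid_cell(52, 13, size=10, x_offset=5, y_offset=5))  # (0, 4)
```

Running the analysis over a few offsets at the same grid size, and comparing the results, would give a rough picture of how much the output depends on where the square boundaries happen to fall.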
At present, the script forms geographical sets of sequences according to the chosen location division scheme, then takes subsamples (only from those location sets which are bigger than the sample size). It then aligns each subsample, builds trees, etc.
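The split-then-subsample order described above can be sketched roughly as follows. This is an illustration only, not the script's actual code: the `(seq_id, lat, lon)` record layout and the function name `subsample_by_square` are assumptions, and the alignment and tree-building steps are left out.

```python
import math
import random
from collections import defaultdict


def subsample_by_square(records, size, sample_size, rng):
    """Group records into grid squares, then subsample each large-enough square.

    records: iterable of (seq_id, lat, lon) tuples (assumed layout).
    Returns {(col, row): [seq_id, ...]} for squares holding more than
    sample_size sequences; smaller squares are dropped.
    """
    squares = defaultdict(list)
    for seq_id, lat, lon in records:
        key = (math.floor(lon / size), math.floor(lat / size))
        squares[key].append(seq_id)
    return {
        key: rng.sample(ids, sample_size)
        for key, ids in squares.items()
        if len(ids) > sample_size
    }


records = [("a", 1, 1), ("b", 2, 2), ("c", 3, 3), ("d", 4, 4), ("e", 5, 5),
           ("f", 15, 15), ("g", 16, 16)]
samples = subsample_by_square(records, size=10, sample_size=3,
                              rng=random.Random(0))
# Only square (0, 0) survives; square (1, 1) holds just two sequences.
```

The final filter is where sparsely sampled squares get discarded, which is the behaviour the alternative approach below tries to avoid.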
A (possibly better) approach might be to subsample from the set of all sequences, and then assign a location to the subsample. This could be done after a "coarse" geographical division, to ensure that the range of locations does not cover a whole continent.
The advantage of sampling before location-splitting is that sparsely sampled areas do not get left out of a fine-grained geographical division. E.g., we can still use the 5 located sequences within the Sahara Desert, alongside the 1000s of sequences in precise locations across Sweden.
Disadvantages:
1. Divisions overlap, so that plotting the divisions as blobs is not readable. They should instead be plotted as points (which may have error bars).
2. Since locations are different in each run of the script, the tree length for a given area cannot be repeatedly calculated and averaged.
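A rough sketch of the subsample-then-locate idea, assuming the records passed in already belong to one coarse division. The function name `sample_then_locate` and the returned keys are hypothetical. The mean coordinate stands in for the assigned location and the standard deviation for the error bar; both are assumptions, and other summaries (e.g. median and range) would work as well.

```python
import random
import statistics


def sample_then_locate(records, sample_size, rng):
    """Draw one subsample and assign it a point location with a spread.

    records: list of (seq_id, lat, lon) tuples within one coarse division
    (assumed layout). The location is the mean latitude/longitude of the
    sampled sequences; the spread is the population standard deviation.
    Note: a plain mean of longitudes is wrong for samples that straddle
    the antimeridian; this sketch does not handle that case.
    """
    sample = rng.sample(records, sample_size)
    lats = [lat for _, lat, _ in sample]
    lons = [lon for _, _, lon in sample]
    return {
        "ids": [seq_id for seq_id, _, _ in sample],
        "lat": statistics.fmean(lats),
        "lon": statistics.fmean(lons),
        "lat_err": statistics.pstdev(lats),
        "lon_err": statistics.pstdev(lons),
    }


records = [("s1", 23.0, 5.0), ("s2", 24.5, 7.0), ("s3", 22.0, 3.5),
           ("s4", 25.0, 6.0), ("s5", 23.5, 4.0)]
point = sample_then_locate(records, sample_size=3, rng=random.Random(1))
```

Each call yields a different point, which is the second disadvantage in practice: results from repeated runs can be plotted together, but not averaged per area.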
A more sophisticated approach to either the above or the present version would be to vary the geographical division scheme across the globe depending on sequence density. For example, a 2x2 grid in northern Europe and a 30x30 grid in central Africa.
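One way a density-dependent grid could be built is by recursive (quadtree-style) subdivision: start from a coarse square and keep splitting any square that holds too many sequences. This is only one possible scheme and is not something the script currently does. The names `adaptive_squares`, `max_points` and `min_size` are hypothetical, and square sizes come out as the starting size divided by powers of two rather than arbitrary values like 2 or 30.

```python
def adaptive_squares(points, west, south, size, max_points, min_size):
    """Recursively split a square until each leaf holds few enough points.

    points: list of (lat, lon) tuples.
    Returns a list of (west, south, size, count) leaf squares. Empty
    squares are omitted. A square is not split further if halving it
    would go below min_size.
    """
    inside = [
        (lat, lon) for lat, lon in points
        if south <= lat < south + size and west <= lon < west + size
    ]
    if not inside:
        return []
    if len(inside) <= max_points or size / 2 < min_size:
        return [(west, south, size, len(inside))]
    half = size / 2
    leaves = []
    for dx in (0, half):
        for dy in (0, half):
            leaves.extend(adaptive_squares(inside, west + dx, south + dy,
                                           half, max_points, min_size))
    return leaves


# A dense cluster ends up in a small square, an isolated point in a large one.
points = [(0.1, 0.1), (0.2, 0.8), (0.7, 0.3), (0.9, 0.9), (50, 50)]
leaves = adaptive_squares(points, west=0, south=0, size=64,
                          max_points=2, min_size=1)
```

Because the leaf squares depend on the data rather than on a fixed origin, this scheme would interact with the MAUP concern above and would probably need its own sensitivity check.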