Extensions for random grid assignment

damianooldoni commented 1 year ago

Suggested during the meeting on Sept 20

At the moment a uniform distribution is used in the random grid assignment method, i.e. all points within the circle have the same probability to be used to assign the occurrence to a grid cell.

@qgroom, @shawndove, @wlangera, @ToonVanDaele and I were discussing the robustness of the random grid assignment method and the following ideas emerged:

1. Is it possible to use a normal distribution while applying the random assignment? The normal distribution seems more suitable for data acquired using GPS technology.
2. There is also interest in testing other distributions: is there enough flexibility to allow the user to provide a distribution as input? The chosen distribution should be mentioned in the metadata, as should the seed used for the randomization.
3. Would it be possible to assign occurrences published without coordinateUncertaintyInMeters directly to the grid cell they belong to? Mathematically speaking this is also a kind of "random" assignment, following the δ distribution, where P(x) = 1 at one point and 0 everywhere else. Note that a grid cell occupied only by this kind of point should have min_coord_uncertainty = NULL. I also wonder whether GBIF allows the publication of occurrences with coordinateUncertaintyInMeters = 0.

To everybody mentioned above: feel free to add remarks/corrections/...

peterdesmet commented 12 months ago

What is the reasoning for 1? It seems to assume points are more likely to be collected near the center? That would be false for grid collected observations.

What is the reasoning for 3? It does seem more straightforward to document, which I like.

damianooldoni commented 12 months ago

The idea behind 1 is that, when the location is retrieved via GPS, the outer regions of the uncertainty circle are less likely than the inner region. It's not really a Gaussian phenomenon, but it can be approximated with a normal distribution as a first approximation. In other words, using the uniform distribution is far too conservative an assumption. As you said, it will be false for grid-collected observations. But BMK will study the properties of cubes with GPS-only data or with gridded data only, so they would like to study what happens with a more fitted and realistic distribution for GPS data. @wlangera: maybe you can provide more insights?

The 3rd idea was launched by @shawndove: I think he would like to compare the results under the assumption that occurrences without coordinateUncertaintyInMeters are still "precise" data. Probably he will limit his study to some datasets where he can show that such data are not gridded and are precise enough to allow such assignment of occurrences to grid cells. @shawndove: any extra insight is welcome.

shawndove commented 12 months ago

I'm not sure I fully understand what @damianooldoni has written here, but the idea I proposed was to have an option to assign all occurrences to the grid cells they belong to so that the uncertainty can be processed downstream during the calculation of the indicator. Each occurrence would still have an uncertainty associated with it, but it would not be used for grid assignment when producing the cube.

wlangera commented 12 months ago

Hi, the three options described by @damianooldoni are based on some first analyses (simulations) that can be found in the simulations folder of the occurrence-cube-paper repo. More specifically, in assignment_options_cube.Rmd we formulated these 3 potential options as follows:

1. uniform(0, coordinateUncertaintyInMeters)
   - no modifiable parameters necessary
   - each location within the uncertainty circle has an equal probability
   - useful when there are multiple uncertainty generation processes involved or when the process(es) is (are) unknown
2. normal(mu = 0, var = -coordinateUncertaintyInMeters² / (2 ln(1 - p)))
   - p can be chosen by the user, default value is 0.95 (95 % of points fall within the circle with radius equal to coordinateUncertaintyInMeters)
   - locations near the center of the uncertainty circle have a higher probability than those near the edge
   - the user might argue that a normal distribution is more desirable than a uniform distribution for multiple reasons, e.g. all data come from one project where uncertainty is due to the GPS measuring device
3. No distribution
   - cube based on point coordinates from GBIF without random assignment
   - a measure of uncertainty based on coordinateUncertaintyInMeters is given with each grid cell, e.g. minimal coordinate uncertainty as done with the TrIAS cube
   - might be useful if the occurrence uncertainty is to be included in the downstream analysis, e.g. when developing indicators, rather than being used for creating the cube itself
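
To make the three options concrete, here is a minimal Python sketch (not the code from assignment_options_cube.Rmd, which is in R). It assumes projected coordinates in meters and a square grid; the function names `normal_variance`, `sample_offset` and `grid_cell` are hypothetical. The variance for the normal option follows from the fact that the distance from the center of two independent normal offsets is Rayleigh distributed, so P(distance ≤ r) = 1 - exp(-r² / (2 var)); setting this equal to p for r = coordinateUncertaintyInMeters gives the expression above.

```python
import math
import random


def normal_variance(radius_m, p=0.95):
    """Per-axis variance such that a fraction p of draws land within radius_m."""
    return -radius_m ** 2 / (2 * math.log(1 - p))


def sample_offset(radius_m, method, rng, p=0.95):
    """Return a hypothetical (dx, dy) offset in meters from the recorded point."""
    if method == "none" or radius_m == 0:
        # Option 3: keep the point coordinates as published.
        return 0.0, 0.0
    if method == "uniform":
        # sqrt() makes the draw uniform over the disc area, not over the radius.
        r = radius_m * math.sqrt(rng.random())
        theta = rng.uniform(0.0, 2.0 * math.pi)
        return r * math.cos(theta), r * math.sin(theta)
    if method == "normal":
        sigma = math.sqrt(normal_variance(radius_m, p))
        return rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)
    raise ValueError(f"unknown method: {method}")


def grid_cell(x, y, cell_size_m):
    """Index of the square grid cell containing (x, y); an assumed grid layout."""
    return math.floor(x / cell_size_m), math.floor(y / cell_size_m)


rng = random.Random(42)  # fixed seed so the assignment is reproducible
for method in ("uniform", "normal", "none"):
    dx, dy = sample_offset(1000, method, rng)
    print(method, grid_cell(500 + dx, 500 + dy, 1000))
```

Note that with the normal option a fraction 1 - p of the draws falls outside the uncertainty circle, whereas the uniform option never leaves it.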

@damianooldoni your description of the third idea is different from how @shawndove (his comment above) and I (point 3 here) understood it ...

So I think the questions are (@damianooldoni ?):

1. Is it possible to use a normal distribution while applying the random assignment? The normal distribution seems more suitable for data acquired using GPS technology.
2. There is also interest in testing other distributions: is there enough flexibility to allow the user to provide a distribution as input? The chosen distribution should be mentioned in the metadata, as should the seed used for the randomization.
3. Is it possible to do the assignment non-randomly, based on the point coordinates? To still have a measure of uncertainty per grid cell of the cube, we could calculate a new variable, e.g. minimalCoordinateUncertaintyInMeters. This could be used for downstream processing.

peterdesmet commented 11 months ago

Is it possible to use a normal distribution while applying the random assignment? The normal distribution seems more suitable for data acquired using GPS technology.

Yes, that can likely be offered as an option. I would still choose the uniform method as the default though, since "useful when there are multiple uncertainty generation processes involved or when the process(es) is (are) unknown" describes GBIF data from all sources pretty well.

There is also interest in testing other distributions: is there enough flexibility to allow the user to provide a distribution as input? The chosen distribution should be mentioned in the metadata, as should the seed used for the randomization.

Yes, and for now we can offer uniform and normal (I have added normal to the specs). I suggest that other distributions are implemented on a per use case basis.
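
As an illustration of what "providing a distribution as input" could look like, here is a small Python sketch. It is not the cube implementation: `randomize`, `triangular_disc` and the metadata keys are hypothetical names, and the triangular distribution is only an arbitrary example of a user-supplied choice. The point is that the distribution name and the seed are returned together with the result, so they can be written to the metadata.

```python
import math
import random


def triangular_disc(radius_m, rng):
    """Example user-supplied distribution: radius drawn from a triangular
    distribution peaking at the center, direction drawn uniformly."""
    r = rng.triangular(0.0, radius_m, 0.0)
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return r * math.cos(theta), r * math.sin(theta)


def randomize(points, distribution, seed):
    """Shift each (x, y, radius_m) point by an offset drawn from `distribution`.

    `distribution` is any callable taking (radius_m, rng) and returning (dx, dy).
    Returns the shifted points plus the metadata needed to reproduce them.
    """
    rng = random.Random(seed)
    moved = []
    for x, y, radius_m in points:
        dx, dy = distribution(radius_m, rng)
        moved.append((x + dx, y + dy))
    metadata = {"distribution": distribution.__name__, "seed": seed}
    return moved, metadata


points = [(0.0, 0.0, 100.0), (10.0, 10.0, 0.0)]
moved, metadata = randomize(points, triangular_disc, seed=7)
print(metadata)  # the record to store alongside the cube
```

Running `randomize` again with the same seed and distribution reproduces the same shifted points, which is why both belong in the metadata.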

Is it possible to do the assignment non-randomly, based on the point coordinates? To still have a measure of uncertainty per grid cell of the cube, we could calculate a new variable, e.g. minimalCoordinateUncertaintyInMeters. This could be used for downstream processing.

Yes, @MattBlissett has already implemented the option to set the default uncertainty to 0, which results in the point coordinates being used if no coordinateUncertaintyInMeters is provided by the source. I would still keep 1000 as the default.
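
A minimal Python sketch of that fallback behaviour, as I understand it from the description above (this is not @MattBlissett's implementation; the function names are hypothetical, and the default of 1000 is assumed to be in meters):

```python
DEFAULT_UNCERTAINTY_M = 1000  # assumed unit: meters


def effective_uncertainty(coordinate_uncertainty_m, default_m=DEFAULT_UNCERTAINTY_M):
    """Use the published uncertainty, or the default when the source gives none."""
    if coordinate_uncertainty_m is None:
        return default_m
    return coordinate_uncertainty_m


def uses_point_coordinates(coordinate_uncertainty_m, default_m=DEFAULT_UNCERTAINTY_M):
    """A radius of 0 leaves nothing to randomize, so the point itself is used."""
    return effective_uncertainty(coordinate_uncertainty_m, default_m) == 0


print(effective_uncertainty(None))                 # falls back to the default
print(uses_point_coordinates(None, default_m=0))   # default set to 0
print(uses_point_coordinates(25.0, default_m=0))   # published uncertainty wins
```

With the default set to 0, only occurrences that lack a published uncertainty are assigned by their point coordinates; occurrences with a published value are still randomized within their circle.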

Note regarding the statement by @shawndove:

Each occurrence would still have an uncertainty associated with it, but it would not be used for grid assignment when producing the cube.

Occurrences are no longer identifiable in a cube, so their individual uncertainty would be lost. As @wlangera indicates, it would be possible to have a summarized value, such as the minimum or maximum uncertainty for the selection of occurrences within a taxon/year/grid_cell row.
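
A small Python sketch of such a summary, assuming occurrences are available as plain records before aggregation. The record keys, the `summarize` function and the `max_coord_uncertainty` column are hypothetical; `min_coord_uncertainty` follows the name used earlier in this thread. A row made up only of occurrences without a published uncertainty gets a null value.

```python
from collections import defaultdict


def summarize(occurrences):
    """Aggregate occurrences into one row per taxon/year/grid_cell."""
    groups = defaultdict(list)
    for occ in occurrences:
        key = (occ["taxon"], occ["year"], occ["grid_cell"])
        groups[key].append(occ["coordinateUncertaintyInMeters"])

    rows = []
    for (taxon, year, cell), values in sorted(groups.items()):
        known = [v for v in values if v is not None]
        rows.append({
            "taxon": taxon,
            "year": year,
            "grid_cell": cell,
            "count": len(values),
            # None (NULL) when no occurrence in the row has an uncertainty
            "min_coord_uncertainty": min(known) if known else None,
            "max_coord_uncertainty": max(known) if known else None,
        })
    return rows


example = [
    {"taxon": "A", "year": 2020, "grid_cell": "c1", "coordinateUncertaintyInMeters": 30},
    {"taxon": "A", "year": 2020, "grid_cell": "c1", "coordinateUncertaintyInMeters": 500},
    {"taxon": "A", "year": 2020, "grid_cell": "c2", "coordinateUncertaintyInMeters": None},
]
for row in summarize(example):
    print(row)
```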

gbif / occurrence-cube

Extensions for random grid assignment #2