AtlasOfLivingAustralia / DataQuality

Data Quality

Detect if a species occurrence record is within its expected spatial distribution #255

Open M-Nicholls opened 3 years ago

M-Nicholls commented 3 years ago

Where should this occur - part of the pipelines or a separate process?

check that layers are available for outlier detection

run expert distribution outlier detection - is there an expert distribution for the species? If so, detect whether the occurrence record point is in/out of the expert distribution

add a field to the record giving the distance of the point inside/outside the expected distribution

add expert distribution outlier category (compare the distance inside/outside the distribution boundary to the uncertainty)
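The in/out test and the distance field described above could be sketched as follows. This is a minimal illustration, not the pipelines implementation: it assumes planar (projected) coordinates, a single polygon without holes, and a sign convention (negative inside, positive outside) that the issue does not specify. The function names are hypothetical.

```python
import math

def point_in_polygon(pt, polygon):
    """Ray-casting test. polygon is a list of (x, y) vertices, not closed."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def distance_to_segment(pt, a, b):
    """Shortest planar distance from pt to the segment a-b."""
    px, py = pt
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project pt onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def signed_distance(pt, polygon):
    """Distance to the boundary: negative inside, positive outside (assumed convention)."""
    n = len(polygon)
    d = min(
        distance_to_segment(pt, polygon[i], polygon[(i + 1) % n])
        for i in range(n)
    )
    return -d if point_in_polygon(pt, polygon) else d
```

Real expert distribution layers use geographic coordinates and may be multipolygons with holes, so a production version would more likely rely on a geometry library than on hand-written tests like these.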

Two scenarios:

1. Calculate all existing occurrences against the existing expert distribution layers - a one-time run.

2. Re-calculate the related species when a new expert distribution layer is added.

Link to pipeline issue: https://github.com/gbif/pipelines/issues/622

Link to Spatial issue: https://github.com/AtlasOfLivingAustralia/spatial-service/issues/186

M-Nicholls commented 3 years ago

what to do with generalised records

how to take record uncertainty into account

use the size of the distribution to determine how much the uncertainty or generalisation matters? i.e. for a very small distribution, uncertainty and generalisation will make a big difference as to whether the point is in or out

should a record be considered in or out if its uncertainty puts it in the range but the point is outside the range?

indicate the point is in/out but based on the uncertainty the record may be out/in

categories:

within expected distribution - point and full uncertainty are within the range

likely within expected distribution - point is within the range, but the uncertainty extends outside it

may be within expected distribution - point is outside the range, but the uncertainty overlaps the range

outside expected distribution - point and full uncertainty are outside the range

use of categories and distance outside distribution provides a thorough combination of metrics
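The four categories can be derived from three values: whether the point is inside the range, its distance to the range boundary, and the record's coordinate uncertainty. The sketch below is one possible reading, assuming the uncertainty is a radius in the same units as the distance and that an uncertainty circle exactly touching the boundary counts as "within" from the inside and as overlapping from the outside. The function and parameter names are hypothetical.

```python
def classify(point_inside, boundary_distance, uncertainty):
    """Map a record to an expert distribution outlier category.

    point_inside: True if the point lies inside the expert distribution.
    boundary_distance: unsigned distance from the point to the range boundary.
    uncertainty: coordinate uncertainty radius, in the same units.
    """
    if point_inside:
        # The whole uncertainty circle fits inside the range.
        if uncertainty <= boundary_distance:
            return "within expected distribution"
        return "likely within expected distribution"
    # Point is outside: does the uncertainty circle reach the range?
    if uncertainty >= boundary_distance:
        return "may be within expected distribution"
    return "outside expected distribution"
```

For example, a point 2 km outside the range with a 5 km uncertainty would be reported as "may be within expected distribution", while the same point with a 1 km uncertainty would be "outside expected distribution".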

M-Nicholls commented 3 years ago

Add to data pre-filters

update assertion metadata

update support material

M-Nicholls commented 3 years ago

what to do if there are multiple overlapping layers - e.g. likely | maybe layers and separate east coast/west coast layers e.g. grey nurse shark

qifeng-bai commented 2 years ago

what to do if there are multiple overlapping layers - e.g. likely | maybe layers and separate east coast/west coast layers e.g. grey nurse shark

Having a single layer or multiple layers won't affect the in/out calculation, but multiple layers make the distance calculation more difficult

qifeng-bai commented 2 years ago

Solution: Jenkins is scheduled to run the program once every day.

For every run:

Pipelines loads all indexed records.

It compares them with the existing outlier records and keeps only the newly added records.

It calculates outliers for those new records ONLY.

If a new expert layer is added or updated, manually delete the existing outlier records; Pipelines will then recalculate all indexed records.
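The incremental run described above can be illustrated with a small sketch. It is not the Pipelines code: the in-memory dictionaries stand in for the index and the outlier store, and `daily_run` and `compute_outlier` are hypothetical names.

```python
def daily_run(indexed_records, outlier_store, compute_outlier):
    """Process only records that have no outlier result yet.

    indexed_records: dict of record id -> record (stand-in for the index).
    outlier_store: dict of record id -> outlier result, updated in place.
    compute_outlier: function that calculates the result for one record.
    Returns the ids processed in this run.
    """
    new_ids = [rid for rid in indexed_records if rid not in outlier_store]
    for rid in new_ids:
        outlier_store[rid] = compute_outlier(indexed_records[rid])
    return new_ids


# Usage: the first run handles only the record without a stored result.
records = {"a": 1, "b": 2, "c": 3}
store = {"a": "done", "b": "done"}
first = daily_run(records, store, lambda rec: "done")

# After a layer change, clearing the store forces a full recalculation.
store.clear()
second = daily_run(records, store, lambda rec: "done")
```

Clearing the store is the equivalent of manually deleting the existing outlier records: with nothing to compare against, every indexed record counts as new on the next run.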