gbif / occurrence-annotation

Experimental: Rule based annotation store
Apache License 2.0

add datasetKey as a potential qualifier #17

Closed jhnwllr closed 1 year ago

jhnwllr commented 1 year ago

I have been testing how well the current vocabulary and rule structure works on example taxonKeys.

One weakness is that we have no way to handle occurrences that are in a plausible location but are still suspicious.

I see two solutions to this scenario (without involving gbifIds).

  1. Fix occurrences at source, since this tool isn't meant to solve every data quality issue.
  2. Add datasetKey as a qualifier, so users can say datasetKey + taxonKey + Polygon -> "suspicious".
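
To make option 2 concrete, here is a rough sketch of how a rule qualified by datasetKey + taxonKey + Polygon might be evaluated against a single occurrence. It is an illustration only, not the annotation store's implementation: the `Rule` class, `point_in_polygon`, `rule_applies`, the taxon key `1234`, and the dataset keys `"dataset-A"` / `"dataset-B"` are all made up for the example, and the polygon test is a plain ray-casting check on (longitude, latitude) pairs.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule shape: taxonKey + polygon, optionally narrowed by datasetKey."""
    taxon_key: int
    polygon: Sequence[Point]
    annotation: str
    dataset_key: Optional[str] = None  # None means the rule applies to any dataset


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray-casting test: count polygon edges crossed by a ray going east from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rule_applies(rule: Rule, taxon_key: int, dataset_key: str, point: Point) -> bool:
    """A rule applies when the taxon matches, the dataset qualifier (if any) matches,
    and the occurrence falls inside the rule's polygon."""
    if rule.taxon_key != taxon_key:
        return False
    if rule.dataset_key is not None and rule.dataset_key != dataset_key:
        return False
    return point_in_polygon(point, rule.polygon)


# Made-up values: a box around the suspicious cluster, scoped to one dataset.
box = [(30.0, -5.0), (32.0, -5.0), (32.0, -3.0), (30.0, -3.0)]
rule = Rule(taxon_key=1234, polygon=box, annotation="suspicious", dataset_key="dataset-A")
```

With this shape, an occurrence of the same taxon inside the box but from a different dataset is left alone, which is the behaviour the qualifier is meant to provide.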

I am undecided whether this would be worth the added complexity.

Here is an example I made with lions. Occurrences in the yellow box are all from the same dataset and claim to be in the big national park above.

https://jhnwllr.github.io/panthera-leo-example-range-annotations/

timrobertson100 commented 1 year ago

Thanks, @jhnwllr - I'm inclined to suggest we do implement something that allows for option 2. @MortenHofft already pondered whether we should just have a flexible scope where any combination of filters is possible. If it is too flexible it becomes difficult to consume (i.e. what rules apply for <this data download>?), but having e.g. a datasetScope and taxonScope (and possibly more, like temporalScope, in the future) would be intuitive.

An alternative could be to create rules under the existing functionality using the scope of a dataset. However, those rules would be of the form dataset A + GEOM = "suspicious" which would flag records of any taxon, including genuine in-situ sightings. The motivation there was mainly to handle truly bogus coordinates (e.g. middle of the sea points against terrestrial-only data) where attempts to fix things at source failed. I suspect that is still useful in some cases.

timrobertson100 commented 1 year ago

I have changed the API by removing the current single-context approach (i.e. contextKey and contextType are gone) and replacing it with the simpler taxonKey and datasetKey, allowing either or both to be provided.

Finding rules can then still be done by querying using taxonKey or datasetKey, and taxon-based rules can be additionally scoped by the dataset when creating them.

@jhnwllr - please can you update your scripts, noting that taxonKey is an integer, not a string (see the README for an example)?
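
For anyone updating scripts, the following is a small in-memory sketch of the scoping behaviour described above; it is not the service's code and does not call the real API. Only `taxonKey` and `datasetKey` come from this thread. The `make_rule` and `find_rules` helpers, the `annotation` and `geometry` field names, the WKT strings, and the key values are assumptions made for illustration; see the README for the actual request format.

```python
from typing import Dict, List, Optional


def make_rule(annotation: str, geometry: str,
              taxon_key: Optional[int] = None,
              dataset_key: Optional[str] = None) -> Dict[str, object]:
    """Build a rule scoped by taxonKey, datasetKey, or both (hypothetical shape)."""
    if taxon_key is None and dataset_key is None:
        raise ValueError("provide taxonKey, datasetKey, or both")
    # taxonKey is an integer, not a string; bool is rejected since it subclasses int.
    if taxon_key is not None and (isinstance(taxon_key, bool) or not isinstance(taxon_key, int)):
        raise TypeError("taxonKey must be an integer")
    return {
        "taxonKey": taxon_key,
        "datasetKey": dataset_key,
        "geometry": geometry,      # assumed field name, WKT used for illustration
        "annotation": annotation,  # assumed field name
    }


def find_rules(rules: List[Dict[str, object]],
               taxon_key: Optional[int] = None,
               dataset_key: Optional[str] = None) -> List[Dict[str, object]]:
    """Return rules matching whichever of taxonKey / datasetKey was supplied."""
    return [
        r for r in rules
        if (taxon_key is None or r["taxonKey"] == taxon_key)
        and (dataset_key is None or r["datasetKey"] == dataset_key)
    ]


# Made-up keys and geometries, for illustration only.
rules = [
    make_rule("suspicious", "POLYGON((30 -5, 32 -5, 32 -3, 30 -3, 30 -5))",
              taxon_key=1234, dataset_key="dataset-A"),  # taxon rule scoped by dataset
    make_rule("native", "POLYGON((10 -30, 40 -30, 40 10, 10 10, 10 -30))",
              taxon_key=1234),                           # taxon-only rule
    make_rule("suspicious", "POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))",
              dataset_key="dataset-A"),                  # dataset-only rule
]

by_taxon = find_rules(rules, taxon_key=1234)
by_dataset = find_rules(rules, dataset_key="dataset-A")
by_both = find_rules(rules, taxon_key=1234, dataset_key="dataset-A")
```

Passing `taxon_key="1234"` to `make_rule` raises a `TypeError` in this sketch, mirroring the note that taxonKey is now an integer rather than a string.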