djtfmartin closed 3 years ago
Looking at some examples, it looks like "already generalised" in the current system includes records whose coordinates are already below the precision imposed by the SDS, which leads to an "already generalized" term in the dataGeneralizations statement from the SDS. In pipelines, it's based on whether there's a specific dataGeneralizations or informationWithheld term set.
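A minimal sketch of the two tests described above, to make the difference concrete. This is not the actual SDS or pipelines code: the `Record` class, its field names, the function names and the "grid size in degrees" model of precision are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    # Hypothetical model: grid size of the supplied coordinates in decimal
    # degrees. A larger value means coarser (less precise) coordinates.
    coordinate_grid_deg: Optional[float] = None
    data_generalizations: Optional[str] = None
    information_withheld: Optional[str] = None


def already_generalised_legacy(record: Record, sds_grid_deg: float) -> bool:
    """Current-system style test (assumed): the supplied coordinates are
    already at least as coarse as the grid the SDS would impose."""
    if record.coordinate_grid_deg is None:
        return False
    return record.coordinate_grid_deg >= sds_grid_deg


def already_generalised_pipelines(record: Record) -> bool:
    """Pipelines style test (assumed): a dataGeneralizations or
    informationWithheld term has been supplied with the record."""
    return bool(record.data_generalizations) or bool(record.information_withheld)


# A coarse record with neither term set is counted by only one of the tests.
coarse = Record(coordinate_grid_deg=1.0)
print(already_generalised_legacy(coarse, sds_grid_deg=0.1))
print(already_generalised_pipelines(coarse))
```

Under these assumptions, a record with coarse coordinates and no terms set counts as already generalised only in the current system, and a precise record with a dataGeneralizations term counts only in pipelines. That would be enough for the two counts to diverge.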
Converting to biocache-store treatment.
The counts are more comparable to current production.
LGTM.
Code merged from this PR https://github.com/gbif/pipelines/pull/518 Moving to review - @javier-molina to sign off.
Sorry @djtfmartin, I just noticed this was assigned to me, but I'm not sure I'm the best person to review it. However, from the first comparison in the description, it seems to me that one is using Data profiles and the other has Data profiles disabled.
Here is what I have with Data profile enabled:
To me there is still something going on with the Generalised count: we seem to be marking double what we did in the past.
cc @charvolant
With profiles disabled, prod and quoll are much more in line.
Prod
Quoll
There is something odd going on with the filters on prod, since it presents unrealistically low numbers of records removed.
The most common generalised species, Chelonia mydas (Green Turtle), has comparable numbers of generalised records with no profiles (23384 quoll, 23428 prod). When quality filters are turned on, the numbers change to 23384 quoll and 28 prod. Many of these come from the IMOS data, which have spatial issues. For example, https://aws-biocache-quoll.ala.org.au/occurrences/9face0c5-6007-43a5-974e-433b6ce91a04 flags the uncertainty/precision transposition as a warning, but https://biocache.ala.org.au/occurrences/087eefe7-d283-476e-9f31-e9a508500095 treats it as a failure and also (incorrectly) flags a biome mismatch.
@javier-molina I think this is a by-product of quality filters triggering in production but not in pipelines.
@M-Nicholls, @peggynewman do you think it is the case that pipelines is doing a better job at flagging issues?
We will accept it as fact unless otherwise demonstrated; there has been no further input on it.
@javier-molina, sorry this took me so long to look at. With the information available I can't say whether pipelines is doing a better job at flagging issues. Pipelines looks to be flagging different issues, so the data pre-filtering is working differently: some of the assertions the filters rely on aren't generated, which, along with the different total record counts, is producing different results.
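To illustrate the point about filters relying on assertions that aren't generated, here is a rough sketch. It is not the real quality-filter implementation: the function, the record layout and the assertion name `EXAMPLE_SPATIAL_FAILURE` are hypothetical.

```python
def apply_quality_filter(records, excluded_assertions):
    """Keep only records that carry none of the excluded assertions."""
    excluded = set(excluded_assertions)
    return [r for r in records if not excluded & set(r["assertions"])]


# Hypothetical data: the same three records as indexed by each system.
# In "prod" two of them carry the assertion as a failure; in "quoll" the
# assertion is never generated, so nothing is there to filter on.
prod = [
    {"id": "a", "assertions": ["EXAMPLE_SPATIAL_FAILURE"]},
    {"id": "b", "assertions": ["EXAMPLE_SPATIAL_FAILURE"]},
    {"id": "c", "assertions": []},
]
quoll = [
    {"id": "a", "assertions": []},
    {"id": "b", "assertions": []},
    {"id": "c", "assertions": []},
]

profile = ["EXAMPLE_SPATIAL_FAILURE"]
print(len(apply_quality_filter(prod, profile)))
print(len(apply_quality_filter(quoll, profile)))
```

With the same profile applied, the first index drops the flagged records while the second keeps everything. That is the kind of gap that would show up as different filtered counts, without saying which system's flagging is the more correct.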
We should investigate the difference in numbers between quoll and prod:
Quoll:
Prod:
cc @charvolant