Investigate the difference in numbers of generalised and alreadyGeneralised

AtlasOfLivingAustralia / la-pipelines

Living Atlas Pipelines extensions


Investigate the difference in numbers of generalised and alreadyGeneralised #305

Closed djtfmartin closed 3 years ago

djtfmartin commented 3 years ago

We should investigate the difference in numbers between quoll and prod:

Quoll:

Prod:

cc @charvolant

charvolant commented 3 years ago

Looking at some examples, it appears that "already generalised" in the current system includes records whose coordinates are already below the precision imposed by the SDS, which leads to an "already generalized" term in the dataGeneralizations statement from the SDS. In pipelines, it is based on whether a specific dataGeneralizations or informationWithheld term is set.

Converting to biocache-store treatment.
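For readers following along, the two definitions described above can be contrasted with a minimal sketch. It is not the biocache-store or pipelines code: the `Record` class, its field names, and both function names are hypothetical, and it assumes coordinate resolution is expressed as a grid size in metres, so a larger value means coarser coordinates. Whether the comparison should be strict is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    # Hypothetical, simplified occurrence record; field names are illustrative only.
    coordinate_resolution_m: float              # grid size of supplied coordinates (larger = coarser)
    data_generalizations: Optional[str] = None  # provider-supplied dataGeneralizations term
    information_withheld: Optional[str] = None  # provider-supplied informationWithheld term


def already_generalised_biocache(record: Record, sds_resolution_m: float) -> bool:
    """biocache-store style: the supplied coordinates are already at least as
    coarse as the resolution the SDS would impose (assumed non-strict comparison)."""
    return record.coordinate_resolution_m >= sds_resolution_m


def already_generalised_pipelines(record: Record) -> bool:
    """pipelines style, as described in the comment above: one of the two
    terms was set on the incoming record."""
    return bool(record.data_generalizations) or bool(record.information_withheld)


# A coarse record with no terms set counts under one definition but not the other.
coarse = Record(coordinate_resolution_m=10_000.0)
print(already_generalised_biocache(coarse, sds_resolution_m=1_000.0))  # True
print(already_generalised_pipelines(coarse))                           # False
```

Under these assumptions the two definitions select different sets of records, which would be enough to produce diverging alreadyGeneralised counts.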

djtfmartin commented 3 years ago

The counts are now more comparable to current production.

LGTM.

Code merged from this PR: https://github.com/gbif/pipelines/pull/518. Moving to review, with @javier-molina to sign off.

javier-molina commented 3 years ago

Sorry @djtfmartin, I just noticed this was assigned to me, but I'm not sure I'm the best person to review it. However, from the first comparison in the description, it seems to me that one is using data profiles and the other has data profiles disabled.

Here is what I have with Data profile enabled:

[Screenshot: 2021-05-04, 3:01:09 pm]

[Screenshot: 2021-05-04, 3:02:47 pm]

To me, there is still something going on with the Generalised count: we seem to be marking double what we did in the past.

cc @charvolant

charvolant commented 3 years ago

With profiles disabled, prod and quoll are much more in line.

Prod

Quoll

There is something odd going on with the filters on prod, since it presents unrealistically low numbers of records removed.

charvolant commented 3 years ago

The most common generalised species, Chelonia mydas (Green Turtle), has comparable numbers of generalised records with no profiles (23384 quoll, 23428 prod). When quality filters are turned on, the numbers change to 23384 quoll and 28 prod. Many of these come from the IMOS data, which have spatial issues. For example, https://aws-biocache-quoll.ala.org.au/occurrences/9face0c5-6007-43a5-974e-433b6ce91a04 flags the uncertainty/precision transposition as a warning, but https://biocache.ala.org.au/occurrences/087eefe7-d283-476e-9f31-e9a508500095 treats these as a failure and also (incorrectly) flags a biome mismatch.
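To put the Chelonia mydas figures in proportion, here is a small worked calculation using only the four counts quoted above. The helper names are illustrative.

```python
# Generalised-record counts for Chelonia mydas, as quoted in the comment above.
counts = {
    "no_profiles": {"quoll": 23384, "prod": 23428},
    "quality_filters": {"quoll": 23384, "prod": 28},
}


def relative_difference(a: int, b: int) -> float:
    """Absolute difference as a fraction of the larger count."""
    return abs(a - b) / max(a, b)


def fraction_removed(before: int, after: int) -> float:
    """Share of records that disappear once quality filters are applied."""
    return (before - after) / before


baseline_gap = relative_difference(
    counts["no_profiles"]["quoll"], counts["no_profiles"]["prod"]
)
quoll_removed = fraction_removed(
    counts["no_profiles"]["quoll"], counts["quality_filters"]["quoll"]
)
prod_removed = fraction_removed(
    counts["no_profiles"]["prod"], counts["quality_filters"]["prod"]
)

print(f"baseline gap between systems: {baseline_gap:.2%}")  # 0.19%
print(f"removed by filters on quoll:  {quoll_removed:.2%}")  # 0.00%
print(f"removed by filters on prod:   {prod_removed:.2%}")   # 99.88%
```

Without profiles the two systems differ by 44 records, about 0.19%. With filters on, prod drops almost all of its records for this species while quoll drops none.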

charvolant commented 3 years ago

@javier-molina I think this is a by-product of quality filters triggering in production but not in pipelines.

javier-molina commented 3 years ago

@M-Nicholls, @peggynewman do you think it is the case that pipelines is doing a better job at flagging issues?

javier-molina commented 3 years ago

We will accept it as a fact unless otherwise demonstrated; there has been no further input on it.

M-Nicholls commented 3 years ago

@javier-molina, sorry this took me so long to look at. With the information available, I can't say whether pipelines is doing a better job at flagging issues. Pipelines looks to be flagging different issues, so the data pre-filtering is working differently: some of the assertions the filters rely on aren't generated, which, along with the different total record counts, is producing different results.
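The mechanism described here can be sketched as follows. This is an illustration only: the assertion names, record shape, and `apply_quality_filter` helper are hypothetical and are not taken from biocache or pipelines. It assumes a quality filter excludes any record carrying one of a configured set of assertions.

```python
from typing import Dict, List, Set

# Hypothetical assertion names, for illustration only.
FILTER_EXCLUDES: Set[str] = {"SPATIAL_FAILURE", "BIOME_MISMATCH"}


def apply_quality_filter(records: List[Dict], excluded: Set[str]) -> List[Dict]:
    """Keep only records that carry none of the excluded assertions."""
    return [r for r in records if not (r["assertions"] & excluded)]


# The same three source records, as two systems might annotate them.
system_a = [
    {"id": 1, "assertions": {"SPATIAL_FAILURE", "BIOME_MISMATCH"}},
    {"id": 2, "assertions": {"SPATIAL_FAILURE"}},
    {"id": 3, "assertions": set()},
]
system_b = [
    {"id": 1, "assertions": {"SPATIAL_WARNING"}},  # raised as a warning instead
    {"id": 2, "assertions": {"SPATIAL_WARNING"}},
    {"id": 3, "assertions": set()},
]

kept_a = apply_quality_filter(system_a, FILTER_EXCLUDES)
kept_b = apply_quality_filter(system_b, FILTER_EXCLUDES)
print(len(kept_a), len(kept_b))  # 1 3
```

With an identical filter definition, the system that never emits the excluded assertions keeps every record, so the filtered counts diverge even though the underlying data are the same.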