gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Detect large drops in record count #854

Open timrobertson100 opened 1 year ago

timrobertson100 commented 1 year ago

Bionomia processing detected a large drop in records from this dataset. The publisher has been contacted by helpdesk.

We suggest that the code which detects changes in record IDs be modified to also detect when dataset processing would reduce the record count by more than a configurable N%, and to notify data managers for verification before proceeding.
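To make the suggestion concrete, here is a minimal sketch of such a guard. The class, method, and threshold names are illustrative only and do not correspond to existing pipelines code:

```java
/**
 * Sketch of a guard that flags a large drop in record count before processing continues.
 * All names here are hypothetical; the threshold would come from configuration.
 */
public class RecordCountDropGuard {

  private final double maxAllowedDropPercent; // the configurable N%

  public RecordCountDropGuard(double maxAllowedDropPercent) {
    this.maxAllowedDropPercent = maxAllowedDropPercent;
  }

  /** Returns true if the new count is lower than the previous count by more than N%. */
  public boolean isSuspiciousDrop(long previousCount, long newCount) {
    if (previousCount <= 0 || newCount >= previousCount) {
      return false; // nothing to compare against, or the dataset grew
    }
    double dropPercent = 100.0 * (previousCount - newCount) / previousCount;
    return dropPercent > maxAllowedDropPercent;
  }
}
```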

Does this make sense to do please @ManonGros?

ManonGros commented 1 year ago

I don't know if it is necessary, to be honest. I would imagine it is a rare situation (compared to the number of times a dataset is reduced voluntarily). I think this dataset is the first example like that I have encountered (usually when something goes wrong, all the records disappear). But if the implementation isn't too much work, I could see the value in at least emitting a warning message for those datasets.

dshorthouse commented 1 year ago

It is surprisingly more common than we think, and the causes are difficult to pigeonhole, though I have noticed it often depends on whether the endpoint is an IPT or a custom-crafted DwC-A. On average, 2-3 datasets every two weeks show decreases in record count significant enough that I investigate what might have happened to links made in Bionomia and then reach out to a responsible party if necessary. The implementation at the pipelines end would have to set a bar for what is "significant" for any one dataset, because it may not simply be a function of raw dataset size. For instance, a 1% drop in a 1M-record dataset may have a much greater downstream impact than a 1% drop in a 1K-record dataset (or vice versa), if only we had a mechanism to quantify that effect. Prior downloads made off GBIF in support of new research, and the negative impact on repeatability, would be one place to look for the impact of a diminution of records, but factoring that in would complicate what is otherwise a simple pause in processing until a human steps in.
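One way to address the point that a fixed percentage does not mean the same thing for a 1M-record dataset as for a 1K-record dataset would be to combine a relative threshold with an absolute floor. A hypothetical sketch, with all names and parameters illustrative:

```java
/**
 * Hypothetical variant that only flags a drop when it is both relatively large
 * (more than dropPercentThreshold) and absolutely large (more than minRecordsLost),
 * so small datasets do not trigger on tiny fluctuations.
 */
public static boolean isSignificantDrop(
    long previousCount, long newCount, double dropPercentThreshold, long minRecordsLost) {
  if (previousCount <= 0 || newCount >= previousCount) {
    return false; // no baseline, or the dataset grew
  }
  long lost = previousCount - newCount;
  double dropPercent = 100.0 * lost / previousCount;
  return dropPercent > dropPercentThreshold && lost > minRecordsLost;
}
```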

MattBlissett commented 1 year ago

In case it's useful, there is an API showing the number of records each time a dataset was crawled.

https://api.gbif.org/v1/ingestion/history/e523cf2d-4fc2-48d8-aa75-80ecbc90b3f5 (with the usual limit, offset parameters).

It's probably best to look at .results[].pipelineExecutions[].steps[0].numberRecords in the VERBATIM_TO_IDENTIFIER step.

Note this isn't part of the documented, stable API.
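For reference, a small self-contained sketch of reading those record counts with java.net.http and Jackson. It follows the field path given above (steps[0] taken as the VERBATIM_TO_IDENTIFIER step); since the endpoint is not part of the stable API, the response layout is an assumption and may change:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IngestionHistoryExample {
  public static void main(String[] args) throws Exception {
    String datasetKey = "e523cf2d-4fc2-48d8-aa75-80ecbc90b3f5";
    String url = "https://api.gbif.org/v1/ingestion/history/" + datasetKey + "?limit=20&offset=0";

    // Fetch the ingestion history page for the dataset.
    HttpResponse<String> response =
        HttpClient.newHttpClient()
            .send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

    JsonNode root = new ObjectMapper().readTree(response.body());

    // Walk .results[].pipelineExecutions[].steps[0].numberRecords as described above.
    for (JsonNode result : root.path("results")) {
      for (JsonNode execution : result.path("pipelineExecutions")) {
        JsonNode firstStep = execution.path("steps").path(0);
        System.out.println(firstStep.path("numberRecords").asLong());
      }
    }
  }
}
```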