AtlasOfLivingAustralia / la-pipelines

Living Atlas Pipelines extensions

Supporting dataset deltas #11

Closed djtfmartin closed 3 years ago

djtfmartin commented 4 years ago

Supporting delta updates is still a requirement for certain data providers, including the natural history collections.

The current implementation of the GBIF pipelines only supports full extracts provided in DwCA as the main data input.

timrobertson100 commented 4 years ago

Challenges with deltas include handling deletions, the need to replay mutations in a strict order (typically handled with some kind of transaction log), handling failure cases where a delta is dropped or fails halfway through being applied, and the need to reprocess the whole dataset on e.g. a bug-fix code deployment. All of these are solvable, of course.
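
To make the ordering and deletion points concrete, here is a minimal sketch (hypothetical types, not existing ALA or GBIF code) of the kind of mutation log a delta feed would need: each change carries an explicit operation, so deletes can be expressed at all, and a monotonically increasing sequence number, so a dropped or out-of-order delta is detected before anything is applied.

```java
import java.util.List;

// Hypothetical delta mutation log entry: the operation makes deletions explicit,
// the sequence number makes the required replay order explicit.
enum Op { UPSERT, DELETE }

record Mutation(long sequence, Op op, String occurrenceId) {}

final class DeltaValidator {

  /**
   * Checks that a delta batch continues exactly where the last applied mutation
   * left off; returns the new high-water mark, or throws if a delta was dropped
   * or delivered out of order.
   */
  static long checkReplayable(List<Mutation> batch, long lastAppliedSequence) {
    long expected = lastAppliedSequence + 1;
    for (Mutation m : batch) {
      if (m.sequence() != expected) {
        throw new IllegalStateException(
            "Delta is not contiguous: expected sequence " + expected + " but got " + m.sequence());
      }
      expected++;
    }
    return expected - 1;
  }
}
```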

One obvious approach could be to have a store upstream of this processing pipeline that is responsible for merging the deltas and presenting the collection as a complete entity. This is probably easy given that collections are generally small datasets of a few million entries at most. It is not all that dissimilar to e.g. an ABCD dataset being assembled by a process paging over the BioCASe protocol.
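
As a rough illustration of that upstream store idea (names and structure are assumptions, not existing ALA/GBIF code): the store keeps the merged state keyed by record identifier, absorbs each delta as upserts and deletes, and always hands the pipeline a complete snapshot, so the pipeline itself only ever ingests a full dataset.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical upstream store that merges deltas and presents the collection
// as a complete entity to the downstream pipeline.
final class UpstreamCollectionStore {

  // Current merged view of the collection, keyed by record identifier.
  private final Map<String, Map<String, String>> current = new LinkedHashMap<>();

  /** Insert a new record, or merge the supplied fields over an existing one. */
  void upsert(String recordId, Map<String, String> fields) {
    current.merge(recordId, fields, (existing, update) -> {
      Map<String, String> merged = new LinkedHashMap<>(existing);
      merged.putAll(update);
      return merged;
    });
  }

  /** Remove a record; deletions never reach the pipeline as such. */
  void delete(String recordId) {
    current.remove(recordId);
  }

  /** A complete view of the collection, ready to be exported as a full DwCA. */
  Map<String, Map<String, String>> snapshot() {
    return Map.copyOf(current);
  }
}
```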

peggynewman commented 4 years ago

It's definitely worth reviewing the requirement for deltas.

djtfmartin commented 4 years ago

@peggynewman @RobinaSanderson @nickdos

Some stats for datasets in ALA. These come from the collectory:

| Protocol | Count |
| --- | --- |
| DIGIR | 58 |
| DwCA | 141 |
| DwC | 4471 (765 without merit/biocollect) |
| AutoFeed | 1 |

So we only have one delta data source in use, and that is for the Australian Museum. It uses the legacy delta format (non-DwCA). The museums must have moved over to DwCA exports over time.

The DwC number is high because it references 2078 datasets from biocollect and 1628 from merit both of which are largely empty.

33 datasets are coming in via SFTP uploads, which includes most of the museum and herbaria data. Some of these could be deltas in that they aren't complete exports, just additions.

The DIGIR counts can probably be ignored as we don't harvest from them.

RobinaSanderson commented 4 years ago

I've just talked with @chrisala regarding Biocollect and MERIT. There are just a couple of things to note, but on the whole deltas are not important to them.

  1. There are a heap of biocollect and merit dataset entries that were created automatically, and their data is not able to be shared publicly.
  2. Now Biocollect projects can choose to opt in to share data.
  3. There is work underway to remove the datasets from the collectory that haven't opted in (it would probably be good to do this before we migrate these datasets to the new infrastructure) - I think this is being worked on by Mathilda.
  4. If they want to update the data in a dataset, it is a full import each time. The job is scheduled via Jenkins, with a Jenkins pull from ecodata (Chris wasn't sure what the scheduled timeframe is)
  5. They don't expect performance issues requiring the use of deltas, as the largest biocollect dataset was the Sightings dataset (200K+ records), but this has been superseded by iNaturalist. The next largest is 70K.
  6. Deltas might be something to think about if the biocollect - camera trap integration gets up, but even then they are expecting records in the 10s of thousands

MERIT doesn't yet make records publicly accessible, and Chris doesn't anticipate an issue updating by full import if they do.

So that leaves the 141 DwCA and 765 DwC.

@djtfmartin Can we determine if they are actively using a delta update, or if a full import is done for an update? I'm just wondering how many datasets use this function.

djtfmartin commented 4 years ago

Thanks @RobinaSanderson, I've sent you a message about this on Slack.

peggynewman commented 4 years ago

Executive summary: The pipeline should not process deltas, only full refreshes. Deltas are not great practice for data loads because they add a lot of complexity to debugging, managing history, and QA. Full refreshes remove the requirement for a delete process. This approach gets us closer to a hands-off system. Discuss.

...

@djtfmartin I think that the numbers you've put up above there about deltas don't reflect reality. They look like flags set on the collectory, whereas I don't think those flags are strictly used in Jenkins jobs. I've observed that it is almost standard practice to upload a handful of record updates at a time into a data resource, and biocache-store does an upsert (find the record and update, insert new record if not found). Deletes are more or less manual.

Despite this, deltas have created a lot of problems for us. With index builds that are sometimes unreliable, and UUID matching issues for various reasons, I've had great difficulty getting the same number of records out in SOLR as what I've put into an input file. Deltas add a lot of complexity that makes it hard to debug and fix problems.

If we build something that manages deltas, what would it do? Thinking out loud:

Presuming that the benefit of even having deltas is that we don't have to pipeline-reprocess the whole dataset, I guess we merge the AVROs at the end of the process. The risk there is that the merged datasets were processed at different times, possibly on different code bases. Does the final dataset have the right number of records? It will be (it is now) a nightmare trying to manage how many records a dataset should actually have. Debugging data issues will be a lot more difficult and will almost certainly start with reprocessing a whole dataset end to end before tackling the problem.
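
For what it's worth, a merge of that kind is not much code. The sketch below (assuming plain Avro files of GenericRecords with an "id" field, which is not necessarily the real pipeline schema) is last-write-wins per record id, which is exactly where the "processed at different times, possibly on different code bases" risk creeps in:

```java
import java.io.File;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Rough sketch: merge a base Avro file with a delta Avro file, keyed on an
// assumed "id" field, delta records replacing base records (last write wins).
public class AvroDeltaMerge {

  public static void merge(File baseFile, File deltaFile, File mergedFile) throws IOException {
    Map<String, GenericRecord> byId = new LinkedHashMap<>();
    Schema schema;

    try (DataFileReader<GenericRecord> base =
        new DataFileReader<>(baseFile, new GenericDatumReader<GenericRecord>())) {
      schema = base.getSchema();
      for (GenericRecord r : base) {
        byId.put(r.get("id").toString(), r);
      }
    }

    try (DataFileReader<GenericRecord> delta =
        new DataFileReader<>(deltaFile, new GenericDatumReader<GenericRecord>())) {
      for (GenericRecord r : delta) {
        byId.put(r.get("id").toString(), r); // delta record silently replaces the base record
      }
    }

    try (DataFileWriter<GenericRecord> out =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      out.create(schema, mergedFile);
      for (GenericRecord r : byId.values()) {
        out.append(r);
      }
    }
  }
}
```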

If we want to track what's loaded, processing deltas means we have to keep a systematic log of what we've applied. We don't do that at the moment, although if you wade through the collectory you'll find free text fields full of dates and activity descriptions where folk have seen the need to document this kind of thing. We don't know what happened in any systematic fashion.

A full refresh system doesn't have to explicitly delete records - we reload. If we don't have a full refresh system, then we have to explicitly write something to delete from AVRO files. We have to set up a manual process to map the records' UUIDs and run deletes. The data providers have to explicitly tell us about the records we have to delete, and we have to schedule the work, and confirm that it was successful. Messy.

Yes, I guess all of these things are solvable - we could do the merges, we could keep even a rudimentary log of what we've loaded, we could try to use it to keep track of load counts and record counts, we could flag whether a process is a full refresh or a delta, we could explicitly write delete functionality on the AVRO files. But I think we would end up with a system similar to what we have now - very difficult to debug, very difficult to confirm record counts and QA, with very manual processes to delete records, and ultimately still requiring a devops engineer to debug individual data loads.

If we manage towards full refreshes, where the ingested DwCA always contains the full set of records for the data resource:

The disadvantages I can think of are:

So I propose that we don't process deltas, and close this issue. Nor should we write something to handle deletes. The pipeline should be left to process the entire data resource at a time as per GBIF.

Edited: I'm suggesting that we push the delta and delete work out to the data management team to manage in the DwC-A versions, which I think we'd need a bit of support with:

djtfmartin commented 4 years ago

Thanks @peggynewman. Just to get a better idea of the scale of the issue, what is the number of data resources for which we are actively managing deltas?

@timrobertson100 are there any community-developed tools providing DwCA merge capabilities in use by other GBIF nodes?

timrobertson100 commented 4 years ago

> @timrobertson100 are there any community-developed tools providing DwCA merge capabilities in use by other GBIF nodes?

Not that I am aware of, sorry.

Remember that a DwC-A can only convey edits but not deletions, so whatever format you accept from the providers needs to be more expressive.

At some point, you will suffer drift between your view and the provider's view and will need a process to gain consistency again. A DwC-A would be a good option to provide a checkpoint for that.

peggynewman commented 4 years ago

@djtfmartin I'll have to try and compile a list. But we really want this list anyway.

@timrobertson100 If we retain and manage historical versions of DwC-A, then deletions are expressed as the difference between versions. Since I've been in this role, requests from providers to delete records have been to remove everything and reload.
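
For the cases where deletions do need to be derived, that diff is cheap to compute if full versions are kept: the records to delete are simply the identifiers present in the previous version but absent from the new one. A minimal sketch, assuming the occurrence IDs of each DwC-A version have already been extracted to a plain text file (one ID per line):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Deletions derived as the set difference between two retained DwC-A versions.
public class DeletedIds {

  /** Reads one occurrence ID per line from a pre-extracted ID list. */
  static Set<String> readIds(Path idList) throws IOException {
    try (Stream<String> lines = Files.lines(idList)) {
      return lines.map(String::trim)
          .filter(s -> !s.isEmpty())
          .collect(Collectors.toCollection(HashSet::new));
    }
  }

  /** IDs present in the previous version but missing from the current one. */
  public static Set<String> deletedBetween(Path previousIds, Path currentIds) throws IOException {
    Set<String> deleted = readIds(previousIds);
    deleted.removeAll(readIds(currentIds));
    return deleted;
  }
}
```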

javier-molina commented 3 years ago

A proposed Data Ingestion Change Management project will take care of this and other cases.

We will keep this open until there is more information about the new project.

javier-molina commented 3 years ago

LA Pipelines does not support deltas; the DM team has already developed a merge mechanism to create full DwCAs.

peggynewman commented 3 years ago

The DM team will come up with a way of handling deletes from APIs for some data loads, and then I think we can close this baby. Can't wait.