AtlasOfLivingAustralia / data-management

Data management issue tracking

Plan Data Quality Campaign #371

Open M-Nicholls opened 6 years ago

elywallis commented 6 years ago

A couple of thoughts to start:

djtfmartin commented 6 years ago

Using IBRA/IMCRA is a great idea.

I'm looking closely at GBIF's new pipelines as a potential replacement for our backend ingestion of occurrence data. This is part of a move to more closely align our data processing code with GBIF (essentially using the same codebase, but with ALA extensions).

This would affect the data quality tests we run. Happy to explain more.

elywallis commented 6 years ago

re the pipelines - Nick dos has explained the intent to me briefly. It sounds like a good thing to work towards, but what do you think the timeline is? That is, if the pipelines are unlikely to be implemented for a year or more, I think we still need to run a local data quality campaign in the meantime.

djtfmartin commented 6 years ago

Good question on the timeline. GBIF are actively working on it now.

From ALA's point of view, we'd need to use Pipelines and add extensions for a few ALA-specific things (using the Australian classification, sensitive data processing, sampling of layers, etc.). But the bulk of the Darwin Core interpretation should be common code.
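For a rough picture of what "common code plus ALA extensions" could look like, here is a minimal Java sketch. The interface and class names are illustrative only and are not the actual GBIF Pipelines API.

```java
// Hypothetical sketch only: these types are illustrative, not the real Pipelines API.
import java.util.Map;

/** A single interpretation step applied to a raw Darwin Core record. */
interface InterpretationStep {
    void apply(Map<String, String> verbatim, Map<String, Object> interpreted);
}

/** Shared Darwin Core interpretation: lives in the common GBIF/ALA codebase. */
class CoordinateInterpretation implements InterpretationStep {
    @Override
    public void apply(Map<String, String> verbatim, Map<String, Object> interpreted) {
        // Parse decimalLatitude/decimalLongitude, flag parse failures, etc.
    }
}

/** ALA-specific extension, e.g. matching names against the Australian classification. */
class AustralianTaxonomyMatch implements InterpretationStep {
    @Override
    public void apply(Map<String, String> verbatim, Map<String, Object> interpreted) {
        // Match scientificName against the Australian national checklist
        // instead of (or after) the GBIF backbone.
    }
}
```

The idea would be that shared steps stay in the common codebase while ALA-specific steps are appended at configuration time.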

cc @timrobertson100

timrobertson100 commented 6 years ago

Thanks @djtfmartin for alerting me to this issue.

I'd like to expand on some of the work from the GBIF Secretariat relating to this.

I would suggest we tackle data quality through several approaches concurrently. There are content issues we can address, such as improving taxonomic reference catalogues, while at the same time fixing coding issues (e.g. parsers) and actively engaging data providers to make changes.

We are very aware of publishers' frustration with how data is treated by "aggregators like ALA, GBIF and ..." and we believe addressing this is a top priority. In some cases the criticism is perhaps misplaced (we lack documentation), but in other cases we clearly have much work to do. The presentations at this year's TDWG conference were a good reminder of how widespread this concern is.

Fundamentally, the target must be to clean data at the source as much as possible. If data are clean before being shared on the internet, then subsequent interpretation, flagging, etc. become less necessary.

The GBIF data team are actively approaching publishers to encourage and assist in cleaning up data issues. We use our own issue-flagging services (i.e. the flags on https://www.gbif.org/occurrence/search) to locate problems and then engage over email. We also have the GBIF data validator, which helps locate some issues (though it is not itself without bugs), and when issues are reported through our feedback mechanisms we follow up directly with publishers as quickly as possible. I think it is fair to say the majority of publishers appreciate this.
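As a concrete illustration, those same flags are exposed through the public GBIF occurrence API, so flagged records for a dataset can be counted programmatically. A minimal Java sketch, with a placeholder dataset key:

```java
// Counts records in a dataset carrying a given issue flag via the public
// GBIF occurrence API. The dataset key below is a placeholder.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FlaggedRecordCount {
    public static void main(String[] args) throws Exception {
        String datasetKey = "00000000-0000-0000-0000-000000000000"; // placeholder
        URI uri = URI.create("https://api.gbif.org/v1/occurrence/search"
                + "?datasetKey=" + datasetKey
                + "&issue=TAXON_MATCH_NONE"
                + "&limit=0"); // limit=0 returns just the total count
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(HttpRequest.newBuilder(uri).GET().build(),
                      HttpResponse.BodyHandlers.ofString());
        // The JSON body contains a "count" field with the number of flagged records.
        System.out.println(response.body());
    }
}
```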

Secondly, one of the most frequently reported issues relates to the backbone taxonomy and poor matches, and we are striving towards quicker turnarounds of that taxonomy. When issues are reported, we capture them as "unit tests" for the next iteration of work to ensure we progress them.
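By way of illustration, a captured report might look like the JUnit sketch below; the NameMatch and NameMatcher types are hypothetical stand-ins for the real backbone lookup, not GBIF's actual test harness.

```java
// Illustrative only: a reported bad match captured as a regression test so the
// next backbone build can be verified against it.
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Hypothetical stand-ins for the real backbone lookup service.
record NameMatch(String kingdom) {}

class NameMatcher {
    static NameMatch match(String scientificName) {
        // Stub: the real implementation would query the backbone matching service.
        return new NameMatch("Animalia");
    }
}

class BackboneRegressionTest {
    @Test
    void reportedMismatchStaysFixed() {
        // Hypothetical example: a name that previously matched to the wrong kingdom.
        NameMatch match = NameMatcher.match("Drosophila melanogaster");
        assertEquals("Animalia", match.kingdom());
    }
}
```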

Thirdly, we are working to improve our ingestion pipeline, as Dave points out. This is a fairly large re-architecture (a minimal sketch follows the list below) aiming to:

  1. Simplify the technology stack (Java into Elasticsearch)
  2. Run at any scale, from laptop to server to cluster (GBIF cloud or public cloud)
  3. Allow us to progressively refine the data models to expand the information we can deal with - our existing architecture did not allow this easily
  4. Allow us to reprocess all data (1 billion+ records) in around 12 hours to deploy fixes quickly and respond to issues raised
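To illustrate point 2: GBIF's pipelines work is built on Apache Beam, where the same pipeline code runs on a local DirectRunner or on a cluster runner with only a launch-option change. A minimal sketch follows; the transform body is a placeholder, not the real interpretation code.

```java
// Minimal Apache Beam sketch of "run at any scale": the runner (direct, Spark,
// ...) is selected via --runner=... at launch time without code changes.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class InterpretOccurrences {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadVerbatim", TextIO.read().from("verbatim/*.txt"))
         .apply("Interpret", MapElements.into(TypeDescriptors.strings())
                 .via((String line) -> line.trim())) // placeholder for Darwin Core interpretation
         .apply("WriteInterpreted", TextIO.write().to("interpreted/part"));

        p.run().waitUntilFinish();
    }
}
```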

It is our hope that this eventually forms the codebase that powers the "ALA ingest" command, meaning we can handle data consistently across systems and collaborate in the truest sense on some shared products. We are aware that such a goal will take time, compromise, and changes to our approaches.

We have just got the first end-to-end proof of concept of this pipeline running and can process all GBIF data into an Elasticsearch index in under 12 hours; however, this uses the same interpretation and flagging as the current GBIF.org system.

The next step for us will be to document a specification for how we interpret all fields, which we will start next week and for which we will solicit comments. This will include how we (will) change records and what flags we will apply (taking note of the TDWG DQ outcomes), and will define the output formats we expect to offer. Much of this is documenting what we (and you) already do, but we will also ensure it covers all the known interpretation issues we have logged. It will then be implemented in the pipeline project.

I would expect this whole process to be largely complete for a public demo (not in production) by the end of 2018. This will include a test site with revisions to the https://www.gbif.org/occurrence/search interface. We expect to iterate over this during the testing phase (early 2019) before it moves to production.
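To make the "how we (will) change records, what flags we will apply" part concrete, here is a minimal sketch of the verbatim-plus-interpreted-plus-flags pattern. The class names are illustrative, and while the flag strings mirror existing GBIF issue names, the final specification may differ.

```java
// Sketch of the pattern the specification will describe: keep the verbatim
// value, record the interpreted value, and attach a flag whenever the record
// was changed or could not be interpreted.
import java.util.ArrayList;
import java.util.List;

class InterpretedField<T> {
    final String verbatim;
    final T interpreted;               // null when interpretation failed
    final List<String> flags = new ArrayList<>();

    InterpretedField(String verbatim, T interpreted) {
        this.verbatim = verbatim;
        this.interpreted = interpreted;
    }
}

class CoordinateInterpreter {
    /** Parses a verbatim latitude, flagging values outside [-90, 90]. */
    static InterpretedField<Double> interpretLatitude(String verbatim) {
        try {
            double lat = Double.parseDouble(verbatim.trim());
            InterpretedField<Double> field = new InterpretedField<>(verbatim, lat);
            if (lat < -90 || lat > 90) {
                field.flags.add("COORDINATE_OUT_OF_RANGE");
            }
            return field;
        } catch (NumberFormatException e) {
            InterpretedField<Double> field = new InterpretedField<>(verbatim, null);
            field.flags.add("COORDINATE_INVALID");
            return field;
        }
    }
}
```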

I hope this helps provide some context and background information. We are working with the same community of publishers and with the same data and would very much like to see our efforts converge.

nickdos commented 5 years ago

Not sure it's strictly DQ, but it would be nice to have an admin and "flag an issue" interface on species pages. This came up when discussing how to handle DOI links for taxonomic descriptions on the Names tab (work by Rod Page). We regularly receive support enquiries from users telling us about an error or an incorrect piece of information on species pages, but there is no quick or easy way to fix these (they have to be handled in code plus a re-index, images being an exception). By being able to fix such issues quickly, the perception of DQ can be managed better, I think.

See https://github.com/AtlasOfLivingAustralia/data-management/issues/501#issuecomment-523242683