M-Nicholls opened 6 years ago
Using IBRA/IMCRA is a great idea.
I'm looking closely at GBIF's new pipelines for potentially replacing our backend ingestion of occurrence data. This is part of a move to more closely align our data processing code with GBIF (essentially using the same codebase, but with ALA extensions).
This would affect the data quality tests we run. Happy to explain more.
Re the pipelines - Nick dos has explained the intent to me briefly. It sounds like a good thing to work towards, but what do you think the timeline is? If the pipelines are not likely to be implemented for a year or more, I think we still need to run a local data quality campaign in the meantime.
Good question on the timeline. GBIF are actively working on it now....
From ALA's point of view, we'd need to use Pipelines and add extensions for a few things ALA does (using the Australian classification, sensitive data processing, sampling of layers, etc.). But the bulk of the Darwin Core interpretation should be common code.
cc @timrobertson100
Thanks @djtfmartin for alerting me to this issue.
I'd like to expand on some of the work from the GBIF Secretariat relating to this.
I would suggest we tackle data quality through several approaches concurrently. There are content issues we can address, such as improving taxonomic reference catalogues, while at the same time fixing coding issues (e.g. parsers) and actively engaging data providers to make changes.
We are very aware of the annoyance among publishers at how their data is treated by "aggregators like ALA, GBIF and ..." and we believe addressing this is a top priority. In some cases the criticism is perhaps misplaced (we lack documentation), but in other cases we clearly have much work to do. The presentations at the TDWG conference this year were a good reminder of how widespread this concern is.
Fundamentally, the target must be to clean data at source as much as possible. If data are clean before being shared on the internet, subsequent interpretation, flagging, etc. become less necessary.
GBIF's data team are actively approaching publishers to encourage and assist in cleaning up data issues. We use our own issue flagging services (i.e. the flags on https://www.gbif.org/occurrence/search) to locate problems and then engage over email. We also have the GBIF data validator, which helps locate some issues (though it is not itself without bugs). And we follow up directly with publishers as quickly as possible when issues are reported through our feedback mechanisms. I think it is fair to say the majority of publishers appreciate this.
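Those same issue flags are queryable through GBIF's public occurrence search API, which is one way to locate flagged records programmatically. A minimal sketch (the `datasetKey` parameter is optional; the helper names here are my own, not part of any GBIF client library):

```python
# Sketch: counting occurrences that carry a given interpretation flag,
# via GBIF's public occurrence search API (the same flags shown on
# https://www.gbif.org/occurrence/search).
import json
import urllib.parse
import urllib.request

def build_search_url(issue, dataset_key=None):
    """Build an occurrence-search URL filtered by an issue flag."""
    params = {"issue": issue, "limit": 0}  # limit=0: we only want the count
    if dataset_key:
        params["datasetKey"] = dataset_key
    return "https://api.gbif.org/v1/occurrence/search?" + urllib.parse.urlencode(params)

def count_flagged(issue, dataset_key=None):
    """Return the number of occurrences carrying the given issue flag."""
    with urllib.request.urlopen(build_search_url(issue, dataset_key)) as resp:
        return json.load(resp)["count"]

# e.g. count_flagged("TAXON_MATCH_NONE") for records with no backbone match
```

From there, the flagged records themselves (with `limit` raised above 0) give the data team concrete examples to raise with the publisher.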
Secondly, one of the most frequently reported issues relates to the backbone taxonomy and poor matches, and we are striving towards quicker turnarounds of that taxonomy. When issues are reported, we capture them as "unit tests" for the next iteration of work to ensure we progress them.
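The "captured as unit tests" idea can be sketched as a regression suite: each reported bad match becomes a case that is re-run against the next backbone build. Everything below (the `match_name` stub, the case list) is a hypothetical stand-in for the real matching service:

```python
# Each reported backbone-matching problem becomes a regression case.
# match_name here is a toy stand-in that just strips authorship; the
# real matcher would call the backbone name-matching service.
REPORTED_CASES = [
    # (verbatim name as published, expected canonical name after the fix)
    ("Acacia dealbata Link", "Acacia dealbata"),
    ("Vombatus ursinus (Shaw, 1800)", "Vombatus ursinus"),
]

def match_name(verbatim):
    """Hypothetical matcher: keep only the genus + species epithet."""
    parts = verbatim.split()
    return " ".join(parts[:2])

def run_regressions():
    """Return the cases that still fail: (verbatim, expected, actual)."""
    return [(v, e, match_name(v))
            for v, e in REPORTED_CASES
            if match_name(v) != e]

# An empty result means every reported issue is fixed in this iteration.
```

The value is less in the code than in the discipline: a reported issue is never closed by hand, only by a passing case.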
Thirdly, we are working to improve our ingestion pipeline, as Dave points out. This is a fairly large re-architecture aiming to:
It is our hope that this eventually forms the codebase that powers the "ALA ingest" command, meaning we can handle data consistently across systems and collaborate in the truest sense on some shared products. We are aware that such a goal will take time, compromise, and changes to our approaches.
We have just got the first end-to-end proof of concept of this pipeline running and can process all GBIF data into an Elasticsearch index in under 12 hours; this uses the same interpretation and flagging as the current GBIF.org system, however.

The next step for us will be to document a specification for how we interpret all fields, which we will start next week and for which we will solicit comments. It will cover how we (will) change records, what flags we will apply (taking note of the TDWG DQ outcomes), and the output formats we expect to offer. Much of this is documenting what we (and you) already do, but we will also ensure it covers all the known interpretation issues we have logged. It will then be implemented in the pipeline project.

I would expect this whole process to be largely complete for public demo (not in production) by end 2018, including a test site with revisions to the https://www.gbif.org/occurrence/search interface. We expect to iterate over this during the testing phase (early 2019) before it moves to production.
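To make the "interpret all fields, apply flags" idea concrete, here is a minimal sketch of per-field interpretation for one field. The flag names mirror GBIF's issue vocabulary, but the function itself is illustrative, not the real pipeline code:

```python
# Minimal sketch of field interpretation with flagging: take a verbatim
# decimalLatitude string, return an interpreted value plus any issue flags.
def interpret_latitude(verbatim):
    """Return (value, flags) for a verbatim decimalLatitude string."""
    flags = []
    try:
        value = float(verbatim.replace(",", "."))  # tolerate decimal commas
    except (ValueError, AttributeError):
        return None, ["COORDINATE_INVALID"]       # unparseable input
    if not -90.0 <= value <= 90.0:
        return None, ["COORDINATE_OUT_OF_RANGE"]  # parseable but impossible
    if value == 0.0:
        flags.append("ZERO_COORDINATE")           # suspicious, kept anyway
    return value, flags
```

The specification would pin down exactly this kind of behaviour for every field: which inputs are repaired, which are rejected, and which flags each outcome carries.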
I hope this helps provide some context and background information. We are working with the same community of publishers and with the same data and would very much like to see our efforts converge.
I'm not sure it's strictly DQ, but it would be nice to have an admin and "flag an issue" interface on species pages. This came up when discussing how to handle DOI links for taxonomic descriptions on the Names tab (work by Rod Page). We regularly receive support enquiries from users telling us about an error or an incorrect bit of information on species pages, but there is no quick or easy way to fix these (it has to be handled in code and a re-index, images being an exception). Being able to quickly fix such issues would, I think, help manage the perception of DQ.
See https://github.com/AtlasOfLivingAustralia/data-management/issues/501#issuecomment-523242683
A couple of thoughts to start: