gbif / ipt

GBIF Integrated Publishing Toolkit (IPT)
https://www.gbif.org/ipt
Apache License 2.0

Pipelines data validator integration #1635

Open fmendezh opened 3 years ago

fmendezh commented 3 years ago

Integrating the IPT with the Data Validator can help publishers improve their data before publishing it to GBIF. The Data Validator exposes an API that is consistent with the running data ingestion platform, and that API provides the services needed to validate Occurrence, Checklist, and Metadata-only datasets.

Basic functionality

  1. Once a dataset/resource contains the desired metadata and its data has been uploaded or mapped, the user wants to validate it before publishing it to GBIF.
  2. The IPT generates a DwC-A in a staging location accessible through an external URL and asks the Data Validator API to validate it.
    • This can also be accomplished by using the Validator API to upload a file.
      • The authentication method must respect the procedures already implemented in the IPT.
  3. The Data Validator starts the validation process and returns a validation key for the requested archive, which is then used to track its progress.
  4. Upon successful validation, the IPT should allow the user to publish the resource into GBIF.
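
The steps above could be sketched roughly as follows. This is only an illustration of the flow (submit the staged archive URL, keep the returned validation key, poll until a terminal state, then decide whether publishing is allowed). The `ValidatorClient` interface, its method names, and the status strings are hypothetical placeholders, not the real Validator API, and `StubValidator` stands in for the HTTP calls so the sketch runs without a network.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Protocol

# Hypothetical status names; the real Data Validator may use different ones.
TERMINAL_STATUSES = {"FINISHED", "FAILED", "ABORTED"}


class ValidatorClient(Protocol):
    """Assumed minimal view of the Data Validator API (names are illustrative)."""

    def submit_url(self, archive_url: str) -> str:
        """Request validation of the archive at archive_url; return its key."""

    def get_status(self, key: str) -> str:
        """Return the current status of the validation identified by key."""


@dataclass
class ValidationOutcome:
    key: str
    status: str

    @property
    def publishable(self) -> bool:
        # Step 4: publishing is allowed only after a successful validation.
        return self.status == "FINISHED"


def validate_staged_archive(
    client: ValidatorClient,
    archive_url: str,
    poll_interval: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> ValidationOutcome:
    """Submit a staged DwC-A and poll until the validation reaches a terminal state."""
    key = client.submit_url(archive_url)  # steps 2 and 3: request and receive the key
    for _ in range(max_polls):
        status = client.get_status(key)
        if status in TERMINAL_STATUSES:
            return ValidationOutcome(key=key, status=status)
        sleep(poll_interval)
    raise TimeoutError(f"validation {key} did not finish after {max_polls} polls")


class StubValidator:
    """In-memory stand-in that replays a fixed sequence of statuses."""

    def __init__(self, statuses: List[str]) -> None:
        self._statuses = list(statuses)
        self.submitted: List[str] = []

    def submit_url(self, archive_url: str) -> str:
        self.submitted.append(archive_url)
        return "stub-key"

    def get_status(self, key: str) -> str:
        # Replay statuses in order; repeat the last one once the list is exhausted.
        return self._statuses.pop(0) if len(self._statuses) > 1 else self._statuses[0]
```

In the IPT, the "Publish" action would presumably be enabled only when `publishable` is true. The upload variant mentioned in step 2 would replace `submit_url` with a file upload, while the rest of the flow stays the same.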

Additional considerations

  1. The IPT must provide a way to track the validation progress of an individual resource.
  2. Concurrent validation requests for the same resource must be prevented by allowing only one validation to run at a time per resource. The Data Validator already imposes a suggested maximum number of validations a single user can run in parallel.
  3. Once validation has finished, the IPT must delete all temporary files and elements it created.
  4. The IPT should not need to store anything other than the identifiers of the validations executed for each resource. A specific endpoint for IPT validation could also be considered, to relieve the IPT of storing additional data.
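
One possible way to satisfy considerations 2 to 4 is sketched below. It is a minimal sketch, not a proposal for the actual IPT code: `ValidationTracker`, `ValidationAlreadyRunning`, and the `submit` callback are hypothetical names. The tracker allows one running validation per resource, removes the staging directory when the validation finishes, and keeps only the validation keys.

```python
import shutil
import tempfile
import threading
from pathlib import Path
from typing import Callable, Dict, List


class ValidationAlreadyRunning(Exception):
    """Raised when a resource already has a validation in progress."""


class ValidationTracker:
    """Tracks at most one running validation per resource (illustrative only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._running: Dict[str, Path] = {}   # resource -> staging directory
        self._keys: Dict[str, List[str]] = {}  # resource -> validation keys

    def start(self, resource: str, submit: Callable[[Path], str]) -> str:
        """Reserve the resource, create a staging directory, and submit it.

        `submit` is a hypothetical callback that writes the archive into the
        staging directory, calls the validator, and returns the validation key.
        """
        with self._lock:
            if resource in self._running:
                raise ValidationAlreadyRunning(resource)
            staging = Path(tempfile.mkdtemp(prefix="ipt-validation-"))
            self._running[resource] = staging
        try:
            key = submit(staging)
        except Exception:
            self.finish(resource)  # do not leave temporary files behind on failure
            raise
        with self._lock:
            self._keys.setdefault(resource, []).append(key)
        return key

    def finish(self, resource: str) -> None:
        """Release the resource and delete its temporary files."""
        with self._lock:
            staging = self._running.pop(resource, None)
        if staging is not None:
            shutil.rmtree(staging, ignore_errors=True)

    def keys_for(self, resource: str) -> List[str]:
        """Return the validation keys recorded for a resource."""
        with self._lock:
            return list(self._keys.get(resource, []))
```

The only state that survives a validation here is the list of keys, which matches the idea that progress and results can always be fetched again from the Data Validator by key.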

spalp commented 2 months ago

Wow, thanks to @ckotwn, I just became aware of this incredibly useful feature. I cannot wait to see it in production. Meanwhile, I added a step for publishers to the documentation, suggesting that they manually check their data using the IPT. I hope the commit makes sense: https://github.com/gbif/ipt/compare/master...spalp:ipt:patch-2