gbif / ipt

GBIF Integrated Publishing Toolkit (IPT)
https://www.gbif.org/ipt
Apache License 2.0

Allow auto-publication to abort when there are (far) fewer records #2133

Open peterdesmet opened 11 months ago

peterdesmet commented 11 months ago

Source files via URL + auto-publication is very useful for automatically publishing an active dataset. We use it for e.g. the following citizen science dataset: https://ipt.inbo.be/resource?r=dieren-planten-natuurpunt-occurrences

It would be useful, however, if the IPT offered some options for aborting auto-publication. The dataset above, for example, had an issue in the pipeline that resulted in far fewer records in the source file, which in turn caused the (unintentional) deletion of many records at GBIF.org. It would have been nice if the IPT could detect this and abort the auto-publication.

mike-podolskiy90 commented 11 months ago

Thanks Peter for the suggestion. It sounds like a very sensible and useful feature.

mike-podolskiy90 commented 11 months ago

But what would the threshold for a drop in records be? Should it be a percentage or an absolute number?

peterdesmet commented 11 months ago

I suggest a hardcoded threshold of 90% (i.e. abort when the record count falls below 90% of the previously published count), offered as an optional setting when setting up auto-publication. That also leaves room for other options without making it too complicated. Some of these checks should probably not be optional (e.g. source data are missing) and should always result in an error.

Enable auto-publication

  • [x] Abort when the number of records has dropped by 10%
  • [ ] Abort when mapped fields are missing in source data
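
The two checkboxes above could be evaluated roughly as follows. This is only a minimal Python sketch of the idea, not IPT code: `AutoPublishChecks`, `record_drop_fraction` and `abort_reason` are hypothetical names, and the 10% default simply mirrors the suggestion in this comment.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class AutoPublishChecks:
    """Hypothetical per-resource settings; these are not real IPT options."""
    abort_on_record_drop: bool = True
    max_drop_fraction: float = 0.10  # abort once 10% or more of the records are gone
    abort_on_missing_fields: bool = False


def record_drop_fraction(previous_count: int, new_count: int) -> float:
    """Fraction of records lost since the last published version (0.0 if none)."""
    if previous_count <= 0:
        return 0.0  # nothing was published before, so nothing can have dropped
    return max(0.0, (previous_count - new_count) / previous_count)


def abort_reason(
    checks: AutoPublishChecks,
    previous_count: int,
    new_count: int,
    mapped_fields: Sequence[str],
    source_fields: Sequence[str],
) -> Optional[str]:
    """Return a human-readable reason to abort, or None if publication may proceed."""
    if checks.abort_on_record_drop:
        drop = record_drop_fraction(previous_count, new_count)
        if drop >= checks.max_drop_fraction:
            return f"record count dropped by {drop:.0%} ({previous_count} -> {new_count})"
    if checks.abort_on_missing_fields:
        # Mapped fields that no longer appear in the source file
        missing = sorted(set(mapped_fields) - set(source_fields))
        if missing:
            return "mapped fields missing in source data: " + ", ".join(missing)
    return None
```

A scheduler would call `abort_reason(...)` just before publishing a new version and skip publication whenever it returns a string. An increase in records never triggers the first check.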

dshorthouse commented 5 months ago

+1 for support. It would help prevent downstream snafus. The only issue I see here is the secondary need for notification of the abort(s) from the IPT, otherwise an affected dataset may sleep indefinitely in purgatory.

mike-podolskiy90 commented 5 months ago

Thanks @dshorthouse. Email notification might be a very good idea here.

MattBlissett commented 5 months ago

Email notification would require additional configuration by the administrator — currently the IPT doesn't send any emails.

Having this within the IPT would avoid bad data being published, but having it detected by GBIF would allow easier email notifications and the helpdesk could be involved.

mike-podolskiy90 commented 5 months ago

@MattBlissett Actually, the IPT does send emails, not directly but via the Registry. There is an option "Click here to contact organisation" and there is a link to send an organization token/password reminder. So we can probably implement that similarly.

dbloom commented 5 months ago

Having just been through this with @dshorthouse, I agree that an email notification would be very helpful, especially for publications that are initiated on an automated schedule (I may not see an issue for days otherwise). With nearly 180 resources publishing on a schedule, knowing that an event was aborted or that the number of records was (significantly) reduced, or both, would be very helpful. I also think it is important to be able to configure who receives these messages from within the IPT. The VertNet IPT, for example, has several admins, but not all of them need to, or should, receive notices like this.
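
To make the "configure who receives these messages" idea concrete, here is a rough Python sketch under stated assumptions. The `Admin` class, its `notify_on_abort` opt-in flag and the `build_abort_notice` helper are all hypothetical and not part of the IPT. The sketch only builds a message and does not send it, since how (or whether) the IPT would deliver email is still an open question in this thread.

```python
from dataclasses import dataclass
from email.message import EmailMessage
from typing import List, Optional


@dataclass
class Admin:
    """Hypothetical IPT administrator record."""
    email: str
    notify_on_abort: bool = False  # assumed opt-in flag, off by default


def abort_recipients(admins: List[Admin]) -> List[str]:
    """Only admins who opted in receive abort notices."""
    return [a.email for a in admins if a.notify_on_abort]


def build_abort_notice(
    resource: str, reason: str, admins: List[Admin], sender: str
) -> Optional[EmailMessage]:
    """Build (but do not send) a notice; return None if nobody opted in."""
    recipients = abort_recipients(admins)
    if not recipients:
        return None
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = f"Auto-publication aborted: {resource}"
    msg.set_content(
        f"Auto-publication of resource '{resource}' was aborted.\n"
        f"Reason: {reason}\n"
    )
    return msg
```

Keeping the opt-in on the administrator rather than on the resource is just one possible design; a per-resource contact list would work equally well for an installation with many scheduled resources.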

MattBlissett commented 5 months ago

> @MattBlissett Actually, the IPT does send emails, not directly but via the Registry. There is an option "Click here to contact organisation" and there is a link to send an organization token/password reminder. So we can probably implement that similarly.

I'd be reluctant for us to send emails triggered by systems (external IPTs) which we do not control. We could have IPTs with resources that are broken for months emailing users who don't want those emails (e.g. no longer work on the resource), and that risks GBIF's systems being considered spammy by Google, Microsoft etc.

TBC.