gbif / ipt

GBIF Integrated Publishing Toolkit (IPT)
https://www.gbif.org/ipt
Apache License 2.0

Feature or API to update EML data for multiple datasets in cloud IPTs #1977

Open CecSve opened 1 year ago

CecSve commented 1 year ago

The help desk occasionally gets requests from publishers to be able to update the EML files for multiple datasets at once, either programmatically or through a new feature. This issue may be related to https://github.com/gbif/ipt/issues/1973 but does not concern cloud IPTs.

Below are two concrete examples of the need for the feature/IPT API:

  1. We recently were asked to add a data manager as a contact to ~30 datasets that fell under the purview of a US federal program. The solution was to programmatically download and update the EML locally (a sketch of that step follows this list). However, the final step was a manual upload and then a manual publication of each dataset through the IPT.
  2. We recently updated a link to a project website for about 20 datasets. The solution was the same as before: download and update locally with a script, then upload and publish manually.
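
For illustration, a minimal sketch of the "download and update locally" step in Python. The IPT host, resource shortnames, and contact details are placeholders, and it assumes the IPT's public EML endpoint at `/eml.do?r=<shortname>`; the upload and publication steps still happen manually in the IPT.

```python
# Sketch: fetch the EML for each resource from the IPT, append a contact,
# and write the edited EML locally for review before manual upload/publish.
import xml.etree.ElementTree as ET
import requests

IPT_HOST = "https://ipt.example.org"      # placeholder IPT host
SHORTNAMES = ["dataset_a", "dataset_b"]   # placeholder resource shortnames

NEW_CONTACT = {
    "givenName": "Jane",
    "surName": "Doe",
    "positionName": "Data Manager",
    "electronicMailAddress": "jane.doe@example.org",
}

def add_contact(eml_xml: bytes) -> bytes:
    """Append a <contact> element to the <dataset> element of an EML document."""
    root = ET.fromstring(eml_xml)
    # In IPT-generated EML only the root element carries the eml: prefix,
    # so the dataset element can be found by its unqualified name.
    dataset = root.find("dataset")
    contact = ET.SubElement(dataset, "contact")
    individual = ET.SubElement(contact, "individualName")
    ET.SubElement(individual, "givenName").text = NEW_CONTACT["givenName"]
    ET.SubElement(individual, "surName").text = NEW_CONTACT["surName"]
    ET.SubElement(contact, "positionName").text = NEW_CONTACT["positionName"]
    ET.SubElement(contact, "electronicMailAddress").text = NEW_CONTACT["electronicMailAddress"]
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)

for shortname in SHORTNAMES:
    resp = requests.get(f"{IPT_HOST}/eml.do", params={"r": shortname}, timeout=30)
    resp.raise_for_status()
    with open(f"{shortname}_eml.xml", "wb") as fh:
        fh.write(add_contact(resp.content))
```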

I anticipate more of this happening as data accrues, so if we could gain programmatic access to the IPT, these updates would be a little less time-consuming.

It may be worth considering either:

  1. implementing a feature in the IPT to bulk-edit metadata for selected datasets, or
  2. providing managed access through an IPT API to programmatically change metadata for selected datasets.

mike-podolskiy90 commented 1 year ago

@CecSve Thank you, Cecilie. There is something to think about here; I have no idea how to implement it yet.

peterdesmet commented 1 year ago

Having to update the metadata of multiple datasets at once is a pretty common use case for me as well. I typically do this by opening the datasets in different tabs, editing them and publishing them. It would be nice to see this supported in a bulk feature, but I'm not sure what it would look like. I think what I mostly want is to be able to update metadata without having to republish the datasets afterwards (cf. how it works in Zenodo), as that avoids having to republish for minor corrections, typos, etc.

ckotwn commented 1 year ago

I can see this being a productivity booster for organisations that manage their own dataset metadata and use the IPT as the gateway for publishing data to the GBIF Network. When the source metadata of many datasets receives new information, all the updates usually have to be made manually through the IPT UI.

If the IPT had some kind of API that machines could interact with, a script could easily save a great deal of time. I am keen to see such an interface created.

For implementation, a very rough idea could be to allow the IPT to receive a POST call with an EML file as the payload; the IPT could then validate it, diff it against the current metadata, and ask for manager confirmation. It may of course be more complicated than this.
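
A purely hypothetical sketch of what such a call could look like from a script: the endpoint, parameters, and token-based authentication below are invented, and nothing like this exists in the IPT today.

```python
# Hypothetical client side of the idea above: POST an updated EML document to
# an IPT endpoint and let the IPT validate it, diff it, and hold the change
# for manager confirmation. URL, auth, and response codes are all assumptions.
import requests

IPT_HOST = "https://ipt.example.org"   # placeholder IPT host
SHORTNAME = "dataset_a"                # placeholder resource shortname

with open("dataset_a_eml.xml", "rb") as fh:
    resp = requests.post(
        f"{IPT_HOST}/api/resource/{SHORTNAME}/eml",   # hypothetical endpoint
        data=fh.read(),
        headers={
            "Content-Type": "application/xml",
            "Authorization": "Bearer <token>",        # hypothetical auth
        },
        timeout=30,
    )
resp.raise_for_status()
# e.g. a 202 Accepted could indicate the update is awaiting manager confirmation
print(resp.status_code)
```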

dbloom commented 1 year ago

I like where @peterdesmet is headed in regard to NOT having to republish every time a minor detail is updated in the metadata.

Speaking for myself, I have had to update metadata in multiple datasets under the same publisher several times a year. At least once a year (in the case of Arctos) I also have to update metadata across multiple publishers. These changes are usually uniform across datasets (e.g., removing one contact person and replacing them with a new one). I do this just as Peter described above: by opening separate tabs, updating each dataset individually and then republishing. I don't know how this process could be streamlined, but it would be amazing to be able to make changes across more than one EML at a time. Perhaps @ckotwn is onto something with the POST call, but my experience with such things is very limited. I would be happy to learn a new skill, however...

jdpye commented 1 year ago

I've got an interesting parallel use case here. OTN maintains a constantly growing source database of machine-derived animal presence data that will be the origin for hundreds of resource-level entries in an IPT. To publish everything equally well, I've created templates for eml.xml and meta.xml that each project's data fills in, and built extraction scripts that produce Occurrence Core or Event Core with Occurrence extension filesets for each project, depending on whether they are deploying tags, or tags and monitoring equipment. Zipping these up creates a valid (thank you, GBIF Validator!) DwC-A file that I can use to create new resources on an IPT, with every piece of metadata and mapping falling nicely into place. What a time-saver!
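
As a rough sketch of that packaging step, assuming the templated eml.xml, meta.xml, and the extracted core/extension text files sit in one directory and go at the archive root as in a standard DwC-A; file and directory names are placeholders.

```python
# Sketch: zip a templated eml.xml, meta.xml, and data files into a Darwin Core
# Archive that can be uploaded to the IPT as a new resource.
import zipfile
from pathlib import Path

def build_dwca(project_dir: Path, archive_path: Path) -> None:
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for filename in ["eml.xml", "meta.xml", "event.txt", "occurrence.txt"]:
            source = project_dir / filename
            if source.exists():   # extension files are optional per project
                zf.write(source, arcname=filename)

build_dwca(Path("projects/project_001"), Path("dwca_project_001.zip"))
```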

But the next time I go to update all of these DwC-As, I can't use a fully packaged DwC-A file to update an existing resource with a new eml.xml and potentially even small changes to meta.xml.

What I'd like to do is be able to drop in my updated DwC-As and have everything re-map and repopulate just as it did the first time I created the resource, but without having to delete anything.