gbif / ipt

GBIF Integrated Publishing Toolkit (IPT)
https://www.gbif.org/ipt
Apache License 2.0

Data directory backups from IPT front end? #1770

Open rukayaj opened 2 years ago

rukayaj commented 2 years ago

With the uncertainty we had regarding the safety of the data on the Ukrainian IPTs, I was wondering whether it would make sense to have an easy backup mechanism: one button that you could push, which would zip the entire data directory (or maybe just the published datasets in the resource folder) and serve it up as a download. Maybe it could even be an endpoint in the IPT, like https://ipt.gbif.no/backup, to make it easier for people to write scripts to fetch backups for a list of IPTs - which is what we're thinking about doing for all Norwegian IPTs.
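As a rough illustration of the "zip the data directory" step, here is a minimal sketch in Python. It is only meant to show the idea: the IPT itself is a Java application, and the function name `zip_directory` and its parameters are hypothetical, not part of any existing IPT API.

```python
import zipfile
from pathlib import Path


def zip_directory(src_dir: str, dest_zip: str) -> int:
    """Zip every file under src_dir into dest_zip; return the number of files added.

    Hypothetical helper for illustration only, not an IPT function.
    """
    src = Path(src_dir)
    dest = Path(dest_zip).resolve()
    count = 0
    with zipfile.ZipFile(dest, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            # Skip directories, and skip the archive itself if it lives inside src_dir.
            if path.is_file() and path.resolve() != dest:
                # Store paths relative to src_dir so the archive unpacks cleanly.
                zf.write(path, arcname=path.relative_to(src).as_posix())
                count += 1
    return count
```

A real endpoint would also need to stream the result and deal with files that change while the archive is being written; neither is handled here.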

I suppose there could be two kinds of backup: a 'public' one, which only backs up published datasets, and an admin one, triggered by a button in the administrative section, which backs up the entire data dir.

Anyway, I think this would certainly be useful for those who don't have access to their IPT servers and have no easy way to back things up. If this seems like a sensible idea I would be happy to work on a PR for someone to review.

mike-podolskiy90 commented 2 years ago

Thank you @rukayaj, I think this might actually be a useful feature.

rukayaj commented 2 years ago

@mike-podolskiy90 I had a little look at this and I'm starting to wonder if it is actually a good idea. In our IPT the data dir is ~15 GB, which is too big (I would say) to zip up for download.

Maybe the backup should just consist of the latest files in each public dataset folder, plus the source files (if there are any and it's not a database dataset)? Even then it could be big, so it should probably generate and store a set of split zip files, with a page where you can click to download them or request them in some way.
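To make the "split zip files" idea concrete, here is a minimal Python sketch under some loud assumptions: it writes several independent zip files, each capped by total uncompressed size, rather than a true multi-volume archive, and the names `plan_parts`, `write_parts` and the `backup-partNNN.zip` naming are all hypothetical. Choosing which files count as "latest" is left to the caller.

```python
import zipfile
from pathlib import Path
from typing import Iterable, List


def plan_parts(files: Iterable[Path], max_bytes: int) -> List[List[Path]]:
    """Greedily group files, in order, so each group totals at most max_bytes.

    Sizes are uncompressed. A single file larger than max_bytes gets a group
    of its own rather than being split.
    """
    parts: List[List[Path]] = []
    current: List[Path] = []
    current_size = 0
    for f in files:
        size = f.stat().st_size
        # Start a new part when adding this file would exceed the cap.
        if current and current_size + size > max_bytes:
            parts.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += size
    if current:
        parts.append(current)
    return parts


def write_parts(root: Path, parts: List[List[Path]], out_dir: Path,
                stem: str = "backup") -> List[Path]:
    """Write each group to its own zip (hypothetical naming: <stem>-partNNN.zip)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for i, group in enumerate(parts, start=1):
        target = out_dir / f"{stem}-part{i:03d}.zip"
        with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            for f in group:
                # Keep paths relative to root so every part unpacks into the same tree.
                zf.write(f, arcname=f.relative_to(root).as_posix())
        written.append(target)
    return written
```

Because each part is a self-contained zip, a download page could list them individually and a script could fetch and unpack them one at a time.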

If it starts to work like this then it's getting fiddly to interact with, which makes me wonder if it's really going to be that useful. What do you think, and do you have any ideas for how it could work well?

mike-podolskiy90 commented 2 years ago

I think this should be configurable and flexible, so that users decide how much to back up. For big or relatively big resources it could then be configured to back up only a minimal set of files. I don't have any specific ideas on this though.

rukayaj commented 2 years ago

Hmm ok. Then to cover our use case it would need a mechanism for selecting what to back up, which can work programmatically as well as via a UI. I think a good first step would be to make something which backs up everything for smaller IPTs, which will cover most cases, and then add to it afterwards. I'm still interested in working on this, and I'll be able to take a look at it the week after next.
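One way a selection mechanism could serve both a script and a UI is to have both build the same small request object. The Python sketch below is purely illustrative: `BackupRequest`, `select_folders` and the scope names are hypothetical, and the test for "published" (a `dwca-*.zip` file in the resource folder) is an assumption about the data directory layout, not something confirmed in this thread.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class BackupRequest:
    """Hypothetical request shape; could be filled from query parameters or a form."""
    scope: str = "published"                   # "published" or "all"
    resources: Optional[Sequence[str]] = None  # restrict to these folder names, if given


def select_folders(resources_dir: Path, request: BackupRequest) -> List[Path]:
    """Return the resource folders that the request asks to back up."""
    if request.scope not in ("published", "all"):
        raise ValueError(f"unknown scope: {request.scope!r}")
    wanted = set(request.resources) if request.resources is not None else None
    selected: List[Path] = []
    for folder in sorted(p for p in resources_dir.iterdir() if p.is_dir()):
        if wanted is not None and folder.name not in wanted:
            continue
        # ASSUMPTION: a published resource has at least one dwca-*.zip in its folder.
        if request.scope == "published" and not any(folder.glob("dwca-*.zip")):
            continue
        selected.append(folder)
    return selected
```

With the defaults, `select_folders(resources_dir, BackupRequest())` would pick every folder that looks published, which matches the "back up everything for smaller IPTs" starting point; narrower selections only need different field values.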