gbif-norway / helpdesk


Data backups and dwca archival #83

Open rukayaj opened 2 years ago

rukayaj commented 2 years ago

I suggest we have a test run where we pretend our servers have burned down or something, and we try to reconstruct all our different services.

Technical information about the UiO servers: https://www.uio.no/tjenester/it/hosting/servere/beskrivelse.html

Ability to back up content in cloud services such as MS Azure, MS365, Google and AWS.

Backup as a service: https://www.uio.no/tjenester/it/hosting/baas/ <- I think we may have this, but I have never tried to retrieve data from it. I will send them an email and ask.

rukayaj commented 2 years ago

When it comes to backing up all the other Norwegian publishers' IPTs - I think we should get in touch and ask them what their current disaster management backup practices are, and whether they would like this dwca archival thing as a service.

For Oleh and Ukraine we should also see what he actually wants, but my feeling is that backing up your own data is part of the responsibility of hosting an IPT, and if you aren't able to do that we should help with the IPT hosting instead.

dagendresen commented 2 years ago

If the storage space is not impossible to handle, I think maybe we should simply go ahead and create a backup routine for all the Norwegian IPTs -- no need to ask?

I agree that backup is part of the responsibility when hosting your own IPT. But I also think that a second or third (?) safe backup of datasets could be argued to be a relevant responsibility for a national GBIF node... ;-)

rukayaj commented 2 years ago

Well, obviously it is not impossible, but it would mean that we need more space. I don't think we can handle it on our current servers.

So if we want to do this, step 1 would be to get a new server with more storage space, or to pay for a bucket in the cloud. Maybe cloud storage makes more sense, as it might be a bit safer than the UiO servers? We should scope out how much that might cost.

We can also discuss how the script should work:

- Should it just be a job which runs on a server, like a cron job, without any kind of web interface? Or do we want a web interface with a button where we can manually trigger the downloads + backups, as well as scheduling them automatically? A minimal cron-style version is sketched below.
- Do we want a UI to view the dwca backups, which divides up the dwcas per publisher and displays info about each dataset (name etc.)? We could also just have the dwcas hosted on an ftp server, or even just privately available to us on the server's file system.
- Do we want to allow extras like the ability to choose a specific publisher's dwcas to back up as an ad hoc job?
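
A minimal sketch of the cron-style option, assuming we use the public GBIF registry API to list each organisation's published datasets and follow their DWC_ARCHIVE endpoints; the organisation UUID and backup directory are placeholders, not design decisions:

```python
# Rough cron-style sketch (no web UI). Assumes the public GBIF registry API;
# the organisation key and BACKUP_DIR below are placeholders.
import pathlib
import requests

GBIF_API = "https://api.gbif.org/v1"
PUBLISHERS = ["<organisation-uuid>"]             # hypothetical publishing organisation keys
BACKUP_DIR = pathlib.Path("/data/dwca-backups")  # hypothetical target directory


def datasets_for_publisher(org_key):
    """Page through all datasets registered under one publishing organisation."""
    offset = 0
    while True:
        page = requests.get(f"{GBIF_API}/organization/{org_key}/publishedDataset",
                            params={"limit": 100, "offset": offset}, timeout=60).json()
        yield from page["results"]
        if page.get("endOfRecords", True):
            break
        offset += 100


def download_dwca(dataset_key, target_dir):
    """Follow the dataset's DWC_ARCHIVE endpoint (if any) and store the zip on disk."""
    detail = requests.get(f"{GBIF_API}/dataset/{dataset_key}", timeout=60).json()
    urls = [e["url"] for e in detail.get("endpoints", []) if e["type"] == "DWC_ARCHIVE"]
    if not urls:
        return None  # ties into the question below about unavailable dwcas
    target = target_dir / f"{dataset_key}.zip"
    with requests.get(urls[0], stream=True, timeout=300) as r:
        r.raise_for_status()
        with open(target, "wb") as fh:
            for chunk in r.iter_content(chunk_size=1 << 20):
                fh.write(chunk)
    return target


if __name__ == "__main__":
    for org in PUBLISHERS:
        out_dir = BACKUP_DIR / org
        out_dir.mkdir(parents=True, exist_ok=True)
        for ds in datasets_for_publisher(org):
            download_dwca(ds["key"], out_dir)
```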

What should happen if a dwca is not available for download, for whatever reason? How many dwcas do we keep - just the latest one, or the last 2 or 3, say? We should build the service so this is configurable, I think (see the retention sketch below). What happens if we have a dwca for a dataset which we find has now been deleted - should we delete our 'backup', or should we just log it in some way?
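
For the retention question, a sketch of a "keep the last N archives per dataset" rule, with N as a config value; the RETENTION_COUNT name and the one-directory-per-dataset layout are assumptions:

```python
# Retention sketch: keep only the newest N zips in each dataset's directory.
# RETENTION_COUNT and the one-directory-per-dataset layout are assumptions.
import logging
import pathlib

RETENTION_COUNT = 3  # configurable: how many historical dwcas to keep per dataset


def prune_old_archives(dataset_dir: pathlib.Path, keep: int = RETENTION_COUNT):
    """Delete all but the `keep` most recently modified archives for one dataset."""
    archives = sorted(dataset_dir.glob("*.zip"),
                      key=lambda p: p.stat().st_mtime, reverse=True)
    for old in archives[keep:]:
        logging.info("Pruning old archive %s", old)
        old.unlink()
```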

How are we going to monitor this service to make sure it's functional? We could just make it email us every time it successfully runs - see the sketch below.
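
The simplest version of that "email us on every run" idea could look something like this; the SMTP relay and sender address are placeholders, we would point it at whatever mail relay UiO lets us use:

```python
# Minimal 'email on every run' monitoring sketch; the SMTP host and sender
# address are placeholders, not our real infrastructure.
import smtplib
from email.message import EmailMessage


def report_run(n_datasets: int, n_failures: int):
    """Send a one-line summary of a backup run to the helpdesk address."""
    msg = EmailMessage()
    msg["Subject"] = f"dwca backup run finished: {n_datasets} datasets, {n_failures} failures"
    msg["From"] = "dwca-backup@example.org"   # placeholder sender
    msg["To"] = "helpdesk@gbif.no"
    msg.set_content("See the backup log on the server for per-dataset details.")
    with smtplib.SMTP("smtp.example.org") as smtp:  # placeholder relay
        smtp.send_message(msg)
```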

If we are expanding this to other countries, would we run those separately as different services (e.g. one for Ukraine), or do we just combine it all and let the script add as many countries as we like? A combined version could be as simple as the configuration sketched below.
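
If we combine everything into one service, the country list could just be configuration; assuming the GBIF dataset search supports filtering by publishingCountry, a sketch might be:

```python
# Sketch of a combined multi-country setup: the list of countries is plain
# configuration, assuming the dataset search's publishingCountry filter
# (ISO 3166-1 alpha-2 codes).
import requests

COUNTRIES = ["NO", "UA"]  # Norway, Ukraine - extend as needed


def datasets_for_country(country_code):
    """Page through all datasets published from one country."""
    offset = 0
    while True:
        page = requests.get("https://api.gbif.org/v1/dataset/search",
                            params={"publishingCountry": country_code,
                                    "limit": 100, "offset": offset}, timeout=60).json()
        yield from page["results"]
        if page.get("endOfRecords", True):
            break
        offset += 100
```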

We should also think about how to prioritise this new project. I think it would probably take two or three weeks to do properly, depending on what other work we have to juggle at the same time. What do you think @MichalTorma ? We had planned to work on the annotation system this week, but maybe this should supersede that?

Finally, I think even if we are not going to ask the other hosting institutions whether they need this, we should at least tell them that this service exists - otherwise they might suffer data loss with a failed backup and not realise that we can help them get the data back.

By the way, I have emailed USIT about our own servers' backups.

MichalTorma commented 2 years ago

I would say the best setup would be an offsite backup into a bucket (even if UiO has a backup strategy in place, redundancy is good here). The price depends on the number (and type) of operations: for example, 1 TB of cold data storage in a multi-region (europe) bucket, including 1 million operations (e.g. inserts), would cost ~ $17/month on Google cloud.

When it comes to how to do the backup, I'd like a small Django container just to keep track of what and when the backup is run (plus some metadata), plus some list of identifiers of what exactly to back up. The container would be run occasionally (say once a month?). This might be overkill, but it's just the way I think nowadays :D Anyway - you're the server boss here Rukaya, you know what design fits the contemporary architecture :)
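
As a rough illustration of that idea (not a settled design): a small Django model to track runs plus an upload into a Google Cloud Storage bucket could be as little as this. The bucket name, model fields and the google-cloud-storage dependency are all assumptions:

```python
# Rough illustration of the 'small Django container + bucket' idea.
# Bucket name, model fields and the google-cloud-storage dependency are
# assumptions, not decisions.
from django.db import models
from google.cloud import storage


class BackupRun(models.Model):
    """One row per attempted dwca backup, so we can see what ran and when."""
    dataset_key = models.UUIDField()
    started_at = models.DateTimeField(auto_now_add=True)
    succeeded = models.BooleanField(default=False)
    archive_uri = models.URLField(blank=True)  # gs:// path of the stored dwca


def upload_to_bucket(local_path, dataset_key, bucket_name="gbif-norway-dwca-backups"):
    """Copy one downloaded dwca into the (cold storage) bucket."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(f"{dataset_key}/{local_path.name}")
    blob.upload_from_filename(str(local_path))
    return f"gs://{bucket_name}/{blob.name}"
```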


rukayaj commented 2 years ago

USIT have replied: they say we need to email restore@usit.uio.no, and our servers are backed up as VMs. Their policy is to house the backups in a different building. There is a full backup of the VM once a week and an incremental backup every day.

rukayaj commented 2 years ago

They just sent another email: backups of extra versions and deleted files are kept for 90 days.

rukayaj commented 2 years ago

From biomeddata - we can also keep NIRD (https://archive.norstore.no/) in mind as a data storage solution. See also https://kommunikasjon.ntb.no/pressemelding/sigma2-anskaffer-ny-nasjonal-lagringsinfrastruktur?publisherId=17847635&releaseId=17924495

rukayaj commented 2 years ago

There is now a new and updated cost model for Sigma2 services -- such as the NIRD storage we talked about for the GBIF node data archive and/or backup services.

https://www.sigma2.no/user-contribution-model

We have a budget line in the GBIF project that should cover such costs -- if we choose to use NIRD.

I think the new model means that even storage needs below 10 TB will now have to pay something.

Non-commercial category A: 443 NOK per TB -- presumably per year ??

rukayaj commented 2 years ago

I noticed this - "Employees at UiO can only use cloud services provided by suppliers the university has data processing agreement with." From https://www.hf.uio.no/english/services/it/research-and-dissemination/storage-solutions/. Not sure how seriously we should take it?

dagendresen commented 2 years ago

As long as we use project funds from RCN --> I think not very serious at all ;-)

MichalTorma commented 2 years ago

I don't see them being able to provide what we need. Also, as Dag says - it's not UiO's money :)

dagendresen commented 2 years ago

Draft Sigma2 storage application https://docs.google.com/document/d/1r_IHEmdjyY8TWksu_VPQvZVXx0E-NT1E25TCU7aNcz4/edit#

dagendresen commented 2 years ago

GBIF Norway has been granted storage space at Sigma2 NIRD. Granted for 2022.1 to 2023.2 (two years) - renewable. https://www.metacenter.no/mas/projects/NS8095K/