emory-libraries / aspace


Explore local backup repository #36

Closed: erussey closed this 1 year ago

erussey commented 1 year ago

Explore with Lyrasis how to create a backup repository for ArchivesSpace data. It should capture a monthly or quarterly snapshot of the data.

We also need to explore policy around how long these backups need to be maintained. Emory policy suggests they are permanent, but I don't know how practical this is or if it applies to backup files.

kbowaterskelly commented 1 year ago

Per their sysadmin, they already have a way to upload backups to an S3 bucket, which is likely a fine solution. Their default appears to be daily, so we may need to alter their script. Still need to look at our policies.
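For what it's worth, if their backup job is cron-driven (an assumption on my part, not something they've confirmed), changing the cadence may be as small as editing the schedule field, e.g.:

0 2 * * * /usr/local/bin/aspace-backup.sh   (current: nightly at 02:00)
0 2 1 * * /usr/local/bin/aspace-backup.sh   (monthly: 02:00 on the 1st of the month)

The script path above is made up for illustration. It may also be simpler to leave their nightly default alone and handle retention on our side.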

erussey commented 1 year ago

ASpace records fall under the "Collection Catalog and Descriptive Records" and "Collection and Artifact Management" categories of the retention policy document (http://records.emory.edu/documents/libraries-archives-museum.pdf), which indicates that we keep the data until no longer administratively necessary/permanently. The major question is how long a backup is administratively necessary. Note that part of the concern isn't just restoring data that was accidentally deleted, but also being able to track changed information in the case of theft, etc.

I'm sure we don't want to keep backups forever, but I'm not sure where an appropriate line would be.

kbowaterskelly commented 1 year ago

I do not believe this retention policy necessarily applies to our backups, except in a case of data loss in production where the backups might be our only copies. So we should be able to design a backup policy according to our own needs, as long as it is sane. My impression is that the amount of data is quite small, since this is metadata, so a daily backup with one or two years of retention may be totally feasible at minimal cost.
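Rough back-of-the-envelope, with the per-dump size being a pure guess until we measure it: if each gzipped SQL dump is on the order of 100 MB, two years of daily backups is about 730 x 100 MB ≈ 73 GB, which at S3 Standard's roughly $0.023 per GB-month comes to under $2/month; even ten times that data volume would stay under $20/month.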

erussey commented 1 year ago

@kbowaterskelly: I think we can close this ticket once we have a size/cost estimate (and the broad outlines of a plan for how these backups will be captured and deleted, and on what schedule). I can then take that plan to the project sponsors for approval.

kbowaterskelly commented 1 year ago

I have not been successful in determining the size of the data set. I'd estimate the work of implementing the backup, pulling down a copy of the data and assessing the size, planning a schedule, evaluating the cost basis, and writing a proposal at about a full day's work.
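Once we have the bucket details (see the Lyrasis info relayed below), the size estimate itself should be a one-liner along these lines (profile name assumed, path taken from their reply):

aws s3 ls s3://privateupload/emory/ArchivesSpace/Backups/ --profile {{ EMORY }} --recursive --summarize --human-readable

The --summarize flag prints the total object count and total size at the end of the listing.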

kbowaterskelly commented 1 year ago

From Lyrasis (Blake Carver):

We already have a system in place for setting up an S3 bucket. I just need to send over the details, including the keys, to someone there. Backups land in that bucket every night and you can just grab them whenever.

The bucket name is "privateupload", your prefix is "emory", and backups land under that, so the path would be "privateupload/emory/ArchivesSpace/Backups/" followed by a file name like emory-YYYY-MM-DD-as-prod-general-app1.lyrtech.org.sql.gz.

So something like this to download a backup:

aws s3 cp --profile {{ EMORY }} s3://privateupload/emory/ArchivesSpace/Backups/emory-2023-01-13-as-prod-general-app1.lyrtech.org.sql.gz .

Or list them:

aws --profile {{ EMORY }} s3api list-objects --bucket privateupload --prefix emory/ArchivesSpace/Backups/
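To sanity-check a downloaded dump locally, something like the following should work, assuming it is a plain mysqldump (which the .sql.gz name suggests) and a throwaway local MySQL database (name made up here):

mysql -u root -p -e "CREATE DATABASE aspace_backup_test"
gunzip -c emory-2023-01-13-as-prod-general-app1.lyrtech.org.sql.gz | mysql -u root -p aspace_backup_test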

AGCooper commented 1 year ago

I created spike ticket #89 for the cost estimate and for documenting the backup strategy. It's in the backlog right now.
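If Lyrasis is willing to apply (or let us apply) an S3 lifecycle rule on our prefix, the deletion side of the plan could be a single expiration rule. A sketch with a two-year window follows; the 730-day figure, the file name, and whether our keys even permit this are all assumptions to confirm:

aws s3api put-bucket-lifecycle-configuration --bucket privateupload --profile {{ EMORY }} --lifecycle-configuration file://aspace-backup-retention.json

where aspace-backup-retention.json contains:

{
  "Rules": [
    {
      "ID": "expire-aspace-backups",
      "Filter": { "Prefix": "emory/ArchivesSpace/Backups/" },
      "Status": "Enabled",
      "Expiration": { "Days": 730 }
    }
  ]
}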