BluesparkLabs / spark

✨ Toolkit to develop, test and run Drupal websites.

Offsite database backups and sanitization #8

Closed jameswilson closed 6 years ago

jameswilson commented 6 years ago

Create a drupal:backup task:

  1. [x] Take a database dump and create a gzipped tarball of the files folder.
    • Leverage drush sql-dump; this will be generalized and refactored later to support more platform types. For now we're hardcoding Drupal 8 support.
    • Suggested filename formats:
      • [spark:name]-[env:ENVIRONMENT]-YYYY-MM-DD--HH-MM-SS.sql.gz for database backups.
      • [spark:name]-[env:ENVIRONMENT]-YYYY-MM-DD--HH-MM-SS.tgz for filesystem backups.
      • Prefer lowercased filenames for simplicity.
      • Use timestamp (HH-MM-SS) in filename to avoid same-day overwrites.
      • Use the app's ENVIRONMENT environment variable in the filename to ensure backups for multiple environments do not conflict if sent to the same S3 bucket. The ENVIRONMENT variable may be stored in the .env file in the project root, which is ignored by git and available to Robo via vlucas/phpdotenv. On Platform.sh, environment variables can be specified using platform-cli.
      • Use the application/project name variable to ensure backups for multiple projects do not conflict if sent to the same S3 bucket. The name variable should be stored in the application's [.spark.yml](https://github.com/BluesparkLabs/spark-example/blob/master/.spark.yml) file.
  2. [x] Copy database dump and files tarball to S3 bucket.
    • Leverage aws/aws-sdk-php
    • Store environment-specific configs in a .env file in the project root, ignored by git, and use vlucas/phpdotenv to load it into spark.
    • Store AWS credentials in ~/.aws/credentials and config in ~/.aws/config.
    • Leverage “profiles” (i.e., bracketed groupings) to maintain multiple keys for different projects.
    • Or use IAM role assigned to an EC2 instance, where profiles are not needed.
    • Add a spark config option for aws_profile
  3. [x] Remove old backup files from S3.
    • Delete S3 dumps older than 15 days (only concerned with the YYYY-MM-DD part of the filename; ignore the HH-MM-SS timestamp).
  4. [ ] Create sanitized, GDPR-compliant database dumps.
    • Refactor the dump task to allow both a normal dump and a sanitized dump.
    • Augment normal drush sql-dump commands with machbarmacher/gdpr-dump and the GDPR Drupal module to perform sanitization on the fly.
    • Note: These methods leverage ifsnop/mysqldump-php and fzaninotto/Faker.
    • We'll need the ability to specify the gdpr-replacements parameter, which denotes which db tables and columns to sanitize. It should be stored in .spark.yml and then converted to the required JSON format in the spark command task so it can be passed to the gdpr-dump command (see the conversion sketch after this list). For example:
      gdpr-replacements:
        tableName:
          columnName1:
            formatter: formatterType
            ...
          columnName2:
            formatter: formatterType
            ...
  5. [ ] Upload sanitized, GDPR-compliant database dumps to S3.
  6. [ ] Write a command to sync the db dump from S3.
    • Option to use the pristine or sanitized version.
    • Option to only download the database, or to download and reload it (drush sql-drop && drush sql-cli < dump.sql).
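
A minimal sketch of the .spark.yml-to-JSON conversion for item 4, assuming the config is parsed with Symfony's Yaml component; the table/column names are illustrative, and passing the JSON through drush's --extra option is an assumption that would need to be confirmed against gdpr-dump's documentation:

```php
<?php

require 'vendor/autoload.php';

use Symfony\Component\Yaml\Yaml;

// Hypothetical sketch: turn the gdpr-replacements structure from .spark.yml
// into the JSON string handed to gdpr-dump. Names here are illustrative.
$config = Yaml::parseFile('.spark.yml');

// e.g. ['users_field_data' => ['mail' => ['formatter' => 'safeEmail']]]
$replacements = $config['gdpr-replacements'] ?? [];

// gdpr-dump expects the table/column/formatter map as JSON.
$json = json_encode($replacements);

// Hand it to the dump command; escapeshellarg() keeps the JSON intact on the
// shell. The exact flag gdpr-dump expects is an assumption here.
$command = sprintf(
    'drush sql-dump --extra=%s',
    escapeshellarg('--gdpr-replacements=' . $json)
);
```
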
jameswilson commented 6 years ago

@balintk, related issue for the IULD8 project: https://bluespark.atlassian.net/browse/IULD8-403, where I'm proposing we start by adding a DrupalS3Backup.php Robo command file with two features:

1) Handle drush-based sql-dump and sync to S3 with awscli.
2) Handle clean-up of S3 to ensure we retain no more than 15 days of backups.

I've asked @citlacom to mention issue IULD8-403 in the commits to this repository so we can clearly see which project this was done for, similar to what Balint and Jose did for the SO project.

Later on, we can refactor / extend DrupalS3Backup.php to leverage ifsnop/mysqldump-php so we can sanitize the backups in a parallel backup step (because we'll need both pristine database dumps and sanitized ones).

jameswilson commented 6 years ago

I've implemented the first step here, by adding commands to take the sql-dump and to create a tarball of the files folder.
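
For illustration, a minimal sketch of what that step amounts to, assuming plain shell-outs; the real command is built with Robo tasks, and the paths and variable names here are placeholders:

```php
<?php

// Illustrative sketch only: build the suggested filename and shell out to
// drush and tar. The real Spark command uses Robo tasks; names are placeholders.

$name        = 'myproject';                        // from .spark.yml
$environment = getenv('ENVIRONMENT') ?: 'local';   // from .env via phpdotenv
$prefix      = strtolower($name . '-' . $environment . '-' . date('Y-m-d--H-i-s'));

// Gzipped database dump via drush.
exec(sprintf(
    'drush sql-dump --gzip --result-file=%s',
    escapeshellarg("/tmp/{$prefix}.sql")
));

// Gzipped tarball of the public files folder.
exec(sprintf(
    'tar -czf %s web/sites/default/files',
    escapeshellarg("/tmp/{$prefix}.tgz")
));
```
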

jameswilson commented 6 years ago

I've implemented the second step here by adding the aws/aws-sdk-php dependency and custom functionality to upload the db dump and files tarball to S3.
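
As a rough sketch of what that upload looks like with aws/aws-sdk-php (the bucket, key, and file names below are illustrative, not the exact values Spark uses):

```php
<?php

// Illustrative upload sketch using aws/aws-sdk-php; values are placeholders.

require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client([
    'version' => 'latest',
    'region'  => 'us-east-1',
    'profile' => 'default',   // resolved from ~/.aws/credentials
]);

$s3->putObject([
    'Bucket'     => 'bsp-myproject',
    'Key'        => 'backup/myproject-prod-2018-06-01--12-00-00.sql.gz',
    'SourceFile' => '/tmp/myproject-prod-2018-06-01--12-00-00.sql.gz',
]);
```
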

jameswilson commented 6 years ago

I've implemented the third step here to clean up old dumps. I've added a new option called --keep to specify how long to keep backup files around on S3. The default value is 15 days for GDPR compliance, but it can be overridden in .spark.yml.

The code adds the following command:

composer spark  drupal:backup  <options>

Options include:
--bucket    (Required) The S3 bucket destination for the backup.
            E.g. 'bsp-myproject'

--region    The AWS region to connect to. If left blank, the
            default value is 'us-east-1'. For a list of available
            regions, see http://bit.ly/s3-regions.

--profile   The AWS profile to use for connection credentials.
            Default value is 'default'. The AWS SDK will first
            try to load credentials from environment variables
            (http://bit.ly/aws-php-creds). If not found, and if
            this option is left blank, the SDK then looks for
            the default credentials in the `~/.aws/credentials`
            file. Finally, if you specify a custom profile
            value, the SDK loads credentials from that profile.
            See http://bit.ly/aws-creds-file for formatting info.

--keep      A string representing a relative amount of time to
            keep backups. The string must be parsable by PHP
            `strtotime`. The default value is '15 days', which
            is the recommendation for GDPR compliance. Files
            found in the backup folder on S3 that are older than
            this time will be removed.  New files uploaded to S3
            will have an Expires value set to now plus the
            specified time.  WARNING: be very careful modifying
            the value of this option as it can and will delete
            existing backups.

--truncate  A comma-separated list of tables whose data should be
            truncated (structure only) in the db dump. This maps to
            the drush sql-dump --structure-tables-list option.
            Default value is 'cache,cache_*,sessions,watchdog'.

--skip      A comma-separated list of tables to exclude entirely
            from the db dump.

--files     A string or array of paths/to/files/or/folders to
            include in the tarball. Paths should be relative
            to the project root directory and not to the webroot.

--exclude   A string or array of file or folder names to
            exclude from the tarball.

These command options can be specified inside the .spark.yml file, like so:

command:
  drupal:
    backup:
      options:
        bucket: bsp-myproject
        keep: 15 days
        truncate: cache,cache_*,sessions,watchdog
        skip: migrate_*
        files:
          - web/sites/default/files
          - private
        exclude:
          - css
          - js
          - styles
          - xmlsitemap
          - backup_migrate
          - ctools
          - php
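
For reference, a rough sketch of the --keep cleanup logic, assuming aws/aws-sdk-php; the bucket, prefix, and option handling below are illustrative rather than the exact Spark implementation:

```php
<?php

// Illustrative sketch of the --keep cleanup; values are placeholders.

require 'vendor/autoload.php';

use Aws\S3\S3Client;

$keep   = '15 days';                 // --keep option, parsable by strtotime()
$cutoff = strtotime('-' . $keep);    // anything older than this gets removed

$s3 = new S3Client([
    'version' => 'latest',
    'region'  => 'us-east-1',
    'profile' => 'default',
]);

$objects = $s3->listObjects([
    'Bucket' => 'bsp-myproject',
    'Prefix' => 'backup/',
]);

foreach ($objects['Contents'] ?? [] as $object) {
    // LastModified is a DateTime-like object on the SDK result.
    if ($object['LastModified']->getTimestamp() < $cutoff) {
        $s3->deleteObject([
            'Bucket' => 'bsp-myproject',
            'Key'    => $object['Key'],
        ]);
    }
}

// New uploads would also set an Expires value of now plus the --keep window,
// as described for the --keep option above.
```
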
jameswilson commented 6 years ago

Assigning to @balintk for final review, and moving the remaining tasks (db sanitization) to a follow-up issue so this one can be closed for now. Thanks!

balintbrews commented 6 years ago

Extracted all the remaining work—see referenced issues. This one is good to close, thanks for the great work here, @jameswilson.