SPIKE: Find out if Aurora or RDS can export to PostgreSQL native format

ConradBoydElliottGustafson commented 2 years ago

We want to know if we are "locking in" to a particular flavor of postres or if we could move our data to something else (Azure, on-prem, etc.)

Create an aurora db, put some data in it, export, import
Same for RDS

AC: Try to export from Aurora and import into a local PostgreSQL db Try to export from RDS PostgreSQL and import into a local db Report the results and discuss with PO if it impacts our architecture

rparfig commented 2 years ago

Used famdb (Dev) as a source for testing aurora db backups

The snapshots (or backups) reside within the RDS cluster and they seem to be only recognizable by RDS. However, there is an option to export the snapshot to S3 but in Parquet format, which cannot be used to restore the DB to a vanilla Postgres Database
In an Aurora serverless deployment, there is no access to the command line, so pg_dump cannot be invoked. pg_dump could be used to export backups directly to S3, as ad-hoc backups and as not as a replacement of the out-of-the-box maintenance tasks (for practicallity)

rparfig commented 2 years ago

Backing up to S3 using pg_dump https://dba.stackexchange.com/questions/221454/best-way-to-pg-dump-postgresql-on-rds-to-s3

rparfig commented 2 years ago

Created an RDS instance called "rds-temp-postgres" on Dev environment (sitting next to famdb), using the Free-Tier, as:
DB Identifier: rds-temp-postgres
Role: Instance (as opposed to famdb which is Serverless)
Engine: PostgreSQL (as opposed to Aurora PostgreSQL)
Version: 13.4 (as opposed to 10.18 - which is expected as it is a different product)
Size: db.t3.micro (as opposed to 0 capacity units for famdb)

Results:
Able to describe the instance with AWS CLI from terminal: aws rds describe-db-instances --db-instance-identifier rds-temp-postgres
Snapshots were automatically setup by RDS, with a retention of 7 days
Instance was automatically set for minor version upgrades
Instance was not accessible externally to run Postgres backup tools, nor pgAdmin

Errors: postgres@CA-L3H3ZPN2://mnt/c/Clients/FSA/git$ psql --host=rds-temp-postgres.cp9oqzf51oiq.ca-central-1.rds.amazonaws.com --port=5432 --username=postgres --password --dbname=rds-postgres Password:

psql: error: could not connect to server: Connection timed out Is the server running on host "rds-temp-postgres.cp9oqzf51oiq.ca-central-1.rds.amazonaws.com" (10.12.53.118) and accepting TCP/IP connections on port 5432?

basilv commented 2 years ago

While RDS instance isn't accessible externally, AWS RDS allows one to do manual or automatic snapshots and restore as desired from snapshots to new or existing database instance. I'm not sure if the snapshot can be exported as a file (to e.g. use locally), that would be worth testing.

rparfig commented 2 years ago

@basilv : That is correct. Through my testing, those snapshots are Aurora RDS snapshots and only usable by RDS. As an alternative, you can save the snapshot to S3, but even if you download it from S3, it will be a Parquet format file, not a postgres-compatible backup file.

basilv commented 2 years ago

Sorry, thought you were talking above about managed RDS rather than Aurora.

rparfig commented 2 years ago

Update: Conversations with the AWS Team confirmed our findings. Aurora snapshots are not to be used for restoring non-Aurora Postgres DBs.

Alternatives:

Setup EC2 with vanilla Postgres
Use AWS Glue to extract data to another Postgres DB when needed. Although Glue is intended for transformations, it could serve the purpose.
The best option, and recommended by AWS is to explore using AWS DMS (Data Migration Service). AWS reps have confirmed that the service not only serves as an aid to migrate databases to the cloud, but also to other platforms, including on-prem.

Investigating DMS:

The first step is to create a replication instance, which will serve as a task executor, or orchestrator.
Attempting to create a replication instance fails with this error, that has been raised to the Pathfinder team via Rocketchat:

Richard Pardo-Figueroa @rpardofigueroa 8:31 AM hello team: I am facing this warning when trying to create a Replication instance using AWS DMS (Database Migration Service):

DMS requires access permissions to manage your VPC resources. By clicking Create replication instance, you grant permission for DMS to create a role that has this access.

followed by this error:

User: arn:aws:sts::521834415778:assumed-role/BCGOV_WORKLOAD_developer_umafubc9/rpardofi@idir is not authorized to perform: iam:GetRole on resource: role dms-vpc-role because no identity-based policy allows the iam:GetRole action

Could someone provide me with some assistance on resolving this issue? Is it perhaps because DMS does not seem to be available as a AWS Service, approved by the Pathfinder team? or I need elevated privileges to allow the DMS service to create a role?

ArogeG commented 2 years ago

@rparfig is investigating why DMS can't see the FAM DB. consider allowing Security groups, inbound and outbound flow from the DB.

basilv commented 2 years ago

DMS would need to query the FAM DB so this means the FAM DB security group inbound flow needs to allow access from DMS, while DMS security group outbound flow needs to allow access to FAM. But usually outbound flows are unlimited, so it is likely just the inbound flow for FAM DB that needs to be configured.

basilv commented 2 years ago

Had a session with Richard and got things working.

Learnings from the AWS configuration - there's a set of initial security groups created as part of the initial environment. The database should be placed in the data_sg security group, and the app in the app_sg security group, which then allows network access via the inbound rule in data_sg allowing traffic from resources in the app_sg security group.

To get DMS to work, it needed to be placed in the app_sg security group.

The existing FAM aurora database has a dedicated security group created by terraform that likely isn't needed.

My access in AWS (as a developer role) did not allow me to even see any of the RDS database instances. I could see all the security groups, but was unable to see the details of the terraform-created security group (e.g. couldn't see the inbound rules, while I could see the rules of the initially created groups like data-sg). Richard with his admin role was able to see all of this.

rparfig commented 2 years ago

Testing was successfully completed using DMS, copying the FAM database to a PostgreSQL DB also hosted in RDS AWS.

Challenge:

Any snapshot in RDS is stored in a proprietary format. The snapshot can be exported to S3, but it remains as a Parquet file.
The new database is not accessible outside of the AWS tenancy. Attempting to create an egress rule fails with "operation not allowed", perhaps due to OCIO's security policies.

Action:

Given that we have proven that DMS can be used to copy the structures and data of an RDS database to a vanilla progress, we can assume that this migration method will work. However, we will try the following:

Create an EC2 instance (micro, free tier) with AWS Linux or RedHat Linux (does not matter)
Install postgres
Create an empty database
Migrate FAM to this newly created database
Export database to S3 storage using pg_dump
Attempt to download the file and restore it to a local database.

rparfig commented 2 years ago

Alternative suggested by @ConradBoydElliottGustafson :

Run an ECS instance (a EC2 instance as a service) with Postgres. It allows for sending commands that would otherwise be executed in the command line. The output could be written to S3 and will avoid the creation of a permanent EC2 instance, as ECS is ephemeral.

Actions:

Create EC2 instance/PostgreSQL/DMS migration/S3 export/download/restore to local
Create ECS instance/PostgreSQL/DMSmigration//S3 export/download/restore to local

rparfig commented 2 years ago

Created an EC2 instance with Amazon Linux, using the free tier: I was able to access the console and install Postgres, but was unable to figure out the networking settings. Attempts to change security rules between security groups caused errors. Dropped the instance.
Created another EC2 instance with Amazon Linux, using the free tier, in the security group "App" (the same security group as the DMS instance). I am unable to access the command line to install Postgres. Investigating....

rparfig commented 2 years ago

Reached out to the Amazon team for assistance, explaining the issue with not being able to run the shell in the newly created EC2 instance. Awaiting response

rparfig commented 2 years ago

In a new attempt to connect to the instance, I found that the Session Manager is now enabled. The instance was created almost 24 hours ago and it is strange to see that suddenly became enabled.
Started shell and installed Postgres executing the following commands successfully:

$ sudo amazon-linux-extras install epel $ sudo tee /etc/yum.repos.d/pgdg.repo<<EOF [pgdg13] name=PostgreSQL 13 for RHEL/CentOS 7 - x86_64 baseurl=http://download.postgresql.org/pub/repos/yum/13/redhat/rhel-7-x86_64 enabled=1 gpgcheck=0 EOF $ sudo yum install postgresql13 postgresql13-server $ sudo /usr/pgsql-13/bin/postgresql-13-setup initdb $ systemctl status postgresql-13 $ sudo su - postgres $ psql postgres=# create database famdb; CREATE DATABASE postgres=# create schema app_fam; CREATE SCHEMA

rparfig commented 2 years ago

Firewall seems to allow all traffic from subnet, but facing this issue:

Test Endpoint failed: Application-Status: 1020912, Application-Message: Cannot connect to ODBC provider Network error has occurred, Application-Detailed-Message: RetCode: SQL_ERROR SqlState: 08001 NativeError: 101 Message: [unixODBC]could not connect to server: Connection refused Is the server running on host "ip-10-12-102-245.ca-central-1.compute.internal" (10.12.102.245) and accepting TCP/IP connections on port 5432?

rparfig commented 2 years ago

Reached out to the OCIO folks as well as to the AWS team. No response from OCIO, waiting for AWS to show availability or book something for us, as Brent is quite booked up.

Started from scratch, after realizing that there is a setting that I was unintentionally ignoring, that allowed me to create an EC2 outside of the Application subnet.

Created an EC2 on the App subnet, installed Postgres with an empty database. Unable to connect to it from DMS.
Created an EC2 on the Data subnet, installed Postgres with an empty database. Unable to connect to if from DMS. This should have worked without issues as the Data subnet has the port 5432 open for Ingress

DMS resides in the App group, so it should be able to connect to anything on the App subnet, but 5432 is not open in that subnet, so the first EC2 was expected to fail.

basilv commented 2 years ago

You can always add a new security group with a custom inbound rule to allow the traffic from DMS. But I agree that the second scenario should have worked.

rparfig commented 2 years ago

Tried a different approach. Using the EC2 running on the APP Security Group, I tried to run pg_dump against the RDS Aurora Postgres DB and it worked:

-bash-4.2$ pg_dump -h fam-aurora-db-postgres.cluster-cp9oqzf51oiq.ca-central-1.rds.amazonaws.com -p 5432 -Fc -U sysadmin famdb > app_fam_db.dump
Password:
-bash-4.2$ ls -ltr
total 52
drwx------ 4 postgres postgres    51 Jul 27 22:14 13
-rw-r--r-- 1 postgres postgres 51075 Aug  9 00:06 app_fam_db.dump

Did not work:


-bash-4.2$ pg_dumpall -U sysadmin -h fam-aurora-db-postgres.cluster-cp9oqzf51oiq.ca-central-1.rds.amazonaws.com -p 5432 --clean --file=fam_and_alldbs.dump
Password:
pg_dumpall: error: query failed: ERROR:  permission denied for relation pg_authid
pg_dumpall: error: query was: SELECT rolname FROM pg_authid WHERE rolname !~ '^pg_' ORDER BY 1

-bash-4.2$ psql psql (13.7) Type "help" for help.


postgres=# create user fam_proxy_api;
CREATE ROLE
postgres=# create database app_fam_restored;
CREATE DATABASE
postgres=# \q
-bash-4.2$ pg_restore -d app_fam_restored app_fam_db.dump
pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 5; 2615 118795 SCHEMA app_fam sysadmin
pg_restore: error: could not execute query: ERROR:  role "sysadmin" does not exist
Command was: ALTER SCHEMA app_fam OWNER TO sysadmin;

Errors are caused because RDS does not have a postgres user we can use (or available credentials). It uses "sysadmin" instead and the restore throws warnings. However, the restore works perfectly.

-bash-4.2$
-bash-4.2$ psql -d app_fam_restored
psql (13.7)
Type "help" for help.

app_fam_restored=# \d app_fam.fam_application;
                                          Table "app_fam.fam_application"
         Column          |              Type              | Collation | Nullable |             Default
-------------------------+--------------------------------+-----------+----------+----------------------------------
 application_id          | bigint                         |           | not null | generated by default as identity
 application_name        | character varying(100)         |           | not null |
 application_description | character varying(200)         |           | not null |
 create_user             | character varying(30)          |           | not null |
 create_date             | timestamp(6) without time zone |           | not null | CURRENT_DATE
 update_user             | character varying(30)          |           |          |
 update_date             | timestamp(6) without time zone |           |          | CURRENT_DATE
Indexes:
    "fam_app_pk" PRIMARY KEY, btree (application_id)
    "fam_app_name_uk" UNIQUE CONSTRAINT, btree (application_name)
Referenced by:
    TABLE "app_fam.fam_application_group_xref" CONSTRAINT "reffam_application20" FOREIGN KEY (application_id) REFERENCES app_fam.fam_application(application_id)
    TABLE "app_fam.fam_role" CONSTRAINT "reffam_application22" FOREIGN KEY (application_id) REFERENCES app_fam.fam_application(application_id)
    TABLE "app_fam.fam_application_client" CONSTRAINT "reffam_application31" FOREIGN KEY (application_id) REFERENCES app_fam.fam_application(application_id)

app_fam_restored=# select * from app_fam.fam_application;
 application_id | application_name |  application_description  |  create_user  |     create_date     |  update_user  |     update_date
----------------+------------------+---------------------------+---------------+---------------------+---------------+---------------------
              1 | fam              | Forests Access Management | fam_proxy_api | 2022-07-20 00:00:00 | fam_proxy_api | 2022-07-20 00:00:00
              2 | fom              | Forest Operations Map     | fam_proxy_api | 2022-07-20 00:00:00 | fam_proxy_api | 2022-07-20 00:00:00
(2 rows)

app_fam_restored=#

rparfig commented 2 years ago

Meeting with the Pathfinder team scheduled for Monday Aug 15th@3PM, to follow up on configuration issues on our name plate or perhaps in the internal configuration of our AWS Tenancy and the security guardrails set in place.

fadekemiatitebi commented 2 years ago

@rparfig please can you include me in this meeting. In the meantime, are we able to stop the two EC2 instances if we are not using them until next week after discussions with the pathfinder team.

rparfig commented 2 years ago

@fadekemiatitebi : I was careful on choosing the Amazon Free Tier for these EC2 instances, so we should not be charged for these services. However, I have shutdown both EC2s. I have also forwarded you the invite for the meeting with the Pathfinder team on Monday

fadekemiatitebi commented 2 years ago

Thanks @rparfig, It looks like there are some EC2 charges, although not significant. If you still need the instance, you can keep it running. Otherwise stopping it may not be a bad idea since we can always restart it.

rparfig commented 2 years ago

Issue has been solved, with the assistance of the Pathfinder team, during the session held on Aug 15, 2022:

postgresql.conf should be changed to include: ssl = on listen_addresses = "*"
Restarting the PostgreSQL service, allowed for successful traffic from DMS to the EC2 instance.

Outstanding / pending for decision:

Configure policies to export to S3 and make a backup to write into S3 bucket. It is possible to export to a S3 instance, but a policy to allow this has not yet been configured.
Create a certificate to allow DMS to open a SSL connection to the PostgreSQL instance: Given that this Spike has met its goal, this may not be necessary as it would be possible to prescind of DMS.

gormless87 commented 2 years ago

@rparfig - 'prescind' - thanks for the vocabulary expansion

bcgov / nr-forests-access-management

SPIKE: Find out if Aurora or RDS can export to PostgreSQL native format #33