NYCPlanning / ae-data-flow

Data pipelines to populate Application Engineering databases
1 stars 0 forks source link

Experiment with foreign data wrappers #49

Open TangoYankee opened 1 month ago

TangoYankee commented 1 month ago

Description

Our data flows depend on the etl database knowing the structure of the target application database. One possible approach to syncing the data between these two databases is a foreign data wrapper. This ticket is to explore their use. The end result should be a draft pull request with an example of leveraging these tools and a PR description that describes its use. A comment below has a quick brief

If the experiment goes well, we can use it for the ETL of the Capital Planning Explorer prototype

Acceptance Criteria

TangoYankee commented 1 month ago

The ty/fdw branch has a sql/tax-fdw.sql file with code to make a foreign data wrapper (fdw).

FDWs behave in a similar way to shallow copies of objects in javascript. With shallow object copies, it is a reference to the original object. Updating the copy also updates the original. We can use FDWs to apply this paradigm to tables across databases. Imagine we have Database A (the target database) with Table Foo. We want to transform some data in Database B (the ETL Database) and use the resulting data to populate Database B. We can use a FDW wrapper to create a reference in Database B to the Foo Table in Database A. The reference enforces the schema and constraints of the Foo table, without needing to fully replicate it in Database B. Unfortunately, there are limitations to this approach. First, the SQL command directly uses the password. There seem to be ways to replace the direct usage with an environment credential but more research is required. However, the bigger issue relates to custom types. The database creating the FDW needs to have all of the data types used in the target table. This means we need to define our enums within the ETL database. This isn't as much of a maintenance burden as defining the whole table schema. However, it does undermine the main benefit- not needing to maintain schemas in two different places.

I can still think of three alternatives to Foreign Data Wrappers: 1) PG Dump the schema from the target Database to the ETL database. Though, this may still fail to account for custom types 2) Use the drizzle schemas to migrate the ETL database 3) Literally use the Target database as the ETL database, as well- perhaps separating the source tables into their own schemas to help keep things clearly separated