MeltanoLabs / target-postgres

MIT License
10 stars 15 forks source link

Feature request - support for column encryption with `pgcrypto` module #181

Open menzenski opened 11 months ago

menzenski commented 11 months ago

We are using target-postgres with Postgres 13 on AWS RDS Aurora.

A lot of the data that we are moving from upstream sources into Postgres has been encrypted in the source system, and can't be decrypted within Postgres after it is loaded.

I have written a custom mapper/transform plugin using the SDK that can decrypt all of these records during meltano run, but I don't want to write them to the database as plain text - there is sensitive data in there (PII) and I would like to keep this data encrypted at rest.

What I'm hoping to achieve is something like this: for a given target-postgres plugin instance, configure it with a column name and an encryption key, and have the tap use the pgp_sym_encrypt function of the pgcrypto module.

I'm happy to try to contribute this feature myself but I wanted to open an issue first and see if there is a best way to make this change.

visch commented 11 months ago

Curious one for sure. Possible of course but not sure the best implementation method.

My first idea hops to stream maps but that doesnt' work here as stream maps are in memory transforms.

It's like instead of stream maps, you want SQL column maps. Or something. The idea being you could modify the SQL used to populate the TEMP table before getting upsertted.

Maybe the config would look something like

sql_column_transform:
   - "sink_name":
       "column_name": pgp_sym_encrypt(column_name, encryption_key)

This would theoretically allow us to let this work with a number of other use cases that I can't imagine right now but also seem a bit overkill.

Easiest thing to do today with a postgres feature is probably to encrypt using a stream map with python which bypasses the need to develop this but I don't know how important using postgres native functionality is to you and others!

menzenski commented 11 months ago

Easiest thing to do today with a postgres feature is probably to encrypt using a stream map with python which bypasses the need to develop this but I don't know how important using postgres native functionality is to you and others!

This is basically what we have today "out of the box" - the issue is that it's "easy" to decrypt with Python in a stream map, but it's not possible to decrypt these fields after they've been written to the database. I think if we were hosting Posgtgres ourselves and had full access to create new Python UDFs, that it would be possible, but AWS RDS does not allow that access.

visch commented 10 months ago

but it's not possible to decrypt these fields after they've been written to the database.

You can use any encryption mechanism you'd like right so some of them you can decrypt like AES or something and you could definitely decrypt it.

When you pull the data out of the postgres db do you need to be able to decrypt it right in the DB? Or can you consume it with python again to decrypt it? https://stackoverflow.com/questions/56311405/aws-rds-postgres-crypto-functions-doesnt-work-even-with-the-pgcrypto-extension is interesting