awslabs / rds-support-tools

Amazon RDS Support Tools contains utilities, sql, scripts and views which are useful in a RDS environment
https://aws.amazon.com/rds/
Apache License 2.0
217 stars 229 forks source link

Purging Data in RDS #45

Closed nmadhire closed 7 years ago

nmadhire commented 7 years ago

Are there any support tools to purge the old data from RDS and store into a snapshot? RDS should only have say 1 month of data at anytime and historical data should be purged somewhere.

robins commented 7 years ago

You should be using a time series data structure, which should allow you to delete old data pretty efficiently... Plus this method is pretty scalable to data warehouses too.

dholmer commented 7 years ago

I'll fill in the details for what robins suggests here. Your time series design depends on your database engine. If your engine supports partitioned tables with local indexes, then the easiest way is to partition your table on a CREATION_DATE field and put a no-update trigger on it to make sure your rows won't migrate to other partitions. Then create a database job to create more partitions as you need them, and TRUNCATE OLD partitions when you don't (never DELETE time series data). Use of local instead of global partitions ensures that your indexes remain healthy throughout the life of the table. Other tricks here include: (1) make CREATION_DATE the same date for the entire partition. For example, if you are partitioning by month, make CREATION_DATE the first day of the month for the entire partition. This will allow you to put a locally-indexed PRIMARY KEY where the primary key column (pkid) is the lead column and CREATION_DATE is the second column in the primary key. Otherwise, you could opt to not use a primary key at all. (2) Don't create foreign key constraints that reference the primary key of your partitioned table. It's really hard to manage these.

If your database engine doesn't support partitioned tables with local indexes, then you can create separate tables for each "partition". So one table per month, in your example. Then put a UNION ALL view on all. Then you write a database job that creates a new table when you need it, and includes it in the view, and drops the table when you don't (or better yet you can truncate if you want to resuse it).

That said, I understand what you're asking -- will we in support who maintain this repository be offering in-database scripts to help you set this up? We have talked about it, and we really do get the usefulness of this. But the answer is probably not soon. At the present time we are only offering read-only diagnostic scripts and some non-invasive utilities.