
pygmy

Pygmy - Saving on Cloud bills by Scaling Down Postgres Replica DB Servers

If your replica traffic follows predictable daily load patterns, Pygmy is the tool for you!

Pygmy takes advantage of those predictable load patterns, resizing your replica DBs to something more appropriate at defined times of the day. When combined with pgBouncer, it can make sure clients never know you're saving money. Before taking any action it checks current activity and replication status, and it calls a script before and after stopping instances to play nicely with your environment. It also calls a script at the appropriate point when downsizing or upsizing cluster replicas to make sure no clients are impacted.


Pygmy runs on a central server, from which it manipulates the Postgres clusters it has been configured to know about. Setting up pygmy therefore involves both preparing your environment to be manipulated by pygmy, and setting up pygmy itself.

Environment setup

Make a db role

Pygmy requires a superuser role on every Postgres cluster it will be controlling. The role is used to observe load, connection counts, and replication lag.

Create the superuser role and make sure it can log into the postgres database on each cluster you'd like pygmy to manage.
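
For example, a minimal sketch; the role name and password below are placeholders, so pick whatever fits your environment and open pg_hba.conf access accordingly:

# Run on each cluster member; the 'pygmy' role name and password are illustrative
psql -U postgres -d postgres -c "CREATE ROLE pygmy WITH SUPERUSER LOGIN PASSWORD 'change-me';"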

Install the EnterpriseDB system_stats extension

This extension gives a way to determine system load independent of EC2 or RDS.

Install this extension however you like on all members of your cluster. Make sure it is installed in the postgres db instead of your application's db, as that is what pygmy will be connecting to.

# Adjust the Postgres major version (13 here) to match the version running on your cluster members
sudo apt-get -y install make gcc postgresql-server-dev-13 llvm clang

git clone https://github.com/EnterpriseDB/system_stats.git
cd system_stats/
PATH="/usr/lib/postgresql/13/bin:$PATH" make USE_PGXS=1
sudo PATH="/usr/lib/postgresql/13/bin:$PATH" make install USE_PGXS=1
psql -U postgres -d postgres
postgres=# CREATE EXTENSION system_stats;
CREATE EXTENSION

Pygmy setup

Set up a machine for pygmy to run on, making sure it has a Postgres database on it for pygmy to keep its state in.

Install system dependencies (e.g., for Ubuntu):

$ sudo apt install build-essential python3-dev libpq-dev python3-virtualenv libxml2-dev libxslt1-dev

Create virtualenv

$ virtualenv -p python3 venv

Start Project

Activate virtualenv

$ source venv/bin/activate

Install all dependencies

$ pip install -r requirements.txt

Make Migrations

$ python manage.py makemigrations

Alter the DB connection parameters so pygmy can reach the Postgres database where it will keep its state.

Run Migration

$ python manage.py migrate

Populate settings data (run this only once, at initial setup):

$ python manage.py populate_settings_data

Optional: You can enter secrets using the following interactive command. If you don't, pygmy will depend on an instance profile to talk to AWS, and will assume libpq can handle the necessary authentication for the pygmy db and for the dbs it will be managing.

$ python manage.py set_secrets

Load instance types data

$ python manage.py refresh_all_db_instance_types

Create an admin user from the command line

$ python manage.py createsuperuser

Pygmy will scan instances that have the following tag keys and values set:

"EC2_INSTANCE_POSTGRES_TAG_KEY_NAME": "Role"
"EC2_INSTANCE_POSTGRES_TAG_KEY_VALUE": "PostgreSQL"
"EC2_INSTANCE_PROJECT_TAG_KEY_NAME": "Project"
"EC2_INSTANCE_ENV_TAG_KEY_NAME": "Environment"
"EC2_INSTANCE_CLUSTER_TAG_KEY_NAME": "Cluster"

Load instance data

$ python manage.py get_all_db_data

Make DNS, pre-resize, and post-streaming scripts at scripts/{dns-change,pre-resize,post-streaming}.sh.

Pygmy calls each of these scripts at the appropriate point while downsizing or upsizing replicas, so they are the hooks for anything environment-specific you need to happen (see the sketch below).
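
A minimal placeholder, assuming only that the scripts must be executable and should exit zero on success; it logs whatever arguments pygmy passes so you can see what to build on:

#!/bin/bash
# scripts/dns-change.sh (the same skeleton works for pre-resize and post-streaming)
# Log the invocation so you can see how pygmy calls it, then report success.
echo "$(date -Is) $0 called with: $*" >> /tmp/pygmy-hooks.log
exit 0

Remember to chmod +x each script.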

Start local server ( Testing only )

$ python manage.py runserver

Prep Pygmy to run as a service...

Modify uwsgi.ini as needed.

...via systemd

Copy pygmy.service to /etc/systemd/system/

Enable and start uwsgi via systemd:

sudo systemctl enable pygmy.service
sudo systemctl start pygmy

...via nohup

Start uwsgi ( Production )

$  nohup uwsgi uwsgi.ini &

Restart uwsgi ( Production )

$  pkill -9 uwsgi
$  nohup uwsgi uwsgi.ini &

API Cookbook

Get a list of all clusters

curl -s  http://127.0.0.1:8000/v1/api/clusters | jq '.'

[ { "id": 249, "name": "project-loadtest-1", "primaryNodeIp": "10.37.71.67", "type": "EC2", "enabled": "true" }, { "id": 252, "name": "project-loadtest-jobs1", "primaryNodeIp": "10.37.90.106", "type": "EC2", "enabled": "true" } ]

Manage cluster 249

In this case, project-loadtest-1
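
For example, a minimal management entry might look like the sketch below; the load threshold is illustrative:

curl -X POST http://127.0.0.1:8000/v1/api/cluster/management \
   -H "Content-Type: application/json" \
   -d '{
          "cluster_id": 249,
          "avg_load": 12
      }'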

Whoops, we forgot to add instance types

We can delete the old rule and make a new one
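
A sketch of what that might look like: the DELETE-by-id form of the management endpoint is an assumption (analogous to the rules endpoint used later), the old entry's id of 1 is illustrative, and the instance types match the validation output below:

# Assumption: management entries can be deleted by id, like rules; 1 is the old entry's id
curl -X DELETE http://127.0.0.1:8000/v1/api/cluster/management/1

curl -X POST http://127.0.0.1:8000/v1/api/cluster/management \
   -H "Content-Type: application/json" \
   -d '{
          "cluster_id": 249,
          "avg_load": 12,
          "fallback_instances_scale_up": ["m5.24xlarge","m5.16xlarge"],
          "fallback_instances_scale_down": ["m5.large","m5.xlarge"]
      }'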

Validate

curl -s  http://127.0.0.1:8000/v1/api/cluster/management | jq '.'

[ { "id": 2, "avg_load": "12", "fallback_instances_scale_up": [ "m5.24xlarge", "m5.16xlarge" ], "fallback_instances_scale_down": [ "m5.large", "m5.xlarge" ], "check_active_users": null, "cluster_id": 249 } ]

Whoops, m5.xlarge isn't what we wanted, and we forgot to set check_active_users

We can also use PUT to modify a rule in place
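
For example, a sketch: addressing the entry as /cluster/management/2 is an assumption, and the replacement scale-down type and user patterns are illustrative:

# Assumption: the entry can be updated in place by id
curl -X PUT http://127.0.0.1:8000/v1/api/cluster/management/2 \
   -H "Content-Type: application/json" \
   -d '{
          "cluster_id": 249,
          "avg_load": 12,
          "fallback_instances_scale_up": ["m5.24xlarge","m5.16xlarge"],
          "fallback_instances_scale_down": ["m5.large","m5.2xlarge"],
          "check_active_users": ["project%","bench-rw"]
      }'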

Now manage cluster 252 (project-loadtest-jobs1)

curl -X POST http://127.0.0.1:8000/v1/api/cluster/management \
   -H "Content-Type: application/json" \
   -d '{
          "cluster_id": 252, 
          "avg_load": 2, 
          "fallback_instances_scale_up": ["m5.24xlarge","m5.16xlarge"], 
          "fallback_instances_scale_down": ["m5.large","m5.xlarge"],
          "check_active_users": ["project%","bench-rw"]
      }'

{"id":5,"avg_load":"2","fallback_instances_scale_up":["m5.24xlarge","m5.16xlarge"],"fallback_instances_scale_down":["m5.large","m5.xlarge"],"check_active_users":["project%","bench-rw"],"cluster_id":252}

Make DNS entries

When Pygmy manipulates a db, it will also twiddle DNS to move load away from, or back to, that replica. We will need to record the CNAME that Pygmy will change, but also record how Pygmy will find that CNAME when it is working on a specific db. Pygmy supports two ways of matching an actual instance it is working on to a DNS entry:

curl -X POST http://127.0.0.1:8000/v1/api/dns \
   -H "Content-Type: application/json" \
   -d '{
          "match_type": "MATCH_ROLE",
          "tag_role": "Slave",
          "cluster": 249,
          "instance_id": null,
          "dns_name": "cluster1-slave-rrdns.project-loadtest.insops.net", 
          "hosted_zone_name": "n/a"
      }'

# OR

for id in $(curl -s  http://127.0.0.1:8000/v1/api/instances | jq '.[] | select(.cluster == 252 and .isPrimary == 'false') | .id'); do
 curl -X POST http://127.0.0.1:8000/v1/api/dns \
   -H "Content-Type: application/json" \
   -d '{
          "match_type": "MATCH_INSTANCE",
          "instance_id": '$id', 
          "tag_role": null,
          "cluster": null,
          "dns_name": "clusterjobs1-secondary.project-loadtest.insops.net", 
          "hosted_zone_name": "n/a"
      }'; done

Disable a cluster

If you have a need to temporarily tell pygmy to stay away from a given cluster (say your automated cluster failover script is in progress), you can simply disable the cluster.

To disable

curl -X PUT http://127.0.0.1:8000/v1/api/cluster/toggle/project-loadtest-1 \
   -H "Content-Type: application/json" \
   -d '{
          "enabled": "false"
      }'

To re-enable


curl -X PUT http://127.0.0.1:8000/v1/api/cluster/toggle/project-loadtest-1 \
   -H "Content-Type: application/json" \
   -d '{
          "enabled": "true"
      }'

Make a simple scaledown rule

This is a simple scale down rule with a scheduled reverse. It is likely a terrible idea, but shows how the various checks can be defined.
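
A sketch using the same fields as the fuller rules below, with deliberately loose thresholds and an aggressive schedule; suppose the response comes back with "id": 12:

curl -X POST http://127.0.0.1:8000/v1/api/rules \
   -H "Content-Type: application/json" \
   -d '{
          "name": "Simple cluster1 scaledown",
          "typeTime": "CRON",
          "cronTime": "*/15 * * * *",

          "cluster_id": 249,
          "action": "SCALE_DOWN",
          "ec2_default_type": "m5.large",

          "enableCheckConnection": "on",
          "selectCheckConnectionOp": "less",
          "checkConnection": "1000",

          "enableAverageLoad": "on",
          "selectAverageLoadOp": "less",
          "averageLoad": "1000",

          "enableReverse": "on",
          "reverse_action": "SCALE_UP",
          "reverseCronTime": "*/30 * * * *"
      }'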

Get rid of a rule

Notice that its reverse rule also goes away

curl -X DELETE http://127.0.0.1:8000/v1/api/rules/12

Make a better rule

This is another simple example of a rule with a reverse, with more reasonable settings.

curl -X POST http://127.0.0.1:8000/v1/api/rules \
   -H "Content-Type: application/json" \
   -d '{ 
          "name": "Simple cluster1 control", 
          "typeTime": "CRON", 
          "cronTime": "0 * * * *",

          "cluster_id": 249,
          "action": "SCALE_DOWN",
          "ec2_default_type": "m5.2xlarge",

          "enableCheckConnection": "on",
          "selectCheckConnectionOp": "less",
          "checkConnection": "12",

          "enableReplicationLag": "on",
          "selectReplicationLagOp": "less",
          "replicationLag": "120",

          "enableAverageLoad": "on",
          "selectAverageLoadOp": "less",
          "averageLoad": "32",

          "enableRetry": "on",
          "retryAfter": "15",
          "retryMax": "3",

          "enableReverse": "on",
          "reverse_action": "SCALE_UP",
          "reverseCronTime": "30 * * * *"
    }'

Make a realistic rule

In most situations, we're going to want to be able to pre-emptively scale back up in case load returns earlier than we predict.

First the scaledown

curl -X POST http://127.0.0.1:8000/v1/api/rules \
   -H "Content-Type: application/json" \
   -d '{ 
          "name": "jobs1 downsize", 
          "typeTime": "CRON", 
          "cronTime": "0 * * * *",

          "cluster_id": 252,
          "action": "SCALE_DOWN",
          "ec2_default_type": "c5.large",

          "enableCheckConnection": "on",
          "selectCheckConnectionOp": "less",
          "checkConnection": "2",

          "enableReplicationLag": "on",
          "selectReplicationLagOp": "less",
          "replicationLag": "60",

          "enableAverageLoad": "on",
          "selectAverageLoadOp": "less",
          "averageLoad": "2",

          "enableRetry": "on",
          "retryAfter": "5",
          "retryMax": "3"
    }'

Make a scheduled scaleup rule for jobs1

We don't care about checks here - when the time comes, make sure the cluster is embiggened. Note that when scaling up, we'll make replicas tagged with the role of Backup into a c5.xlarge, instead of a c5.2xlarge.

curl -X POST http://127.0.0.1:8000/v1/api/rules \
   -H "Content-Type: application/json" \
   -d '{ 
          "name": "jobs1 deadline upsize", 
          "typeTime": "CRON", 
          "cronTime": "55 * * * *",

          "cluster_id": 252,
          "action": "SCALE_UP",
          "ec2_default_type": "c5.2xlarge",
          "ec2_role_types": ["Backup:c5.xlarge"],

          "enableRetry": "on",
          "retryAfter": "5",
          "retryMax": "3"
    }'

Make a pre-emptive scaleup rule for jobs1

This rule starts soon after the scaledown has completed, and repeatedly checks if the load has gotten too high or if replication has fallen behind. If so, then scale up before we normally would.
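
A sketch of what that rule might look like: the "greater" comparison operator is an assumption (only "less" appears in the examples above), and the retry settings are used here so the checks are re-evaluated periodically (also an assumption about retry behavior) until the scheduled upsize at :55 takes over:

curl -X POST http://127.0.0.1:8000/v1/api/rules \
   -H "Content-Type: application/json" \
   -d '{
          "name": "jobs1 pre-emptive upsize",
          "typeTime": "CRON",
          "cronTime": "10 * * * *",

          "cluster_id": 252,
          "action": "SCALE_UP",
          "ec2_default_type": "c5.2xlarge",
          "ec2_role_types": ["Backup:c5.xlarge"],

          "enableAverageLoad": "on",
          "selectAverageLoadOp": "greater",
          "averageLoad": "2",

          "enableReplicationLag": "on",
          "selectReplicationLagOp": "greater",
          "replicationLag": "60",

          "enableRetry": "on",
          "retryAfter": "5",
          "retryMax": "8"
      }'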