denishpatel / pygmy

Pygmy: Saving AWS Bills on Standby DB Servers

Demo notes #9

Open denishpatel opened 3 years ago

denishpatel commented 3 years ago
  1. The cluster name can be derived from a tag, e.g. the project name. They mostly use EC2 instances to run their clusters, and they like the idea of tagging pg-instance and ports.
  2. Rule criteria

    • We can ignore the historical data analysis rule for now and tackle it later. Just keep all metrics available so they can be used in a future release.

    • time

    • system load average on the EC2 instance (uptime command). Should we use CloudWatch or a SQL function (preferred)? For RDS, we have to use the CloudWatch metric to get the 15-minute load average. Collect load average metrics from both primary and secondary so we can decide when to scale down a replica server, including when there are multiple replica servers. For example: the primary server's 15-minute load average is 10, and the other two replica servers have load averages of 5 and 5 respectively. If this means the primary can handle the load from both replica servers, we can scale down both; otherwise we will only scale down one replica server. -- https://www.enterprisedb.com/blog/monitoring-postgresql-database-system-activities-performance-system-stats-extension

      postgres=# SELECT * FROM pg_sys_load_avg_info();
       load_avg_one_minute | load_avg_five_minutes | load_avg_ten_minutes | load_avg_fifteen_minutes
      ---------------------+-----------------------+----------------------+--------------------------
                      0.04 |                  0.05 |                 0.02 |
      (1 row)
    • replication lag

    • keep track of the number of connections, since during scale down we might have to kill some connections
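The "primary can absorb replica load" rule above could be sketched like this. A minimal Python sketch; the `primary_capacity` threshold (e.g. a vCPU-based limit) and the function name are assumptions for illustration, not anything pygmy defines:

```python
def replicas_to_scale_down(primary_load, replica_loads, primary_capacity):
    """Greedily pick replicas whose 15-min load average the primary can
    absorb without its projected load exceeding primary_capacity.

    primary_capacity is a hypothetical per-deployment threshold; it is
    an assumption for this sketch, not a pygmy setting.
    """
    projected = primary_load
    chosen = []
    # Absorb the least-loaded replicas first.
    for idx, load in sorted(enumerate(replica_loads), key=lambda p: p[1]):
        if projected + load <= primary_capacity:
            projected += load
            chosen.append(idx)
    return sorted(chosen)

# The example from the notes: primary at load 10, replicas at 5 and 5.
replicas_to_scale_down(10, [5, 5], 20)   # -> [0, 1]: scale down both
replicas_to_scale_down(10, [5, 5], 16)   # -> [0]: only one fits
```

With a generous capacity both replicas are scaled down, matching the example; with a tighter one only a single replica is chosen.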

  3. Scheduler
    • Let's assume we want a replica for a specific cluster to go down every day from midnight to 4am, except on some specific days/dates
    • Most of the calls will be done through API
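A scheduler check like the one described could look like this. This is a hedged sketch, not pygmy's implementation; the function name and the excluded-dates representation are assumptions, and it also allows windows that cross midnight (e.g. 3pm to 4am, as mentioned later in this thread), compared in UTC:

```python
from datetime import datetime, date, time, timezone

def in_downtime_window(now, start, end, excluded_dates=()):
    """True if `now` (a timezone-aware UTC datetime) falls inside the
    per-cluster scale-down window, skipping excluded calendar dates.

    start/end are datetime.time values in UTC. Windows may cross
    midnight (e.g. 15:00-04:00).
    """
    if now.date() in excluded_dates:
        return False
    t = now.time()
    if start <= end:                      # e.g. 00:00-04:00
        return start <= t < end
    return t >= start or t < end          # e.g. 15:00-04:00, crosses midnight

# Midnight-to-4am window, checked at 02:30 UTC:
now = datetime(2021, 6, 1, 2, 30, tzinfo=timezone.utc)
in_downtime_window(now, time(0, 0), time(4, 0))          # True
in_downtime_window(now, time(0, 0), time(4, 0),
                   excluded_dates=[date(2021, 6, 1)])    # False
```

The API endpoint driving the scale-down would presumably call a check like this per cluster on each scheduler tick.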
  4. Baseline
    • the replica will always scale down from its baseline state during the window
    • on scale up, the replica will always return to the baseline-size instance
  5. Replica-only flag - global pygmy installation level
    • to prevent someone from accidentally scaling the primary instance up or down
  6. Alerting
    • if the EC2 instance type is not available, then choose the next instance type, but alert by email and show it on the alerting tab in the GUI
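The fallback-with-alert behavior might be sketched as below. The function name and the ordered `fallbacks` list are assumptions for illustration; how pygmy actually ranks substitute instance types is not specified in these notes:

```python
def pick_instance_type(preferred, available, fallbacks):
    """Return (instance_type, needs_alert).

    `fallbacks` is a hypothetical ordered list of acceptable substitute
    EC2 instance types (an assumption, not a pygmy API). needs_alert
    signals that the caller should email and surface a GUI alert.
    """
    if preferred in available:
        return preferred, False
    for candidate in fallbacks:
        if candidate in available:
            return candidate, True
    raise RuntimeError("no acceptable instance type available")

# Preferred type is out of capacity, so the first available fallback wins:
pick_instance_type("m5.large", {"m5.xlarge", "m4.large"}, ["m4.large"])
# -> ("m4.large", True): scaling proceeds, but an alert should fire
```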
benchub commented 3 years ago

Ohai! Just some clarifications where I communicated poorly:

  1. I actually don't want to tag our dbs with the open port; all our dbs listen on the same port so this isn't something we'd find useful. I was thinking, if the port tag isn't defined, we could just have it default to port 5432. But we do have 3 tags (Project, Environment, and Cluster) that we use to uniquely define our clusters. When naming a cluster during autodiscovery, it would probably be good to have a definable template, e.g. "{{Project}}-{{Environment}}-{{Cluster}}" would work perfectly for us, but other people might want a different tag hash.
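The template idea is simple to prototype. A minimal sketch assuming the `{{Tag}}` placeholder syntax from the example above; the function name is hypothetical:

```python
import re

def cluster_name(template, tags):
    """Render "{{Tag}}" placeholders from an EC2 tag dict.

    The placeholder syntax is an assumption based on the example in
    this thread, not a documented pygmy feature.
    """
    return re.sub(r"\{\{(\w+)\}\}", lambda m: tags[m.group(1)], template)

tags = {"Project": "shop", "Environment": "prod", "Cluster": "main"}
cluster_name("{{Project}}-{{Environment}}-{{Cluster}}", tags)
# -> "shop-prod-main"
```

A missing tag raises KeyError here, which autodiscovery would probably want to catch and report rather than crash on.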

  2. When it comes to killing connections on scale up/down events, we're going to need to keep track of more than just the number of connections - the connection db role will also be important. Different roles have different importance.
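Role-aware connection killing could start from something like this. A hedged sketch: which roles are expendable is per-deployment policy, and the pairs would come from `SELECT pid, usename FROM pg_stat_activity` (with `pg_terminate_backend(pid)` doing the actual disconnect); the function name is hypothetical:

```python
def pids_to_kill(connections, expendable_roles):
    """Pick backend pids whose db role is safe to disconnect.

    connections: iterable of (pid, rolname) pairs, e.g. fetched from
    pg_stat_activity. expendable_roles: roles policy says may be killed
    first (an assumption; role importance is deployment-specific).
    """
    return [pid for pid, role in connections if role in expendable_roles]

conns = [(101, "reporting"), (102, "app"), (103, "reporting")]
pids_to_kill(conns, {"reporting"})   # -> [101, 103]
```

Each returned pid would then be passed to `pg_terminate_backend()` before the scale event proceeds.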

  3. I used "midnight to 4am" a lot when I was talking, but that was just hypothetical. That might be perfect for some of our clusters, but others might be "3pm to 4am". And of course that's local time - with clusters all over the world, we'll want to define these windows per cluster, and in UTC.

denishpatel commented 3 years ago

@benchub noted. Makes sense.

For the tags: yes, if you don't provide a ports tag, it will use the default port 5432. However, you need to tag "type:pg-instance" so we can identify EC2 instances running a Postgres cluster. We can easily accommodate Project, Environment, and Cluster tags and name the cluster accordingly.
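For autodiscovery, the tag filter is straightforward to build for boto3's `ec2.describe_instances(Filters=...)`. A sketch of just the filter construction (pure and testable); the helper name and the exact tag key ("type") are taken from the comment above, but treat the shape as an assumption rather than pygmy's actual code:

```python
DEFAULT_PG_PORT = 5432  # used when no ports tag is present

def pg_instance_filters(extra_tags=None):
    """Build EC2 DescribeInstances filters matching instances tagged
    type:pg-instance, plus any extra tag constraints (e.g. Project)."""
    filters = [{"Name": "tag:type", "Values": ["pg-instance"]}]
    for key, value in (extra_tags or {}).items():
        filters.append({"Name": f"tag:{key}", "Values": [value]})
    return filters

# Would be passed as, e.g.:
#   ec2 = boto3.client("ec2")
#   ec2.describe_instances(Filters=pg_instance_filters({"Project": "shop"}))
```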