Open denishpatel opened 3 years ago
Ohai! Just some clarifications where I communicated poorly:
I actually don't want to tag our dbs with the open port; all our dbs listen on the same port, so this isn't something we'd find useful. I was thinking that if the port tag isn't defined, we could just default to port 5432. But we do have 3 tags (Project, Environment, and Cluster) that we use to uniquely identify our clusters. When naming a cluster during autodiscovery, it would probably be good to have a definable template, e.g. "{{Project}}-{{Environment}}-{{Cluster}}" would work perfectly for us, but other people might want a different combination of tags.
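To make the naming idea concrete, here is a minimal sketch of how a "{{Tag}}" template could be filled from an instance's tags, with the port falling back to 5432. The function names and the `port` tag key are hypothetical, not something the project defines.

```python
import re

DEFAULT_PG_PORT = 5432  # fallback when the instance carries no port tag


def cluster_name(template: str, tags: dict) -> str:
    """Fill {{Tag}} placeholders in the template from an instance's tags."""
    def substitute(match):
        key = match.group(1)
        if key not in tags:
            # Fail loudly rather than produce a half-filled cluster name.
            raise KeyError(f"instance is missing tag {key!r}")
        return tags[key]

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


def instance_port(tags: dict, tag_name: str = "port") -> int:
    """Return the tagged port, or 5432 if the (hypothetical) port tag is absent."""
    return int(tags.get(tag_name, DEFAULT_PG_PORT))


tags = {"Project": "billing", "Environment": "prod", "Cluster": "c1"}
name = cluster_name("{{Project}}-{{Environment}}-{{Cluster}}", tags)
port = instance_port(tags)
```

Raising on a missing tag is one possible choice; skipping the instance during autodiscovery would be another.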
When it comes to killing connections on scale up/down events, we're going to need to keep track of more than just the number of connections - the connection db role will also be important. Different roles have different importance.
I used "midnight to 4am" a lot when I was talking, but that was just hypothetical. That might be perfect for some of our clusters, but others might be "3pm to 4am". And of course that's local time - with clusters all over the world, we'll want to define these windows per cluster, and in UTC.
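A per-cluster window check in UTC could look roughly like the sketch below. The `WINDOWS` mapping, the cluster names, and the function names are made up for illustration; the only real point is that a window such as 15:00 to 04:00 wraps past midnight and needs separate handling.

```python
from datetime import datetime, time, timezone

# Hypothetical per-cluster windows, expressed in UTC as (start, end).
WINDOWS = {
    "billing-prod-c1": (time(0, 0), time(4, 0)),    # midnight to 4am UTC
    "search-prod-c2": (time(15, 0), time(4, 0)),    # 3pm to 4am UTC, wraps midnight
}


def in_window(now: datetime, start: time, end: time) -> bool:
    """True if `now` (a timezone-aware datetime) falls in the UTC window [start, end)."""
    t = now.astimezone(timezone.utc).time()
    if start <= end:
        return start <= t < end
    # The window wraps past midnight, e.g. 15:00 to 04:00.
    return t >= start or t < end


def cluster_in_window(cluster: str, now: datetime) -> bool:
    """Look up the cluster's window; clusters without one are never in a window."""
    if cluster not in WINDOWS:
        return False
    start, end = WINDOWS[cluster]
    return in_window(now, start, end)
```

Passing a timezone-aware datetime matters here: `astimezone` treats a naive datetime as local time, which is exactly the ambiguity the UTC requirement is meant to avoid.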
@benchub noted. Makes sense.
For the tag: yes, if you don't provide a port tag, it will use the default port 5432. However, you need to tag instances with "type:pg-instance" so we can identify the EC2 instances running a Postgres cluster. We can easily accommodate the Project, Environment, and Cluster tags and name the cluster accordingly.
Rule criteria
We can ignore the historical data analysis rule for now and tackle it later. Just keep all the metrics available so we can use them in a future release.
time
System load average on the EC2 instance (as reported by the uptime command). Should we use CloudWatch or a SQL function (preferred)? For RDS, we have to use a CloudWatch metric to get the 15-minute load average. Collect load average metrics from both the primary and the secondary (replica) servers so we can decide when to scale down a replica, and how many to scale down if there are several. For example: the primary's 15-minute load average is 10, and the other two replica servers have load averages of 5 and 5 respectively. If the primary is able to handle the load from both replicas, we can scale down both; otherwise we will scale down only 1 replica. See https://www.enterprisedb.com/blog/monitoring-postgresql-database-system-activities-performance-system-stats-extension
replication lag
number of connections: keep track of it, since we might have to kill some connections during scale down
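The load-average example above (primary at 10, two replicas at 5 each) can be sketched as a small decision function. This is only an illustration under stated assumptions: `primary_capacity` is a hypothetical threshold for the highest 15-minute load the primary should carry, and a removed replica's load is assumed to move entirely to the primary. Neither is defined anywhere in this thread.

```python
def replicas_to_scale_down(primary_load, replica_loads, primary_capacity):
    """Return how many replicas could be removed if the primary absorbs their load.

    primary_capacity is an assumed threshold, not an existing setting.
    """
    projected = primary_load
    removable = 0
    # Consider the least-loaded replicas first: they are the cheapest to absorb.
    for load in sorted(replica_loads):
        if projected + load > primary_capacity:
            break
        projected += load
        removable += 1
    return removable


# Primary at 10, replicas at 5 and 5, assumed capacity of 20.
both = replicas_to_scale_down(10, [5, 5], primary_capacity=20)
# Same loads with a tighter assumed capacity of 16.
one = replicas_to_scale_down(10, [5, 5], primary_capacity=16)
```

Unlike the description above, this sketch can also return 0 when the primary cannot absorb even one replica. Replication lag and per-role connection counts would need to feed into the same decision and are left out here.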