alukach opened 2 years ago
I think this is an easy win for squeezing some more performance out of the smaller RDS instance classes, and using the formula approach should keep things tuned up through database resizing operations, too. Great idea @alukach! There are some pretty good docs on the topic as well.
If you play around a bit with pgtune, most of the parameters are only affected by the `DBInstanceClassMemory` attribute, but `work_mem` depends on `DBInstanceClassMemory` AND `DBInstanceVCPU`.
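For reference, the formula syntax that AWS accepts in a parameter group looks roughly like the sketch below. The specific values are illustrative starting points rather than recommendations, and scaling `max_parallel_workers` off the vCPU count is my assumption, not an RDS default.

```ts
// Sketch of formula-based parameters that could be handed to an
// `rds.ParameterGroup` (illustrative values, not tuned recommendations).
// DBInstanceClassMemory is reported in bytes.
const formulaParameters: Record<string, string> = {
  // bytes / 32768 = bytes / 8192 / 4 => ~25% of memory, counted in 8 kB pages
  shared_buffers: "{DBInstanceClassMemory/32768}",
  // bytes / 16384 = bytes / 8192 / 2 => ~50% of memory, counted in 8 kB pages
  effective_cache_size: "{DBInstanceClassMemory/16384}",
  // scale parallelism with the instance's vCPU count (assumption for illustration)
  max_parallel_workers: "{DBInstanceVCPU}",
};
```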
@bitner has some hints for settings in a few places in pgstac: https://github.com/stac-utils/pgstac/blob/858a984e2aad77d7a77cfe52ac988aedd726ad08/docker/pgstac/dbinit/pgstac.sh https://github.com/stac-utils/pgstac/blob/858a984e2aad77d7a77cfe52ac988aedd726ad08/src/pgstac/sql/001_core.sql#L248-L293
I successfully added some parameter group settings to a pgstac db for my foss4g demo stack: https://github.com/developmentseed/eoapi-devseed/commit/32c23c8acc23514fe79ea28a635dcc8967eea706#diff-cb48cee8202ee3a079eadbf36d979abe36adc0a6c83739265e57dbe0f08de556R38-R70
I hard-coded the values in a dictionary, but I like the formula approach more.
The only tricky part is that the units for each parameter are not obvious (e.g. 8 kB pages for `shared_buffers`).
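To make that unit bookkeeping explicit, here is a tiny conversion sketch. The unit sizes below reflect my understanding of the PostgreSQL defaults; treat them as assumptions and check the docs for each parameter before relying on them.

```ts
// Hypothetical helper: convert a desired byte count into the unit a given
// PostgreSQL parameter expects (e.g. shared_buffers is counted in 8 kB pages,
// work_mem in kB).
const PARAMETER_UNIT_BYTES: Record<string, number> = {
  shared_buffers: 8 * 1024,       // 8 kB pages
  effective_cache_size: 8 * 1024, // 8 kB pages
  work_mem: 1024,                 // kB
  maintenance_work_mem: 1024,     // kB
};

function toParameterValue(parameter: string, bytes: number): number {
  const unit = PARAMETER_UNIT_BYTES[parameter];
  if (unit === undefined) {
    throw new Error(`unknown unit for ${parameter}`);
  }
  return Math.floor(bytes / unit);
}

// e.g. shared_buffers at 25% of a 4 GiB instance:
// toParameterValue("shared_buffers", (4 * 1024 ** 3) / 4) === 131072
```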
If we want to implement this we should just need to identify a more appropriate set of default settings, maybe based on pgtune as a starting point.
The biggest wildcard with any formula that is used is that they all should be dependent on a realistic setting for max_connections, which should be driven by the actual expected number of connections. work_mem and max_connections must ALWAYS be set in tandem, and the general rule for a starting value is that max_connections * work_mem should generally be less than 1/4 of the system memory (this is also affected by the parallel process settings: if you are on a system with high settings for parallelism and a lot of vcpus, then work_mem * max_connections should be even lower).
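To make the arithmetic concrete, here is a rough sketch of that rule; the function name and numbers are just for illustration, and remember that each parallel worker can claim its own work_mem, so a parallel-heavy setup should budget lower still.

```ts
// Rough sketch of the sizing rule above: keep max_connections * work_mem under
// ~25% of system memory. Illustrative only, not a recommendation.
function suggestWorkMemKb(systemMemoryBytes: number, maxConnections: number): number {
  const budgetBytes = systemMemoryBytes / 4;        // ~1/4 of memory for all work_mem combined
  const perConnectionBytes = budgetBytes / maxConnections;
  return Math.floor(perConnectionBytes / 1024);     // work_mem is expressed in kB
}

// e.g. a 2 GiB instance with max_connections = 100:
// suggestWorkMemKb(2 * 1024 ** 3, 100) === 5242  (~5 MB per sort/hash operation)
```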
I think it is good for us to have a better starting point for eoAPI, but the one thing that I want to make sure we are VERY clear about in any documentation is that these automated settings are only a starting point and that any postgres database should be actively reviewed (and note that the best settings for any instance may change over time as well!!).
I'll admit that I'm a little confused about where this discussion is going. This ticket was written to track the idea of us moving from automated tuning of the DB based on DB size at time of creation (introduced in #1) to dynamic automated tuning based on DB size throughout the DB's lifetime (ie tuning would be reactive to DB resizing).
Our current tuning logic can be found here: https://github.com/developmentseed/eoapi-cdk/blob/b7f3ce659f2e61cc9b8e6efbf991067ffb654efb/lib/database/index.ts#L133-L170
Any of these values can be overridden by user-provided DB parameters (`props.parameters`): https://github.com/developmentseed/eoapi-cdk/blob/b7f3ce659f2e61cc9b8e6efbf991067ffb654efb/lib/database/index.ts#L42-L59
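To illustrate the precedence (a simplified sketch, not the actual eoapi-cdk code): any key supplied by the user wins over the instance-size defaults.

```ts
// Simplified sketch of the override behavior: instance-size-derived defaults
// merged with user-supplied parameters, with the user values taking precedence.
const defaultParameters: Record<string, string> = {
  max_connections: "100",
  shared_buffers: "16384", // counted in 8 kB pages
};

const userParameters: Record<string, string> = {
  max_connections: "50",
};

// Spread order means any key in userParameters overrides the default.
const effectiveParameters = { ...defaultParameters, ...userParameters };
// => { max_connections: "50", shared_buffers: "16384" }
```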
> If we want to implement this we should just need to identify a more appropriate set of default settings, maybe based on pgtune as a starting point.
@hrodmn Were you aware of our current tuning functionality? Have you seen large discrepancies between what our tuning suggests and what pgtune would suggest?
@bitner I wasn't aware of this logic: https://github.com/stac-utils/pgstac/blob/858a984e2aad77d7a77cfe52ac988aedd726ad08/docker/pgstac/dbinit/pgstac.sh#L7-L16
Would this override any tuning done on the part of CDK?
The settings in pgstac.sh only affect the docker database image and should be good enough for running tests or spinning up example instances.
As you know, I am still very wary of making settings too easy, as ANYTHING we do is going to be just trying to be better than the defaults. If you are using Postgres in production, you should always periodically (NOT just when you create or resize an instance) be revisiting your settings and indexes, and the more we try to hide some of that away, the more people aren't even going to realize that this is something they should be doing.
In particular, the setting where I think any of our automated approaches completely miss the mark is that we do not require manual setting of max_connections. It is great that we allow using it as an override parameter, but this setting is sooo critical to how we do anything else that I really think we should require setting it manually.
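As a sketch of what I mean (hypothetical props, not the current eoapi-cdk interface):

```ts
// Hypothetical construct props: make max_connections an explicit, required
// input and derive the memory-related defaults from it, rather than guessing.
export interface PgStacDatabaseProps {
  /** Expected connection ceiling; every memory-related default should hang off this. */
  readonly maxConnections: number;
  /** Optional overrides for any other PostgreSQL parameter. */
  readonly parameters?: Record<string, string>;
}
```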
That is also the problem that I have with setting things via CDK: it really seems to emphasize that the settings are a "set once" kind of thing done when you create the instance. When someone does go back to do some tuning, do they have to go back and adjust the CDK? Or just make those changes on the live database?
> Were you aware of our current tuning functionality? Have you seen large discrepancies between what our tuning suggests and what pgtune would suggest?
No, I did not realize we were already tuning the parameters depending on the instance size. That's great. I still think it's a good idea to set the values using a formula for the reasons you outlined, though.
The main thing I have observed is that the default `t3.micro` instance gets completely overwhelmed by titiler-pgstac. A single web map client will push enough traffic to the database with the Lambda setup to hit the `max_connections` limit. A `t3.small` with the default settings seems able to handle it pretty well, though.
Another reason max_connections MUST be set reasonably. It is far far far better to reject a query due to max_connections than to let it go through and swamp the database memory/cpu (given that everyone's first instinct is to just keep retrying things that failed, and if something fails due to OOM, that just means the situation will get worse and worse). Adding a pooler like pgbouncer can really help here: rather than rejecting the connection outright, it can wait for a connection to become available.
That being said, expecting to be able to throw a Lambda at a micro or even a small database instance is really really asking a lot! You have something on one side that will scale like crazy as soon as there is load, trying to hit a very small static resource. This is inherently just a recipe for disaster without something like a pooler to do some traffic control. I would actually argue that if you are deploying with Lambda or anything that autoscales, a single pooler like pgbouncer is a MUST for any postgres database in production.
Currently, database parameters are hardcoded values that are based on instance memory. However, AWS supports using formulas to dynamically set these values based on instance properties (docs). This way, if an instance resizes, database parameters scale with the instance. It would be good for us to make use of these formulas to make more flexible parameters.
However, this comes with challenges, as we base parameters on other parameters. The formulas supported by AWS must be simple (only a single division or multiplication, nothing like `{DBInstanceClassMemory/1024/8/4}`). As such, we may need to figure out a way to parse the formulas to understand their meaning and then build dependent formulas based on those meanings.
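One possible way to stay within the single-operation restriction, sketched under the assumption that we keep the intent as plain numbers and only emit a formula string at the end (the helper name is hypothetical):

```ts
// Collapse the constants into a single divisor so the emitted formula only
// contains one division, e.g. 1/4 of memory in 8 kB pages => "/32768".
function memoryFractionFormula(fraction: number, unitBytes: number): string {
  const divisor = Math.round(unitBytes / fraction);
  return `{DBInstanceClassMemory/${divisor}}`;
}

memoryFractionFormula(1 / 4, 8192); // shared_buffers at 25% of memory      => "{DBInstanceClassMemory/32768}"
memoryFractionFormula(1 / 2, 8192); // effective_cache_size at 50% of memory => "{DBInstanceClassMemory/16384}"
```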