Large schema cache reloads cause 100% CPU spike when schema cache metadata is accessed.

colophonemes commented 1 year ago

Environment

PostgreSQL version: 13.10 (also tested 13.11)
PostgREST version: Known issue since v9.0, recently tested in 11.1.0 and 11.2.1
Operating system: MacOS, Docker (ECS)

Description of issue

Our production instance of PostgREST exhibits the following behaviour:

Schema cache is reloaded (either via app restart or NOTIFY pgrst reload)
Things work as normal for a few minutes
Something actually needs to look at the updated schema cache (the most reliable way to reproduce is to load the OpenAPI spec from the / endpoint, but the behaviour seems to eventually happen regardless)
PostgREST starts spinning CPU, pegging at 100% for about five mins
- If you triggered the spike by attempting to load the OpenAPI schema endpoint, we at first get a partial response (always 135,063 bytes in my testing) that loads within a few seconds, with a truncated form of the JSON (it's not a valid JSON object that contains fewer entries, but a longer response that has been truncated). Eventually this response will time out
- Subsequent reloads of the OpenAPI schema endpoint trigger a CPU spike of similar duration
- There's usually two spikes (one per OpenAPI refresh), and then by the end of the second one, the OpenAPI schema request finishes buffering and the full JSON loads; occasionally there'll be a third spike that's a bit shorter than the previous two, and the OpenAPI schema resolves after that one.
Once the OpenAPI schema request has loaded fully, subsequent requests return ~immediately, and there are no more CPU spikes, until such time as the schema cache reloads again.

Here's me reproducing the issue in a local Docker container, connected to our production DB. The time it takes for the CPU spike to calm down is shorter, presumably because my laptop is beefier than the ECS instances we run our production PostgREST on.

We've observed this issue since at least v9.0.0 (though possibly it was present earlier), through to the present day. I've just run through the testing steps above on both v11.1.0 and v11.2.1, and the behaviour is there in both.

Possibly an interaction with TimescaleDB?

My guess about why this happens in production but not dev is that we use TimescaleDB for a number of our tables that contain time-series data. This means that the number of tables in our schema changes dynamically as more data is added to parent tables, and Timescale chunks these into child tables. The child tables are stored on a _timescaledb_internal schema, e.g. _timescaledb_internal._hyper_1_12345_chunk. Some of our larger tables have upwards of 500 of these chunks attached.

Formatting and running the tablesSqlQuery from PostgREST/SchemaCache.hs results in ~350 entries against a dev database, and nearly 13,000 entries against prod. If I add AND n.nspname !~ 'timescale' to the end of that query then the prod query drops back to around 300 entries.

So, our guess is that our production PostgREST instances are choking on this massive amount of data when processing the schema cache (looking at the network graph in the image above, it looks like it's transferring about 10MB over the wire to reload the schema).

Possible mitigations

I guess that there are at least three possible mitigations that I can think of here:

Exclude child tables from the schema cache (maybe gated under a config var; not sure if this has implications for resource embedding/FK detection)
Special-case _timescaledb_internal and other Timescale schemas so they don't appear in the tables query (seems a bit gross, but maybe routes around needing to handle edge cases for the more general case of child tables)
Add a config variable that allows us to exclude specified schemas from the schema cache (probably safer + more general)

Obviously all these mitigations rely on my being correct that the root cause of the issue is the large number of child tables present in the tables query.

That said, I'm wondering if there is in fact some perf bug here? (13k rows / 10MB of data is a lot relative to most schemas, but also seems like something that should maybe be processable. But I don't have any idea about the how the combinatorics of adding additional tables scales here.)

wolfgangwalther commented 1 year ago

If I add AND n.nspname !~ 'timescale' to the end of that query then the prod query drops back to around 300 entries.

Is this timescale schema part of your search path, i.e. via db-extra-search-path? This would cause this schema to be scanned for the schema cache. If it's not in the search_path it should normally not be scanned at all.

colophonemes commented 1 year ago

Is this timescale schema part of your search path, i.e. via db-extra-search-path?

Sorry, forgot to mention that in the original post – no, we only have api in db-schemas, and then the schemas that api is an overlay on in db-extra-search-path (app_schema_1, app_schema_2 etc).

The tables on e.g. app_schema_1 etc do themselves have TimescaleDB child tables, but these aren't hosted in app_schema_1, but live on their own _timescaledb_internal schema.

user@db_host=> \d+ app_schema_1.some_table

                                       Table "app_schema_1.some_table"
┌──────────┬──────────────────────────┬───────────┬──────────┬─────────┬─────────┬──────────────┬─────────────┐
│  Column  │           Type           │ Collation │ Nullable │ Default │ Storage │ Stats target │ Description │
├──────────┼──────────────────────────┼───────────┼──────────┼─────────┼─────────┼──────────────┼─────────────┤
│ ts       │ timestamp with time zone │           │ not null │         │ plain   │              │             │
│ thing_id │ bigint                   │           │ not null │         │ plain   │              │             │
│ value    │ numeric                  │           │ not null │         │ main    │              │             │
└──────────┴──────────────────────────┴───────────┴──────────┴─────────┴─────────┴──────────────┴─────────────┘
Indexes:
    "some_table_thing_id_ts_idx" btree (thing_id, ts DESC)
Check constraints:
    "positive_value" CHECK (value > 0::numeric)
Triggers:
    ts_insert_blocker BEFORE INSERT ON some_table FOR EACH ROW EXECUTE FUNCTION _timescaledb_functions.insert_blocker()
Child tables: _timescaledb_internal._hyper_1_1_chunk,
              _timescaledb_internal._hyper_1_2_chunk,
              _timescaledb_internal._hyper_1_3_chunk,
              _timescaledb_internal._hyper_1_4_chunk,
              _timescaledb_internal._hyper_1_5_chunk,
              _timescaledb_internal._hyper_1_6_chunk,
              _timescaledb_internal._hyper_1_7_chunk,
              ... etc etc
Access method: heap

wolfgangwalther commented 1 year ago

Ah right, my bad. We only filter for extra search path schemas in the allViewsKeysDependencies query - not in the tablesSqlQuery:

https://github.com/PostgREST/postgrest/blob/f10d4139feb8ff2cd619126a14bc4df559e36e25/src/PostgREST/SchemaCache.hs#L952

vs.

https://github.com/PostgREST/postgrest/blob/f10d4139feb8ff2cd619126a14bc4df559e36e25/src/PostgREST/SchemaCache.hs#L784

This is because we might detect base tables in those views that are outside the search path (probably the case most of the time). IIRC we discussed a potential for future optimization by just querying the tables detected via the views query in the tables query. Seems like you have hit that limitation right there.

colophonemes commented 8 months ago

@wolfgangwalther Hey - just wanted to check in to see if this is something that is likely to be worked on in an upcoming release of PostgREST? If not, how feasible do you think it would be for me as a relatively inexperienced Haskell developer to dive in an try to submit a PR fixing this (and if this is the route we decide to go, do you have a preference for how a fix is implemented)?

wolfgangwalther commented 8 months ago

just wanted to check in to see if this is something that is likely to be worked on in an upcoming release of PostgREST?

No plans from me regarding this in the near future. Too many other things on my list I'd like to do.

If not, how feasible do you think it would be for me as a relatively inexperienced Haskell developer to dive in

I would say in general the area around the schema cache should be easier to get started with than other areas - because big chunks here are actually SQL, not haskell. So if you know SQL well, that should make it easier.

if this is the route we decide to go, do you have a preference for how a fix is implemented?

I have not thought in detail about this. Reducing the size of the schema cache / improving the performance of the schema cache queries is always a good thing.

I'm not exactly sure how, but one way to do this could be to merge the allViewsKeysDependencies and tablesSqlQuery queries - so that we will never have to "query views first, then pass them to the tables query".

steve-chavez commented 8 months ago

Since https://github.com/PostgREST/postgrest/pull/3213, we now log some schema cache stats:

20/Feb/2024:18:27:24 -0500: Schema cache queried in 12.7 milliseconds
20/Feb/2024:18:27:24 -0500: Schema cache loaded 264 Relations, 221 Relationships, 141 Functions, 15 Domain Representations, 45 Media Type Handlers

It would be interesting to know the metrics for your case @colophonemes.

If there's a large number of relations, perhaps we should exclude _timescaledb_internal from the schema cache?

colophonemes commented 8 months ago

@steve-chavez Heres the output from when I load using that build:

21/Feb/2024:16:23:22 +0000: Starting PostgREST 12.1 (1a141c1)...
21/Feb/2024:16:23:22 +0000: Attempting to connect to the database...
21/Feb/2024:16:23:22 +0000: Successfully connected to PostgreSQL 13.13 (Ubuntu 13.13-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, 64-bit
21/Feb/2024:16:23:22 +0000: Listening on port 3000
21/Feb/2024:16:23:24 +0000: Listening for notifications on the pgrst channel
21/Feb/2024:16:23:24 +0000: Config reloaded
21/Feb/2024:16:24:04 +0000: Schema cache queried in 640.9 milliseconds
21/Feb/2024:16:24:04 +0000: Schema cache loaded 99 Relations, 230 Relationships, 16 Functions, 0 Domain Representations, 4 Media Type Handlers

And here's some output from ps at 5s intervals while it's starting:

ELAPSED  %CPU %MEM
  00:01   2.0  0.2
  00:06   0.0  0.2
  00:11   1.6  0.3
  00:17   1.5  0.3
  00:22   1.1  0.4
  00:27   0.0  0.4
  00:32   0.0  0.4
  00:37   7.6  0.5
  00:42   1.3  0.5
# 21/Feb/2024:16:24:04 +0000: Schema cache queried in 640.9 milliseconds
  00:47 100.1  0.6
  00:52 100.1  0.6
  00:57  98.0  0.6
  01:02  98.2  0.6
  01:07 100.1  0.6
  01:12  99.1  0.6
  01:17 100.1  0.6
  01:22 100.1  0.6
  01:27 100.1  0.6
  01:32  99.1  0.6
  01:37 100.1  0.6
  01:42 100.1  0.6
  01:47  98.7  0.6
  01:52 100.1  0.6
  01:57  99.1  0.6
  02:02 100.1  0.6
  02:07 100.1  0.6
  02:12 100.1  0.6
  02:17  99.7  0.6
  02:22  98.9  0.6
  02:27  98.1  0.6
  02:32 100.1  0.6
  02:37 100.1  0.6
  02:42 100.1  0.6
  02:48 100.1  0.6
# 21/Feb/2024:16:24:04 +0000: Schema cache loaded 99 Relations, 230 Relationships, 16 Functions, 0 Domain Representations, 4 Media Type Handlers
  02:53   3.2  0.6
  02:58   0.0  0.6

steve-chavez commented 8 months ago

@colophonemes Aha. So that means the schema cache queries are not the problem since they're fast. The problem is the schema cache parsing. This was previously discussed on https://github.com/PostgREST/postgrest/issues/2450#issuecomment-1931254688.

wolfgangwalther commented 8 months ago

Aha. So that means the schema cache queries are not the problem since they're fast. The problem is the schema cache parsing.

While this is right, this does not necessarily mean that "optimizing the queries" as discussed above is wrong. IIRC, we fetch a lot more data for the schema cache - and then filter out the stuff we don't need in haskell code, right?

colophonemes commented 8 months ago

Yeah, I agree that it seems like it's the parsing rather than the DB query (indeed, I was able to extract and manually run the SQL queries myself and they were very fast), but I think I agree with @wolfgangwalther that if we could cut down the number of relations we're feeding into the parsing step, then parsing would go a lot faster.

If there's a large number of relations, perhaps we should exclude _timescaledb_internal from the schema cache?

Yeah, this seems like a straightforward solution, it could be a flag that's sort of the opposite of db-extra-search-path, giving a list of schemas that you want to exclude. But perhaps it's just better to go with the suggestion of cutting out more cruft from the tables queries directly? 🤷‍♂️

steve-chavez commented 8 months ago

@colophonemes Could you share your schema (DDL only) privately? (email on profile) Then I can work on improving the query or maybe the Haskell code.

steve-chavez commented 8 months ago

And here's some output from ps at 5s intervals while it's starting:

That time now should be clearer on main branch due to https://github.com/PostgREST/postgrest/pull/3253

colophonemes commented 8 months ago

Could you share your schema (DDL only) privately? (email on profile) Then I can work on improving the query or maybe the Haskell code.

Unfortunately I don't think that I can that easily. It's not super-straightforward to dump out the Timescale stuff in a way that's restorable on your end. I'll try to prepare a toy schema that results in the same behaviour.

steve-chavez commented 8 months ago

Did some digging using the big_schema:

$ PGRST_DB_SCHEMAS="apflora" postgrest-with-postgresql-15 -f test/io/big_schema.sql postgrest-run

28/Feb/2024:13:33:24 -0500: Starting PostgREST 12.1 (160caaf)...
...
28/Feb/2024:13:33:24 -0500: Schema cache loaded 326 Relations, 305 Relationships, 6 Functions, 0 Domain Representations, 4 Media Type Handlers
28/Feb/2024:13:33:42 -0500: Schema cache loaded in 17141.6 milliseconds

I'm not exactly sure how, but one way to do this could be to merge the allViewsKeysDependencies and tablesSqlQuery queries - so that we will never have to "query views first, then pass them to the tables query".

The problem is not related to the tablesSqlQuery, which is fast. It's only isolated to the relationships

https://github.com/PostgREST/postgrest/blob/97cd559265a22603b029893533524d8e44618f9a/src/PostgREST/SchemaCache.hs#L167

By not computing the relationships and just replacing it with a mempty the times gets reduced to:

28/Feb/2024:13:37:07 -0500: Starting PostgREST 12.1 (160caaf)...

28/Feb/2024:13:37:07 -0500: Schema cache loaded 326 Relations, 0 Relationships, 6 Functions, 0 Domain Representations, 4 Media Type Handlers
28/Feb/2024:13:37:07 -0500: Schema cache loaded in 1.4 milliseconds

Some profiling also indicates that it's not the query but the different tranformations we do on Haskell code that take a lot of time and CPU:

https://github.com/PostgREST/postgrest/blob/97cd559265a22603b029893533524d8e44618f9a/src/PostgREST/SchemaCache.hs#L163

It should be possible to do that logic on the db instead but it will take a fair amount of work.

Edit: addM2MRels and addViewM2OAndO2ORels take most of the time.

wolfgangwalther commented 8 months ago

The problem is not related to the tablesSqlQuery, which is fast. It's only isolated to the relationships

I expected the query itself to be fast. But if we fetch fewer tables, the relationship calculation in haskell should be faster, right?

PostgREST / postgrest