Room retention *increases* size of state tables.

This issue has been migrated from #9406.

We have reports that enabling room retention in a very large room (with frequent joins and leaves) actually causes the size of the state tables to increase, which is surprising.

Note that it appears that postgres is vacuuming the tables.

Also note that this is an experimental feature.

Hello, very possibly the report came from a discussion I initiated on some Matrix admin related rooms, since we were facing such an issue and trying to get to the bottom of it before opening a thread on GitHub. We are however unable to pinpoint the exact source of defect, so here are as many details as I can collect, seeking your help.

We are running a dedicated HS for bridges. Thus most rooms are initiated by appservices, have many (1-20k) local users from appservices and a handful remote users from (multiple) consuming HS (1-20).

We recently decided to enable a retention policy on the server, deleting history, including local history, older than 6 months. Our rationale was guided by our ToS clearly stating we are not entitled to storing bridged or cached data for more than 6 months, plus a need to purge old events and associated media, which have stacked TB of data from bridging high activity Telegram rooms for instance.

We are aware this might interfere with history purge on other homeservers: a seemingly non destructive operation of purging remote history elsewhere might become destructive because when paginating, the remote HS will get empty history from us. We are okay with this from a user experience perspective.

The failure

We woke up to a full drive, which filled overnight, so we were unable to respond in time and postgres was now read-only with no room for vacuuming.

Our bridges database had grown to over > 400 GB. Culprit was clearly state management, with state_groups_state growing x10 in rows and x4 in size (had not been vacuumed for a while). No state autocompressor was running at the time, since we usually wait for a couple days/weeks after big configuration changes before we switch synatainer back on.

The facts

After failing to identify anything obvious in the logs or data, not finding any open issue on the matter, we pulled the database to a high performance machine for analytics, we gathered metrics from rooms and room state in hope of uncovering a pattern.

Room count was stable.

bridges=# select count(*) from rooms;
 count 
-------
  8734

Storing state group entries was involved indeed, here the total row count. No precise metric about previous value, but we believe it was around 50M, so this is a 10-times increase overnight.

bridges=# select count(*) from state_groups_state ;
   count   
-----------
 591294362
(1 row)

After creating a couple views for investigating, we checked for the frequent issue of orphaned state groups:

bridges=# select cast(sum(case when r.room_id is null then 1 else 0 end) as float) / sum(1) from investigation_rooms_by_state_groups i left join rooms r on i.room_id=r.room_id;
      ?column?      
--------------------
 0.4645638879948503
(1 row)

bridges=# select cast(sum(case when r.room_id is null then cnt else 0 end) as float) / sum(cnt) from investigation_rooms_by_state_groups i left join rooms r on i.room_id=r.room_id;
       ?column?       
----------------------
 0.044112372240072195
(1 row)

They do indeed make up for half the rooms represented in state groups, yet only 5% of the rows, so we discarded it as a culprit and decided to come back and clean this later.

We sorted rooms by number of state group entries (not state groups). We frequently monitor those when trying to slim down our database, so we were able to identify that top 10 rooms from the previous week were now barely making top 50. Top 3 rooms were showing 50M, 50M and 40M state group entries respectively, so we decided to focus on top 1 and see if any pattern applied to the others.

This was no new room, and latest event was a week old already:

bridges=# select to_timestamp(min(received_ts)/1000), to_timestamp(max(received_ts)/1000) from events where room_id='!snip';
      to_timestamp      |      to_timestamp      
------------------------+------------------------
 2022-05-22 01:46:22+00 | 2023-12-26 10:16:56+00
(1 row)

Most events were from initially joining the room, then not much happened in terms as state, and presumably if messages were sent, they were sent prior to retention period and now deleted. None of the actual users in the room were from our own HS so we were unable to properly confirm that last hypothesis.

bridges=# select to_timestamp(received_ts/1000)::date dat, count(*) cnt from events where room_id='!snip' group by dat;
    dat     |  cnt  
------------+-------
 2022-09-26 |     1
 2022-12-07 |     1
 2023-09-23 |     1
 2023-12-24 |     2
 2023-12-26 |     2
 2023-08-03 |     2
 2022-05-22 |  5104
 2022-05-23 | 14914
 2023-06-09 |     1
 2023-02-16 |     1
 2023-10-05 |     1
 2023-07-03 |     1
 2023-09-07 |     1
(13 rows)

Also close to every state event is a invite+join, which backs our understanding:

bridges=# select type, count(*) cnt from events where room_id='!snip' group by type;
           type            |  cnt  
---------------------------+-------
 m.room.create             |     1
 uk.half-shot.bridge       |     2
 m.room.join_rules         |     1
 m.bridge                  |     2
 m.room.guest_access       |     1
 m.room.name               |     1
 m.room.power_levels       |     1
 m.room.member             | 20021
 m.room.history_visibility |     1
 m.room.avatar             |     1
(10 rows)

The aha moment was realizing that 20k invites+joins is 10k joins, thus 10k members if no one leaves, while the sum of 10k first natural integers is very close to 50M (rounding n²/2-n/2 to n² is a reasonable approximate). Supposedly there were now 1 state group per join event (confirmed), and each state groups had 1 entry per member, amounting to a terrible 50M rows.

On a bridges server, many rooms show this pattern of state events mostly comprised of invite+join and very little people leaving, either massively when initially bridging or over time. We tried to spot the pattern in other rooms.

Here we naively count events and state group entries. If the room is mostly comprised of join and invite events, with no or very little history, and our hypothesis holds, then these room will have in the order of n²/2 state group entries, with n half the number of events (half of 1 invite + 1 join), so count(events)²/8. This is a terrible approximate merely designed to check if the pattern held at all in top rooms:

bridges=# select cnt, evts, 8*cast(cnt as float)/evts/evts from investigation_rooms_by_state_groups s left join investigation_rooms_by_events e using (room_id) order by s.cnt desc limit 20;
   cnt    | evts  |      ?column?      
----------+-------+--------------------
 50155758 | 20032 | 0.9999128790101971
 50033393 | 20214 | 0.9795924284931093
 39997112 |  6479 |  7.622591763037676
 37804143 | 18137 | 0.9193872533552278
 30092162 | 19000 | 0.6668623157894736
 28028503 |  7494 | 3.9926616856157287
 22343796 | 15006 | 0.7938109042668419
 18289653 | 12247 | 0.9755196395757239
 16237140 | 11414 | 0.9970657217354616       
 11381242 | 12712 | 0.5634454503725067
 11218030 |  4734 |  4.004516402748751
  9617828 |  9132 | 0.9226461789121406
  9561422 |  9314 | 0.8817389104738831
  9500159 |  8226 |  1.123164713072573
  8018862 |  6182 |  1.678590361744646
  7773328 |  7849 |  1.009411671935376
  7466459 |  7779 | 0.9870908556416975
  7167351 |  9294 | 0.6638092868456197
(20 rows)

We were not expecting every room exhibit the pattern (ratio close to 1), but we were not expecting so many either. This was our smoking gun. Somehow, somewhere, something was causing state groups to expand to a full decompressed form.

Our interpretation

Down the rabbit hole we went. Synapse code very rarely creates state group entries. In fact it does in two different places: when a state event is stored, trying to de-duplicate reasonably by creating state group edges between state groups (the very structure that state compressors try to optimize further), and right here when purging unreferenced state groups: https://github.com/element-hq/synapse/blob/5d3850b0384c7da286031b169d9a37320b6783ea/synapse/storage/databases/state/store.py#L778

And purging unreferenced state groups is exactly what purging events does after computing a list of said groups: https://github.com/element-hq/synapse/blob/69637f8bac0959386108edeea11ba456b9334486/synapse/storage/controllers/purge_events.py#L70

It is in fact the only place where state groups are deleted and recreated except for normal event processing. And it is the very routine called by retention jobs.

We think that in some rooms at least, depending on factors we were not able to pinpoint, after purging history, the purge routine determines that some events are unreferenced and need purging, the same events that are holding the state group hierarchy, leading to the entire hierarchy being de-deduplicated, and enormous amounts of state group entries being created.

Our current status

This is as much as we can go from our understanding of synapse internals.

We did not have enough time yet to try and reproduce this on a small HS with small rooms.

We gambled on the hypothesis that this would happen when retention jobs first cleans the room history, and we should be safe compressing the state table and be on our way.

It did compress very well to about 5% of its initial size by manually running the state compressor with default settings on every room. If you remember about orphaned state groups from initial findings, these amounted to pretty much anything left and our state_groups_state table shrinked 100x. Big win. We even posited this lead to better compressed output than running the compressor against the original state groups (quite unsure about this one though). One vacuum full later we were up and running.

It did fine for a couple days. Until retention jobs kicked in, and everything is now de-deduplicated again.

We are thinking about running that database on a separate host until we find a more stable fix and are not trying to compress that table anymore.

We do need help to go further, even pointers to places in the codebase we would have missed or ideas for reproducing this in a test case.

Oh, the original report is pretty old already, no idea why I did not find it in the first place, but I surely was not the original reporter since we experienced this just weeks ago.

Sorry about spamming this issue. After a first pessimistic report, it appears that rooms that were "de-deduplicated" this time are different from the first time.

My take on this is the retention job did not finish last time, since the database crashed before it could. We were lucky that remaining rooms did not nuke our server once more. Symptoms are similar and the same pattern emerges, this time with even more entries: one room show 180M rows in state_groups_state :)

I believe we are now out of most trouble, yet I will keep a copy of the database in case I can help pinpointing the root cause.

A couple months later, this happened again on our HS, we were lucky to avoid a major downtime, so I went back down the rabbit hole of the history purging code in synapse. Here is a summary of what I found, and how I suggest we fix some of it.

The process of locating state groups to be deleted is complex and looks flawed

First, history purging happens at :

https://github.com/element-hq/synapse/blob/5624c8b961ed6a8310a2c6723ae13e854721756b/synapse/storage/databases/main/purge_events.py#L67

This is supposed to purge old events and their relationships, then return every state group to be considered first for deletion. It does this by returning every state group related to any event, including state events, that matches purge criteria, except for outliers. It also ends by removing the reference to state groups from state events matching the criteria and making them outliers. This means those same state events will never be considered in later purges (which explicitly exclude outliers).

I am unsure this is an actual issue, since associated state groups should be deleted, and we might not care about these later, but I would feel more confident if we included more state groups as candidates than leaving them potentially forever. Maybe this is also related to accumulating unreferenced state groups.

Later on, this initial list is expanded and filtered here:

https://github.com/element-hq/synapse/blob/5624c8b961ed6a8310a2c6723ae13e854721756b/synapse/storage/controllers/purge_events.py#L75

Which basically climbs up the state group tree and stops when finding state groups that are still referenced and thus should not be deleted. This is fine, except it climbs up the tree using get_previous_state_groups:

https://github.com/element-hq/synapse/blob/5624c8b961ed6a8310a2c6723ae13e854721756b/synapse/storage/databases/state/store.py#L816

From a quick overview, this does not return previous state groups but filters the state group list, keeping those that have children. This in turn breaks the code from _find_unreferenced_groups in a way that simply does not climb up the tree, leaving additional unreferenced groups, possibly forever.

Here is a reference to the separate issue about unreferenced state groups if my analysis can help there: https://github.com/matrix-org/synapse/issues/12821

I suggest:

fixing the bug in get_previous_state_groups
at least add some documentation to purging code explaining why we dereference state groups, set state events as outliers, etc.

State groups are stored indefinitely for rejected client events

When creating client events, synapse stores them very early in an UnpersistedEventContext. This is done most notably by calculate_context_info, which stores references to the event state, thus creating state groups, before the event is checked for authorization, long before it is persisted.

For instance, a client event for joining an invite-only room will generate a state group, even if it is rejected. This leads to unreferenced entries in the state group tree, which also reference unpersisted events.

I am unsure how to proceed about this. On paper, state groups are merely a structured cache, so it should do no harm except for taking up space that is never reclaimed, while changing this behavior sounds like a colossal task.

Application services weaponize this

At least, Mautrix does.

https://github.com/mautrix/go/blob/57f6cd89e3bf148e7ab497ef076e054fd506e2ab/appservice/intent.go#L91

For every bridged user, instead of checking the room permissions, Mautrix simply tries to join, then upon failure invites the user and joins again. It works fine. Except synapse stores an unreferenced state group in the tree for every bridged user, due to the denied initial join.

A bridged room state group tree is pretty straightforward, except for every user being mentioned three times: one invite, one join, and one unreferenced branch. Due to being unreferenced, it is not considered for deletion when purging the room. However, when removing the main branch, all those state groups are de-delta-ed.

https://github.com/element-hq/synapse/blob/5d3850b0384c7da286031b169d9a37320b6783ea/synapse/storage/databases/state/store.py#L778

So after a history purge, a bridged room has one unreferenced de-delta-ed state group for every room member. This grows quadratically as exposed above, and large rooms may create dozens or hundreds of millions of state group lines in a couple of minutes when purging history due to this, effectively taking the database down.

I suggest:

for now, stop cleaning state groups when purging history, this leaves unreferenced entries for cleaning scripts to scrub, but I would rather have rotting data than exploding data,
review the root causes above, as this is hardly an issue with Mautrix and could be exploited in many other ways, including intentionally.

Reporting back on this issue, this kept happening to us and spamming our production database, so we disabled any purging logics from our main homeserver during the summer, until we took the time to devise a reasonable fix.

In order to fix the issue, I actually tried and went the easy way: we are facing an issue with cached state groups that are unreferenced by any event (because none is persisted), and which do reference a previous state group, leading to them being de-delta-ed when the parent state group is deleted.

I figured this only happens in the room purging code, so fixing the issue right there, and deleting these obviously state state groups not at the source, but before they cause any damage was sound. It prevents the issue, provides a decent mechanism to regularly spot these nasty groups and remove some rotting data without scanning the entire database. Most of all it does not require I play around with the event persistence code, which looked daunting.

We will be testing this on our homeserver for the next couple of days, then I'll report back again and provide a PR. Until then, you'll find our work at: https://forge.tedomum.net/tedomum/synapse/-/compare/b066b3aa04248260bd0cfb2ac3569a6e4a3236ea...fix-exploding-state-groups?from_project_id=97&straight=false

Reporting back one more time after a week of running the "fix". It seems my understanding was incomplete, or my code has issues, because the bug manifested again last night and took down our homeserver for a couple hours after filling the disk with state group lines. I am reviewing the purging code again, and it looks like it is still missing some comments, I am unsure about some of the logic. I am however building an intuition that cleaning state groups just isn't worth it.

On one hand, I understand that if we are to clean them, we need to do so when purging, because there is no relevant test to later detect unreferenced state groups with certainty, due to state groups being inserted before events are persisted to database.
On the other hand, I understand that except for a couple special cases with outlier events (which I am unsure about), state events are not deleted when purging, and most state groups are untouched anyway. At best we are saving a couple state group lines every time we purge history, at the cost of one of the most complicated bits of synapse, which does backfire at us due to the even more complicated de-delta-ing part. This is all in favor of a more radical decision: let's stop deleting state groups when we purge history, those are rather cheap anyway, grow almost linearly with the number of state events, and compress well using the state-compressor.

This is however a drastic, no-return decision, and I am unable to weigh things properly.

@erikjohnston sorry for mentioning you, from my lurking on synapse dev channels and Github issues, it sounds like you are the most knowledgeable on the matter. Do you think we should keep fixing the purging code and properly delete state groups while avoiding edge cases that potentially insert millions of new lines? or should we just simplify things and stop removing state groups when we purge history?

Either way, I will keep working on this the next couple of weeks, and will provide a PR as soon as I have tested a working solution.

Hey, this is actually something we've been meaning to deal with lately, though haven't been able to carve out the time just yet. Let me brain dump my working understanding of the issue.

There are some quite complex interactions between deleting history, state groups, and event persistence. The major blocking issue with fixing the current code is that there is actually a nasty race when figuring out whether a state group is unreferenced or not: when we persist new state groups they start off as unreferenced and then we persist events that reference them. Changing the deleting state group logic to also delete unreferenced state groups runs a much greater risk of accidentally deleting state groups that are "in-flight" (i.e. currently unreferenced but are about to be). I have some thoughts on how we could delete state groups safely, but its non-trivial.

Once we can safely delete state groups, then I think checking if a state group that is about to be de-deltaed is actually referenced should fix this issue.

One thing that we have seen repeatedly is that using https://github.com/erikjohnston/synapse-find-unreferenced-state-groups and then deleting the matching state groups reclaims a lot of disk space. (I wouldn't necessarily recommend doing so though)

On the other hand, I understand that except for a couple special cases with outlier events (which I am unsure about), state events are not deleted when purging, and most state groups are untouched anyway. At best we are saving a couple state group lines every time we purge history, at the cost of one of the most complicated bits of synapse, which does backfire at us due to the even more complicated de-delta-ing part. This is all in favor of a more radical decision: let's stop deleting state groups when we purge history, those are rather cheap anyway, grow almost linearly with the number of state events, and compress well using the state-compressor.

It very much depends on the size and complexity of the room. The state compressor helps a lot, but state can still be one of the major uses of disk space. Certainly we have found that deleting state groups in large rooms does lead to significant disk reduction. The reason for this is that while the state events themselves aren't deleted, we can delete the mapping of event ID to state at that event (which is what state groups actually are).

Thanks to both of you for the replies.

The major blocking issue with fixing the current code is that there is actually a nasty race when figuring out whether a state group is unreferenced or not: when we persist new state groups they start off as unreferenced and then we persist events that reference them.

This is indeed what's preventing from properly deleting these stale state groups, because we can never be certain they are in fact unreferenced.

One thing that we have seen repeatedly is that using https://github.com/erikjohnston/synapse-find-unreferenced-state-groups and then deleting the matching state groups reclaims a lot of disk space. (I wouldn't necessarily recommend doing so though)

We are indeed running this very specifically for the worst rooms as a last resort.

Once we can safely delete state groups, then I think checking if a state group that is about to be de-deltaed is actually referenced should fix this issue.

I have been thinking about this a lot lately, and I want to suggest another way of fixing things. From my understanding, the issue is twofold: 1. there are unreferenced state groups in the database, which we can never properly delete, and which sometimes make up such a structure that they get massively de-delta-ed (like here, when an isolated unreferenced state group is generated for every new user joining a Mautrix bridge room) 2. the purge code, and pretty much any code trying to maintain and delete state groups is very complex, making things hard to reason about and patch (the naive patch I posted above took me hours to devise and does not work, which cost me some more hours to understand why).

We definitely need to fix the race between persisting state groups and the events that generated them, possibly by changing the core logics of event persistence, or using database transactions for instance. Once this (admittedly huge work) is done, nothing stops us from switching to a GC for deleting state groups.

Switching to garabage collecting would allow us to keep the state group deletion code simpler (just delete unreferenced state groups instead of extrapolating from the list of purged events for instance) and closer to the rest of state group related code. We could remove anything related to state groups from the room purging code, and maybe explicitely trigger the GC after a room or room history is purged. The GC could even embed some of the state compression logic.

In my mental model, state groups are a persistent cache of state history. I think that managing a cache data structure with a GC does make sense. Would you agree with my analysis? and in that case, would you agree we work on the race first, then switch to a GC model? If so, I will focus my effort on helping with the race first, making sure we only persist referenced state groups.

element-hq / synapse