Open poettig opened 4 months ago
I believe we have the same problem. We noticed increased CPU usage after the upgrade from 1.104.0 to 1.105.1. This occurs occasionally for a few minutes.
Homeserver: phys.ethz.ch
Synapse Version: 1.105.1
Installation Method: pip
Database: postgresql-15 15.6-0+deb12u1
Workers: Multiple workers
Presence: disabled
Platform:
+phd-matrix:~# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
+phd-matrix:~# uname -a
Linux phd-matrix 6.1.0-12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.52-1 (2023-09-07) x86_64 GNU/Linux
Same here on a semi-large homeserver deployment. Issues started with 1.105.0, with get_auth_chain_difference_chains running for minutes instead of seconds. It persisted through 1.105.1:
This evening I took a chance and reverted #17044 locally to see if it would help, but then federation breaks, so don't do it:
File "synapse/storage/databases/main/events.py", line 464, in _persist_events_txn
self._persist_event_auth_chain_txn(txn, [e for e, _ in events_and_contexts])
File "synapse/storage/databases/main/events.py", line 562, in _persist_event_auth_chain_txn
self._add_chain_cover_index(
File "synapse/storage/databases/main/events.py", line 779, in _add_chain_cover_index
for links in EventFederationStore._get_chain_links(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: type object 'EventFederationStore' has no attribute '_get_chain_links'
One more update: by reverting #17044, but keeping the class method in place for the security fix 55b0aa847a61774b6a3acdc4b177a20dc019f01a, I now have a Synapse that is back to normal for get_auth_chain_difference_chains:
From what I have spotted in #17044 so far:
The first usage of the query in the previous code:
In the new code, as the yield is at the end, I think step 2 and step 3 are switched (so first building links, then running difference_update, then step 2 from above).
On the second usage, the difference_update is inside the inner set_to_chain loop https://github.com/element-hq/synapse/pull/17044/files#diff-1f5d8bffadd3271e42c2f6a66474bbf5e3e6694b009aa616b5f5433506217ff1L621, while with the new code it's running at the end of the outer loop and using links as its parameter.
I am not sure which of the two usages is causing the long-running queries, but reverting both to the earlier version helped to get back to normal. From my limited understanding, at least the yield and the difference_update in the new class method probably need to be in the reverse order to get closer to the old implementation.
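To make the ordering point concrete, here is a toy sketch of how the position of a yield relative to a set update changes what the consumer sees. It is not Synapse code: the function names and data are hypothetical, and it does not establish which ordering Synapse needs. It only shows that code placed after a yield does not run until the consumer asks for the next item.

```python
from typing import Iterator, List, Set


def update_then_yield(pending: Set[int]) -> Iterator[int]:
    """Remove the item from the shared set before handing it to the consumer."""
    while pending:
        item = min(pending)  # deterministic pick for the demo
        pending.discard(item)
        yield item


def yield_then_update(pending: Set[int]) -> Iterator[int]:
    """Hand the item to the consumer first; the removal runs only on resume."""
    while pending:
        item = min(pending)
        yield item
        pending.discard(item)


def observe(make_generator, start: Set[int]) -> List[List[int]]:
    """Record what the shared set looks like each time the consumer runs."""
    pending = set(start)
    seen: List[List[int]] = []
    for _ in make_generator(pending):
        seen.append(sorted(pending))
    return seen


print(observe(update_then_yield, {1, 2}))  # consumer never sees the current item
print(observe(yield_then_update, {1, 2}))  # consumer still sees the current item
```

If the consuming loop reads or modifies the shared set between iterations, the two orderings can therefore behave differently.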
For me, usage jumped sharply after v1.105.1, completely clogging the database.
It looks like the refactoring missed that the chains added by _materialize to the chains dict should be removed from chains_to_fetch, so I guess it keeps requesting the same chains over and over, leading to a quadratic explosion.
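A toy model of that effect, for illustration only. This is not the Synapse implementation: `fetch_reachable`, `rows_fetched`, and the linear chain graph are all hypothetical. It assumes each fetch returns links for the requested chain plus every chain reachable from it, and compares draining the pending set with and without dropping the chains that were already returned.

```python
from typing import Dict, Set

# Invented data: chain k links to chain k + 1, for k = 1..N.
N = 50
LINKS: Dict[int, Set[int]] = {
    k: ({k + 1} if k < N else set()) for k in range(1, N + 1)
}


def fetch_reachable(chain_id: int) -> Dict[int, Set[int]]:
    """Stand-in for one recursive query: links for `chain_id` and
    for every chain reachable from it."""
    result: Dict[int, Set[int]] = {}
    stack = [chain_id]
    while stack:
        current = stack.pop()
        if current not in result:
            result[current] = LINKS[current]
            stack.extend(LINKS[current])
    return result


def rows_fetched(start: Set[int], drop_materialized: bool) -> int:
    """Count rows returned while draining the pending set."""
    pending = set(start)
    chains: Dict[int, Set[int]] = {}
    rows = 0
    while pending:
        chain_id = min(pending)  # deterministic pick for the demo
        pending.discard(chain_id)
        result = fetch_reachable(chain_id)
        rows += len(result)
        chains.update(result)
        if drop_materialized:
            # Chains that already came back need no further fetch.
            pending.difference_update(result)
    return rows


all_chains = set(range(1, N + 1))
print(rows_fetched(all_chains, drop_materialized=True))   # one fetch covers everything
print(rows_fetched(all_chains, drop_materialized=False))  # every chain is re-fetched
```

In this model the first call returns 50 rows and the second 1275 (50 + 49 + ... + 1), which is the kind of quadratic growth described above.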
I've also partially reverted #17044 with apparent success. This is the diff:
I see similar behavior since v1.105.0 on my synapse grafana dashboard.
I've also noticed that disk writes are now constantly at 100-200MB/s, which causes the postgres service to accumulate multiple GBs of disk writes in a few minutes (as you can see in this screenshot made in iotop).
Description
Since upgrading to v1.105.1 because of the security patch, we are seeing very high transaction times specifically for the transaction "get_auth_chain_difference_chains". In our case, we directly upgraded from v1.102.0 to v1.105.1, but heard from others with the same problem after upgrading from v1.104.0.
This could be a regression caused by https://github.com/element-hq/synapse/pull/17044, which modified code called through several indirections by "get_auth_chain_difference_chains", or a regression caused by the security patches.
Steps to reproduce
Homeserver
kit.edu
Synapse Version
1.105.1
Installation Method
Debian packages from packages.matrix.org
Database
postgresql 13.14-0+deb11u1
Workers
Multiple workers
Platform
Synapse Main, Workers and the database all run in virtual machines.
Configuration
Presence is enabled.
Relevant log output
Anything else that would be useful to know?
No response