element-hq / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://element-hq.github.io/synapse
GNU Affero General Public License v3.0

Memory leak in synchrotron worker #17791

Open kosmos opened 4 days ago

kosmos commented 4 days ago

Description

I have the following synchrotron worker configuration:

# Sync initial/normal
location ~ ^/_matrix/client/(r0|v3)/sync$ {
  proxy_pass https://$sync;
}

# Synchrotron
location ~ ^/_matrix/client/(api/v1|r0|v3)/events$ {
  proxy_pass https://synapse_sync;
}

# Initial_sync
location ~ ^/_matrix/client/(api/v1|r0|v3)/initialSync$ {
  proxy_pass https://synapse_initial_sync;
}

location ~ ^/_matrix/client/(api/v1|r0|v3)/rooms/[^/]+/initialSync$ {
  proxy_pass https://synapse_initial_sync;
}

And the following cache settings for this worker:

event_cache_size: 150K
caches:
  global_factor: 1
  expire_caches: true
  cache_entry_ttl: 30m
  sync_response_cache_duration: 2m
  cache_autotuning:
    max_cache_memory_usage: 2048M
    target_cache_memory_usage: 1792M
    min_cache_ttl: 1m
  per_cache_factors:
    stateGroupCache: 1
    stateGroupMembersCache: 5
    get_users_in_room: 20
    get_users_who_share_room_with_user: 10
    _get_linearized_receipts_for_room: 20
    get_presence_list_observers_accepted: 5
    get_user_by_access_token: 2
    get_user_filter: 2
    is_support_user: 2
    state_cache: 5
    get_current_state_ids: 5
    get_forgotten_rooms_for_user: 5

All other worker types run without problems, but the synchrotron workers leak memory until all available memory is exhausted.

It seems that the cache_autotuning settings are not working. The environment variable PYTHONMALLOC=malloc is set at the operating system level.
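
For reference, a minimal sketch of the thresholds that the cache_autotuning block above describes, assuming the documented behaviour that eviction starts once usage exceeds max_cache_memory_usage and continues until it drops to target_cache_memory_usage. The helpers parse_size, check_autotuning and should_evict are hypothetical names written for this illustration; they are not Synapse's own code.

```python
import re

# Binary multipliers for the size suffixes used in the config above (assumed).
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(value):
    """Convert a size string such as '2048M' into bytes (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([KMG]?)", value.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def check_autotuning(cfg):
    """Return a list of problems found in a cache_autotuning mapping."""
    required = (
        "max_cache_memory_usage",
        "target_cache_memory_usage",
        "min_cache_ttl",
    )
    missing = [key for key in required if key not in cfg]
    if missing:
        return ["missing keys: " + ", ".join(missing)]
    problems = []
    if parse_size(cfg["target_cache_memory_usage"]) >= parse_size(
        cfg["max_cache_memory_usage"]
    ):
        problems.append(
            "target_cache_memory_usage should be below max_cache_memory_usage"
        )
    return problems


def should_evict(allocated_bytes, cfg):
    """True once allocated memory exceeds the configured maximum."""
    return allocated_bytes > parse_size(cfg["max_cache_memory_usage"])


cfg = {
    "max_cache_memory_usage": "2048M",
    "target_cache_memory_usage": "1792M",
    "min_cache_ttl": "1m",
}
print(check_autotuning(cfg))
print(should_evict(3 * 1024**3, cfg))
```

With the values from this issue the check reports no problems, so the settings themselves look consistent; that points toward the allocator or the measurement of memory use rather than the thresholds.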

As far as I can tell, the problem appeared after updating to Synapse 1.114 and is still present in 1.116.

Steps to reproduce

To reproduce the problem, you need a homeserver with a heavy load and dedicated synchrotron workers.

Homeserver

Synapse 1.116.0

Synapse Version

Synapse 1.116.0

Installation Method

Docker (matrixdotorg/synapse)

Database

PostgreSQL

Workers

Multiple workers

Platform

-

Configuration

No response

Relevant log output

-

Anything else that would be useful to know?

No response

anoadragon453 commented 1 day ago

Hi @kosmos, thanks for filing an issue. Can I just double-check that you're using jemalloc in your configuration? Doing so is required for the cache_autotuning option.
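
One rough way to check this on Linux is to look for jemalloc among the libraries mapped into the worker process. The sketch below assumes a Linux /proc filesystem and that the library's file name contains "jemalloc"; the function names are made up for this example. A match shows the library is loaded, which is a strong hint but not proof that every allocation goes through it.

```python
from pathlib import Path


def jemalloc_in_maps(maps_text):
    """Return True if any mapped file in a /proc/<pid>/maps dump mentions jemalloc."""
    return any("jemalloc" in line for line in maps_text.splitlines())


def process_uses_jemalloc(pid="self"):
    """Read /proc/<pid>/maps (Linux only) and look for a jemalloc mapping."""
    return jemalloc_in_maps(Path(f"/proc/{pid}/maps").read_text())


# Example with a shortened, made-up maps excerpt.
sample = (
    "7f1c2a000000-7f1c2a021000 r-xp 00000000 08:01 123 "
    "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2\n"
    "7f1c2b000000-7f1c2b1c0000 r-xp 00000000 08:01 456 "
    "/usr/lib/x86_64-linux-gnu/libc.so.6\n"
)
print(jemalloc_in_maps(sample))
```

Inside the container, calling process_uses_jemalloc with the PID of the synchrotron worker would run the same check against the live process.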

kosmos commented 1 day ago

@anoadragon453 I'm using the official docker image, and this feature (jemalloc) is enabled by default.

anoadragon453 commented 20 hours ago

@kosmos Did you upgrade from 1.113.0 before being on 1.114.0? I don't see any changes that are particularly related to caches in 1.114.0.

Around the time of the memory ballooning, are you seeing lots of initial syncs at once? Those requests are known to be memory intensive, especially for users with a large number of rooms.

Do you have monitoring set up for your Synapse instance? If so, could you have a look at the Caches -> Cache Size graph around the time of the memory ballooning to see what cache might be overinflating? I'm also happy to poke through your Grafana metrics if you can make them privately or publicly available. Feel free to DM me at @andrewm:element.io if you'd like to share them via that route.
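
If no Grafana dashboard is to hand, a similar ranking can be approximated from the worker's raw Prometheus metrics output. This is a sketch only: the metric name synapse_util_caches_cache_size and its name label are assumptions that should be checked against what the worker actually exports, and largest_caches is a hypothetical helper. The parser is deliberately simple and would not handle label values that contain a closing brace.

```python
import re

# Matches one labelled sample line of the Prometheus text format,
# e.g. some_metric{name="x"} 12.0
_SAMPLE = re.compile(
    r'^(?P<metric>[A-Za-z_:][\w:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def largest_caches(metrics_text, metric="synapse_util_caches_cache_size", top=5):
    """Return the `top` (cache name, size) pairs, largest first."""
    sizes = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP and TYPE comment lines
        match = _SAMPLE.match(line)
        if match is None or match.group("metric") != metric:
            continue
        labels = dict(_LABEL.findall(match.group("labels")))
        name = labels.get("name")
        if name is not None:
            sizes[name] = float(match.group("value"))
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:top]


# Made-up sample values, for illustration only.
sample_metrics = """# HELP synapse_util_caches_cache_size placeholder help text
synapse_util_caches_cache_size{name="stateGroupCache"} 150.0
synapse_util_caches_cache_size{name="get_users_in_room"} 2000.0
other_metric{name="ignored"} 9999
"""
for cache_name, size in largest_caches(sample_metrics):
    print(cache_name, size)
```

Running this against snapshots taken a few minutes apart while memory is growing would show which cache, if any, keeps increasing.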