dunglas / mercure

🪽 An open, easy, fast, reliable and battery-efficient solution for real-time communications
https://mercure.rocks

Subscription API causes memory overshoot #463

Closed: adeelnawaz closed this issue 3 years ago

adeelnawaz commented 3 years ago

Hi. I'm using Mercure's official Docker image with a few customizations in the Caddyfile. Without the subscription API turned on, the memory usage is very low. But as soon as I enable the subscription API, the memory usage jumps to 100% and the container crashes. Container and load specs are as follows:

CPU units: ~680 (t3a.medium EC2 instance, 2 vCPUs, shared by one Mercure task with two other tasks)
Memory: 256 MB
Subscribers: ~50
Topics: ~5
Messages: 5/minute (avg)

The memory graph attached shows the memory used when the subscription API is enabled and when it's disabled. From 10:05 to 11:45 is the window when the subscription API is disabled. From 07:00 to 10:00 and from 11:45 onwards are the windows when the subscription API is enabled. The dips you see in those windows are actually crashes, not low memory usage. The memory consumption is constantly at 100% while the container is running.

[Memory usage graph: Mercure memory]

Unfortunately there's nothing in the CloudWatch logs before the crash happens, and the reason given for the task being stopped in the ECS console (strangely enough) is not out-of-memory. It says that the essential container in the task exited.

adeelnawaz commented 3 years ago

I have changed the memory from 256 to 512, and now the task seems stable. But still, is the memory overuse justified, or could there be a memory leak?

dunglas commented 3 years ago

Could you tell me if the memory usage increases over time? The subscription API needs more memory but it shouldn't be that much. Could you also provide me a memory profile? The steps to collect a profile are described here: https://mercure.rocks/docs/hub/debug
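
For reference, profiles can be collected with commands along these lines; this assumes the Caddy admin endpoint listens on the default localhost:2019 (the same commands are used later in this thread):

go tool pprof http://localhost:2019/debug/pprof/heap                  # in-use memory
go tool pprof http://localhost:2019/debug/pprof/allocs                # allocations
go tool pprof http://localhost:2019/debug/pprof/profile?seconds=30    # 30-second CPU profile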

adeelnawaz commented 3 years ago

The memory usage was at 100% right after the deployment, which was causing the container to crash. The frontend app had around 30-50 users online at that time, so as soon as the Mercure service was up, it had to face its regular load. I'll collect a debug profile, though it won't be on 256 MB of memory, since I cannot afford a crash :)

daniboyBr commented 3 years ago

How do you enable the subscription API in the Docker container? I am using the base docker-compose.yml file from the Mercure documentation, but I don't know if it is correctly enabled.

Here is my docker-compose file:

[screenshot of docker-compose.yml]

And what is the correct way to get a subscriber's payload, the subscription topics, and the subscribers of a topic?
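
A minimal sketch of enabling it with the official Docker image is to append the subscriptions directive to the bundled Caddyfile through the MERCURE_EXTRA_DIRECTIVES environment variable; the variable names below follow the official docker-compose example, so double-check them against the image version you run:

# docker-compose.yml (excerpt, sketch only)
services:
  mercure:
    image: dunglas/mercure
    environment:
      MERCURE_PUBLISHER_JWT_KEY: '!ChangeMe!'
      MERCURE_SUBSCRIBER_JWT_KEY: '!ChangeMe!'
      # appended inside the mercure block of the image's Caddyfile
      MERCURE_EXTRA_DIRECTIVES: subscriptions

Once enabled, the subscription API is exposed under /.well-known/mercure/subscriptions (all subscriptions) and /.well-known/mercure/subscriptions/{topic} (subscriptions for one topic); both endpoints require a subscriber JWT authorized for those URLs.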

divine commented 3 years ago

Could you tell me if the memory usage increases over time? The subscription API needs more memory but it shouldn't be that much. Could you also provide me a memory profile? The steps to collect a profile are described here: https://mercure.rocks/docs/hub/debug

Hello @dunglas,

I'm having the same problems: memory increases over time and the CPU is constantly high.

I'm not sure if it's related to Mercure itself (the new 0.11) or to another Caddy plugin (Vulcain) having memory leaks, since we moved away from the standalone Mercure server (based on the old 0.10 version).

Here are the debug profiles:

pprof.caddy.samples.cpu.001.pb.gz (30 seconds)
pprof.caddy.samples.cpu.002.pb.gz (90 seconds)
pprof.caddy.samples.cpu.003.pb.gz (600 seconds)

pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz (alloc_space)
pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz (inuse_space)
pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.003.pb.gz (inuse_space)
pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.004.pb.gz (alloc_space)
pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.005.pb.gz (inuse_space)
pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.006.pb.gz (inuse_space)
pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.007.pb.gz (alloc_space)

debug.zip

The difference between these debug profiles is about 5-10 minutes.

I will post more debug profiles tonight, when the memory is full and the CPU is using all its cycles.

Thanks!

Edit: basically this is probably the reason for the slowdown:

For each connected client, this is written to the log, and the CPU has a hard time handling all of it...

{"level":"debug","ts":1616243607.2544997,"logger":"http.handlers.mercure","msg":"Subscriber has not subscribed to this update","subscriber": {}}

dunglas commented 3 years ago

It's because you're in debug mode. Can you try to disable it to see if the situation improves?

divine commented 3 years ago

It's because you're in debug mode. Can you try to disable it to see if the situation improves?

@dunglas the reason for enabling debug mode was only to dump some debug profiles. Indeed, it's a little bit better with debug mode disabled and the error level set to fatal, but the situation still isn't good at all.

Almost two server crashes in the last two days. Caddy doesn't get killed by the system while taking all CPU cycles and memory; it just sits there, waiting to crash the server...

Also, memory continues to increase more and more, new debug profiles: debug_2.zip

The previous 0.10 version worked stably for at least a year; this new version looks like it has some issues :(.

Thanks!

dunglas commented 3 years ago

Can you try to use the legacy build so we'll see whether it's related to the Caddy module or to a change in the library itself? Thanks for the profiles, I'll take a look.

divine commented 3 years ago

Can you try to use the legacy build so we'll see whether it's related to the Caddy module or to a change in the library itself? Thanks for the profiles, I'll take a look.

Sure, I will move back to legacy build and report back with the result.

Also, would like to get some debug profiles of the server when it's getting close to the crash with filled memory and CPU cycles.

Thanks!

adeelnawaz commented 3 years ago

Hi, sorry I couldn't provide you with a debug profile, but for me the issue (the original issue) arose when the subscription API was enabled. With it disabled, hardly 10% of the memory was used.

The client was getting anxious, so I had to replace Mercure with another service, but I can dig out the Dockerfile and give you the ECS specs if that helps recreate the issue.

divine commented 3 years ago

Just a quick update: it definitely looks like an issue with https://github.com/caddyserver/cache-handler, which was installed and enabled as well.

The actual problem with the Caddy server is that it's hard to say which plugin is having an issue. It has been running fine for two days now.

Thanks!

divine commented 3 years ago

@dunglas 👋,

There is definitely an issue. I'm not sure if it's related to the Caddy version or to the latest changes, but it's still going crazy every two days. I just didn't have time to take a closer look and dump profiles again.

Caddy doesn't get killed; it just gets stuck until it is restarted. There are about 2-3M notifications going through every day, so I think it's a bit heavy, which might be one of the reasons why normally nobody notices it.

I have just dumped debug profiles, can you please take a look?

Used:

go tool pprof http://localhost:2019/debug/pprof/allocs
go tool pprof http://localhost:2019/debug/pprof/heap
go tool pprof http://localhost:2019/debug/pprof/profile?seconds=30

debug_profiles.zip

Thanks!

torchello commented 3 years ago

I can confirm there is a memory usage increase over time without subscriptions on the latest version (I didn't use previous versions, so I'm not sure if it's just the latest one or not). [memory usage graph attached]

dunglas commented 3 years ago

@torchello could you provide me with a profile? Don't hesitate to contact me on Slack (symfony.com/slack) or by email if you need help creating the profiles.

torchello commented 3 years ago

@dunglas Please find all non-empty profiles attached (+ a full goroutine stack dump): debug_profiles.zip.

Let me know if you need any additional info. Fingers crossed for an easy fix. I've also noticed that the memory usage growth pattern is more or less the same after every deploy: [memory usage graph attached]

dunglas commented 3 years ago

Thanks for the profiles @divine and @torchello. According to them, there are two main sources of memory consumption:

  1. BoltDB (we use the bbolt fork)
  2. The logger (Zap), and especially the calls to zap.Object() when logging a subscriber

Neither issue looks directly related to the Mercure code base (but I'll double-check); they should probably be reported to the upstream projects.

For the record, Bolt is used for the "history" feature. If this feature isn't used, Bolt can be disabled by using the "local" transport. (Also, the managed version of Mercure uses another transport, which is probably why I didn't notice this before.)
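
For instance, something along these lines in the Caddyfile's mercure block (a sketch only; adapt it to your setup, and with the official Docker image the directive can also be passed through MERCURE_EXTRA_DIRECTIVES):

mercure {
    # keep updates in memory only: no history, no BoltDB file on disk
    transport_url "local://local"
    publisher_jwt {env.MERCURE_PUBLISHER_JWT_KEY}
    subscriber_jwt {env.MERCURE_SUBSCRIBER_JWT_KEY}
}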

Regarding the logger, I'll see what we can do.

dunglas commented 3 years ago

Is the process crashing, or is the memory consumption just increasing over time? If it doesn't crash, then this is intended:

Bolt uses a memory-mapped file so the underlying operating system handles the caching of the data. Typically, the OS will cache as much of the file as it can in memory and will release memory as needed to other processes. This means that Bolt can show very high memory usage when working with large databases. However, this is expected and the OS will release memory as needed. Bolt can handle databases much larger than the available physical RAM, provided its memory-map fits in the process virtual address space. It may be problematic on 32-bits systems.

https://github.com/etcd-io/bbolt#caveats--limitations and https://github.com/etcd-io/bbolt/issues/234

dunglas commented 3 years ago

@divine's profiles are a bit different: a lot of memory is used by Prometheus. Are you using it? If not, maybe you can disable it?

divine commented 3 years ago

Hello @dunglas,

a lot of memory is used by Prometheus

It's not enabled. Maybe it's turned on by default when debug mode is enabled?

Also, at the start I had the same feeling that the history was the reason for the issues, so I changed the transport to disable the history:

transport_url "local://local"

Is the process crashing

It's crashing. However, before, it didn't crash but got "stuck" (no reply to any request on Caddy) when used on the same server as API Platform. Probably due to the high load it didn't realize when to crash 😄.

I have moved Mercure to a different server and it's crashing every 2-3 hours.

I think today I'll be able to test the legacy version and see whether it crashes like the Caddy version does.

Thanks!

divine commented 3 years ago

Can you try to use the legacy build so we'll see whether it's related to the Caddy module or to a change in the library itself? Thanks for the profiles, I'll take a look.

I deployed the legacy build yesterday and was surprised that it hasn't crashed yet, nor is the memory increasing; it stays the same all the time.

Interesting... I'll stick with the legacy version then. Using Mercure on the same server as API Platform was great, but the issues it was having took all the joy out of it.

I will report back if the legacy build starts crashing or having issues.

Thanks!

dunglas commented 3 years ago

Thanks for the feedback @divine. When looking at the profile, I don't understand how it's related to Caddy. However, as Prometheus metrics are enabled by default with Caddy but not with the legacy build, I wonder if they could be the root cause. If the legacy build is stable, could you try to enable Prometheus support for it to see if it starts crashing? Thanks!!

divine commented 3 years ago

Thanks for the feedback @divine. When looking at the profile, I don't understand how it's related to Caddy. However, as Prometheus metrics are enabled by default with Caddy but not with the legacy build, I wonder if they could be the root cause. If the legacy build is stable, could you try to enable Prometheus support for it to see if it starts crashing? Thanks!!

@dunglas you're welcome, hopefully, this will be fixed.

I thought Prometheus was disabled by default in both versions?

I have enabled metrics now and will report whether it crashes.

However, after running for about 5 minutes I noticed the memory started increasing again.

At first, when I checked the metrics endpoint, the response was about 1 MB, but it's currently close to 2 MB and increasing:

# HELP mercure_subscribers The current number of running subscribers
# TYPE mercure_subscribers gauge

mercure_subscribers{topic="/users/5efabc1c90dce00ca760624b"} 0
mercure_subscribers{topic="/users/5f3cc41845ce9e209b3a66f1"} 1
mercure_subscribers{topic="/users/5f6cebb5f272d8415e3e97c1"} 1
mercure_updates_total{topic="/users/6084516572c99740d4121609"} 2
mercure_updates_total{topic="/users/608451721f04bd439543fcf0"} 1

and it goes up to 20k lines of "mercure_updates_total".

I think this is probably the cause; you can reproduce it by sending updates to many different topics, and it will crash for sure 👍
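
A rough way to watch the label cardinality grow (assuming the Prometheus metrics are served by the Caddy admin endpoint on localhost:2019, as in a default setup; adjust the address otherwise):

# number of distinct per-topic mercure_updates_total series currently exposed
curl -s http://localhost:2019/metrics | grep -c '^mercure_updates_total'
# size of the whole metrics payload in bytes
curl -s http://localhost:2019/metrics | wc -c

As the per-topic series apparently never get dropped, both numbers only go up as new topics appear.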

I have both versions installed and can swap; just let me know if I can somehow help with debugging and finding the root cause.

Thanks!

dunglas commented 3 years ago

I think you correctly identified the problem! AFAIK Prometheus is automatically enabled with Caddy.

We need to add an option to not collect metrics per topic, and probably enable this behavior by default. I'll work on this.

divine commented 3 years ago

JFYI, after running for 2 hours, the metrics endpoint response is about 11 MB and memory continues to increase:

mercure_subscribers: 7356
mercure_subscribers_total: 7356
mercure_updates_total: 145188

Quite interestingly, mercure_subscribers entries are kept even though the user isn't subscribed anymore:

mercure_subscribers{topic="/users/60846902bc92f800d72b6c47"} 0
mercure_subscribers_total{topic="/users/60846902bc92f800d72b6c47"} 1
mercure_updates_total{topic="/users/60846902bc92f800d72b6c47"} 8

Hope this helps as well!

Thanks!

torchello commented 3 years ago

@dunglas Thanks for the tips. Adding the transport_url "local://local" directive helps reduce the memory usage to ~60 MB: [memory usage graph attached]

torchello commented 3 years ago

Actually, it's still growing, but much more slowly: [memory usage graph attached]

divine commented 3 years ago

Actually, it's still growing, but much more slowly:

Hello,

Are you using the Caddy version or legacy?

If it's the Caddy version, then you need to explicitly disable metrics, which are enabled by default:

metrics_enabled false

At least until a new fix is pushed; I'm not sure if that extra configuration would help.

Thanks!

torchello commented 3 years ago

@divine Yes, it's Caddy version. Will try, thanks!

ingfdoaguirre commented 2 years ago

@divine Yes, it's Caddy version. Will try, thanks!

Hi, does this solution solve the slow memory increase?