Closed: adeelnawaz closed this issue 3 years ago
I have increased the memory from 256 MB to 512 MB, and now the task seems stable. But still, is the memory use justified, or could there be a memory leak?
Could you tell me if the memory usage increases over time? The subscription API needs more memory but it shouldn't be that much. Could you also provide me a memory profile? The steps to collect a profile are described here: https://mercure.rocks/docs/hub/debug
The memory usage was at 100% right after the deployment, which was causing the container to crash. The frontend app had around 30-50 users online at the time, so as soon as the Mercure service was up, it had to face its regular load. I'll collect a debug profile, though it won't be on 256 MB of memory, since I cannot afford a crash :)
How do you enable the subscription API in a Docker container? I am using the base docker-compose.yml from the Mercure documentation, but I don't know whether it is enabled correctly.
Here is my docker-compose file:
And what is the correct way to get the subscriber payloads, the subscription topics, and the subscribers of a topic?
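For context, a minimal sketch of how this is usually done with the official `dunglas/mercure` image (the env-var names below are the ones that image documents; the JWT keys are placeholders, and the `subscriptions` directive is what turns the subscription API on):

```yaml
# Sketch, not a full configuration: enabling the subscription API
# with the official dunglas/mercure Docker image.
services:
  mercure:
    image: dunglas/mercure
    environment:
      MERCURE_PUBLISHER_JWT_KEY: '!ChangeThisMercureHubJWTSecretKey!'
      MERCURE_SUBSCRIBER_JWT_KEY: '!ChangeThisMercureHubJWTSecretKey!'
      # Extra directives appended to the mercure block of the Caddyfile;
      # "subscriptions" enables the subscription API.
      MERCURE_EXTRA_DIRECTIVES: subscriptions
    ports:
      - "80:80"
```

Per the Mercure specification, the subscription state is then exposed to authorized subscribers under `/.well-known/mercure/subscriptions`, `/.well-known/mercure/subscriptions/{topic}`, and `/.well-known/mercure/subscriptions/{topic}/{subscriber}`.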
> Could you tell me if the memory usage increases over time? The subscription API needs more memory but it shouldn't be that much. Could you also provide me a memory profile? The steps to collect a profile are described here: https://mercure.rocks/docs/hub/debug
Hello @dunglas,
Having the same problems: memory increases over time and CPU is constantly high.
Not sure if it's related to Mercure itself (the new 0.11) or to another plugin (Vulcain) of the Caddy server leaking memory, since we moved away from the standalone Mercure server (based on the old 0.10 version).
Here are the debug profiles:
- pprof.caddy.samples.cpu.001.pb.gz (30 seconds)
- pprof.caddy.samples.cpu.002.pb.gz (90 seconds)
- pprof.caddy.samples.cpu.003.pb.gz (600 seconds)
- pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz (alloc_space)
- pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz (inuse_space)
- pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.003.pb.gz (inuse_space)
- pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.004.pb.gz (alloc_space)
- pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.005.pb.gz (inuse_space)
- pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.006.pb.gz (inuse_space)
- pprof.caddy.alloc_objects.alloc_space.inuse_objects.inuse_space.007.pb.gz (alloc_space)

The interval between these profiles is about 5-10 minutes.
I will post more debug profiles tonight, when memory is full and the CPU is consuming all cycles.
Thanks!
Edit: this is probably the main reason for the slowdown. For each connected client the following line is written to the log, and the CPU has a hard time handling all of it:
```
{"level":"debug","ts":1616243607.2544997,"logger":"http.handlers.mercure","msg":"Subscriber has not subscribed to this update","subscriber": {}}
```
It's because you're in debug mode. Can you try to disable it to see if the situation improves?
> It's because you're in debug mode. Can you try to disable it to see if the situation improves?
@dunglas the reason for enabling debug mode was only to dump some debug profiles. Indeed, it's normally a little better with debug mode disabled and the error level set to fatal, but the situation still isn't good at all.
Almost two server crashes in the last two days. Caddy doesn't get killed by the system while taking all CPU cycles and memory; it just sits there, waiting to crash the server...
Also, memory continues to increase more and more, new debug profiles: debug_2.zip
The previous 0.10 version ran stably for at least a year; this new version looks like it has some issues :(.
Thanks!
Can you try to use the legacy build, so we'll see whether it's related to the Caddy module or to a change in the library itself? Thanks for the profile, I'll take a look.
> Can you try to use the legacy build so we'll see whether it's related to the Caddy module or to a change in the library itself. Thanks for the profile, I'll take a look.
Sure, I will move back to legacy build and report back with the result.
Also, I'd like to capture some debug profiles when the server gets close to crashing, with memory full and the CPU saturated.
Thanks!
Hi, sorry I couldn't provide you with a debug profile, but for me the issue (the original issue) arose when the subscription API was enabled. With it disabled, hardly 10% of the memory was used.
The client was getting anxious, so I had to replace Mercure with another service, but I can dig out the Dockerfile and give you the ECS specs if that helps recreate the issue.
Just a quick update: it looks like there is definitely an issue with https://github.com/caddyserver/cache-handler, which was installed and enabled as well.
The real problem with the Caddy server is that it's hard to say which plugin was having the issue. It has been running fine for two days now.
Thanks!
@dunglas 👋,
There is definitely an issue. I'm not sure whether it's related to the Caddy version or to the latest changes, but the server still goes crazy every two days. I just haven't had time to take a closer look and dump profiles again.
Caddy doesn't get killed; it just hangs until it is restarted. There are about 2-3M notifications going through every day, so the load is fairly heavy, which might be one of the reasons nobody normally notices this.
I have just dumped debug profiles, can you please take a look?
Used:
```shell
go tool pprof http://localhost:2019/debug/pprof/allocs
go tool pprof http://localhost:2019/debug/pprof/heap
go tool pprof "http://localhost:2019/debug/pprof/profile?seconds=30"
```
Thanks!
I can confirm there is a memory usage increase over time, without subscriptions, on the latest version (I didn't use the previous one, so I'm not sure whether it's only the latest).
@torchello could you provide me a profile? Don't hesitate to contact me on Slack (symfony.com/slack) or by mail if you need help to create the profiles.
@dunglas Please find all non-empty profiles attached (plus a full goroutine stack dump): debug_profiles.zip.
Let me know if you need any additional info. Fingers crossed for an easy fix. I've also noticed that the memory usage growth pattern is more or less the same after every deploy:
Thanks for the profiles @divine and @torchello. According to them, there are two main sources of memory consumption:
- Bolt
- zap.Object() when logging a subscriber

Both issues don't look directly related to the Mercure code base (but I'll double-check), and should probably be reported to the upstream projects.
For the record, Bolt is used for the "history" feature. If this feature isn't used, Bolt can be disabled by using the "local" transport. (Also, the managed version of Mercure uses another transport, which is probably why I didn't notice this before.)
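With the official Docker image, switching to the local transport is a one-line change. A sketch, assuming the `dunglas/mercure` image, whose bundled Caddyfile reads `MERCURE_TRANSPORT_URL` (defaulting to a Bolt database):

```yaml
# Sketch: disable the Bolt-backed history in the official Docker image.
services:
  mercure:
    image: dunglas/mercure
    environment:
      # "local" keeps no update history, so no memory-mapped bolt.db file
      MERCURE_TRANSPORT_URL: 'local://local'
```

The trade-off is that reconnecting clients can no longer catch up on missed updates via the Last-Event-ID mechanism, since no history is kept.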
Regarding the logger, I'll see what we can do.
Is the process crashing or is just the memory consumption increasing over time? If it doesn't crash, then it's intended:
> Bolt uses a memory-mapped file so the underlying operating system handles the caching of the data. Typically, the OS will cache as much of the file as it can in memory and will release memory as needed to other processes. This means that Bolt can show very high memory usage when working with large databases. However, this is expected and the OS will release memory as needed. Bolt can handle databases much larger than the available physical RAM, provided its memory-map fits in the process virtual address space. It may be problematic on 32-bits systems.
https://github.com/etcd-io/bbolt#caveats--limitations and https://github.com/etcd-io/bbolt/issues/234
@divine's profiles are a bit different: a lot of memory is used by Prometheus. Are you using it? If not, maybe you can disable it?
Hello @dunglas,
> a lot of memory is used by Prometheus
It's not enabled. Maybe it's turned on by default when debug mode is enabled?
Also, from the start I had the same feeling that the history was the reason for the issues, so I changed the transport to disable it:
```
transport_url "local://local"
```
> Is the process crashing
It's crashing. However, before, it didn't crash but was "stuck" (no reply to any request on Caddy) when used on the same server as API Platform. Probably due to the high load it didn't realize when to crash 😄.
I have moved Mercure to a different server and it's crashing every 2-3 hours.
I think I'll be able to test the legacy version today and see whether it crashes like the Caddy version.
Thanks!
> Can you try to use the legacy build so we'll see whether it's related to the Caddy module or to a change in the library itself. Thanks for the profile, I'll take a look.
Yesterday I deployed the legacy build and was surprised that it hasn't crashed yet, and the memory isn't increasing; it stays the same all the time.
Interesting... I'll stick with the legacy version then. Using Mercure on the same server as API Platform was great, but the issues it was having take all the joy out of it.
I will report back if the legacy build starts crashing or having issues.
Thanks!
Thanks for the feedback @divine. Looking at the profile, I don't understand how it's related to Caddy. However, as Prometheus metrics are enabled by default with Caddy but not with the legacy build, I wonder whether that could be the root cause. If the legacy build is stable, could you try to enable Prometheus support for it, to see if it starts crashing? Thanks!!
@dunglas you're welcome, hopefully, this will be fixed.
I thought Prometheus was disabled by default in both versions?
I have enabled metrics now and will report back if it crashes.
However, after running for about 5 minutes I noticed the memory started increasing again.
When I first checked the metrics endpoint, the response was about 1 MB; it's currently close to 2 MB and still growing:
```
# HELP mercure_subscribers The current number of running subscribers
# TYPE mercure_subscribers gauge
mercure_subscribers{topic="/users/5efabc1c90dce00ca760624b"} 0
mercure_subscribers{topic="/users/5f3cc41845ce9e209b3a66f1"} 1
mercure_subscribers{topic="/users/5f6cebb5f272d8415e3e97c1"} 1
mercure_updates_total{topic="/users/6084516572c99740d4121609"} 2
mercure_updates_total{topic="/users/608451721f04bd439543fcf0"} 1
```
and it goes on for about 20k lines of mercure_updates_total.
I think this is probably the cause. You can reproduce it by sending updates to many different topics, and it will crash for sure 👍
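The blow-up is a classic label-cardinality problem. A minimal Python sketch (hypothetical helper names, not Mercure code) of why a per-topic label grows without bound while an unlabelled counter stays constant:

```python
# Sketch: each distinct label value creates a new time series that is kept
# for the lifetime of the process, like mercure_updates_total{topic="..."}.
from collections import Counter

def record_update(series: Counter, topic: str) -> None:
    """Increment one counter series per distinct topic label."""
    series[f'mercure_updates_total{{topic="{topic}"}}'] += 1

def record_update_aggregate(series: Counter) -> None:
    """The bounded alternative: a single series with no topic label."""
    series["mercure_updates_total"] += 1

per_topic, aggregate = Counter(), Counter()
# One ephemeral topic per user, as with the /users/<id> topics in this thread:
for user_id in range(20_000):
    topic = f"/users/{user_id:024x}"
    record_update(per_topic, topic)
    record_update_aggregate(aggregate)

print(len(per_topic))   # 20000 series: the scrape payload never shrinks
print(len(aggregate))   # 1 series, regardless of how many topics exist
```

Dropping the topic label bounds both the metrics endpoint response size and the memory held by the metrics registry.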
I have both versions installed and can swap, just let me know if I could somehow help with debugging and finding the root cause.
Thanks!
I think you correctly identified the problem! AFAIK Prometheus is automatically enabled with Caddy.
We need to add an option to not collect metrics per topic, and probably enable this behavior by default. I'll work on this.
JFYI, after running for 2 hours the metrics endpoint response is about 11 MB and memory continues to increase:
```
mercure_subscribers: 7356
mercure_subscribers_total: 7356
mercure_updates_total: 145188
```
Quite interestingly, mercure_subscribers series are kept even though the user isn't subscribed anymore:
```
mercure_subscribers{topic="/users/60846902bc92f800d72b6c47"} 0
mercure_subscribers_total{topic="/users/60846902bc92f800d72b6c47"} 1
mercure_updates_total{topic="/users/60846902bc92f800d72b6c47"} 8
```
Hope this helps as well!
Thanks!
@dunglas Thanks for the tips. Adding the transport_url "local://local" directive helps reduce the memory usage to ~60 MB:
Actually, it's still growing but much slower:
> Actually, it's still growing but much slower:
Hello,
Are you using the Caddy version or legacy?
If Caddy, then you need to explicitly disable metrics, which are enabled by default:
```
metrics_enabled false
```
At least until a fix is pushed; I'm not sure whether that extra configuration alone will help.
Thanks!
@divine Yes, it's Caddy version. Will try, thanks!
Hi, does this solution solve the slow memory increase?
Hi. I'm using Mercure's official Docker image with a few customizations in the Caddyfile. Without the subscription API turned on, the memory usage is very low. But as soon as I enable the subscription API, the memory usage jumps to 100% and the container crashes. The container and load specs are as follows:
- CPU units: ~680 (t3a.medium EC2 instance, 2 vCPUs, shared by one Mercure task and two other tasks)
- Memory: 256 MB
- Subscribers: ~50
- Topics: ~5
- Messages: 5/minute (avg)
The attached memory graph shows the memory used when the subscription API is enabled and when it's disabled. From 10:05 to 11:45 the subscription API is disabled; from 07:00 to 10:00 and from 11:45 onwards it is enabled. The dips you see in those windows are actually crashes, not low memory usage. Memory consumption is constantly at 100% while the container is running.
Unfortunately there's nothing in the CloudWatch logs before the crash happens, and the reason for the task being stopped in the ECS console (strangely enough) is not out-of-memory; it says the essential container in the task exited.