Native binary becomes unresponsive after some time and burns CPU cycles

flisk commented 2 months ago

Describe the bug Just wanted to preface this with a statement of appreciation: thanks for your continued work on this project. I've been using a self-hosted CommaFeed instance for many years now, and I've been very happy with it. It's great to see that you remain committed to keeping it going. :)

Since I upgraded to CommaFeed 5 and started deploying the native binaries on my server, I've had several instances of the CommaFeed process "locking up", spewing a lot of unclear errors (see below), going through a bunch of CPU cycles and not responding to HTTP requests or even SIGTERMs.

Logs

https://pastebin.com/raw/3jHargE2

This is the daemon log from just before the last instance where I saw CPU usage go up and the application became unresponsive. Sorry for linking to an external site, but GitHub isn't letting me attach file due to some CORS problem.

To Reproduce I'm unfortunately not sure what causes this to happen. So far, the only reliable way I have to reproduce the problem is to let CommaFeed run on my server for a few days.

Environment (please complete the following information):

CommaFeed version: 5.0.2

(other info omitted due to presumed lack of relevance)

Athou commented 2 months ago

Just wanted to preface this with a statement of appreciation: thanks for your continued work on this project. I've been using a self-hosted CommaFeed instance for many years now, and I've been very happy with it. It's great to see that you remain committed to keeping it going. :)

Thank you!

Since I upgraded to CommaFeed 5 and started deploying the native binaries on my server, I've had several instances of the CommaFeed process "locking up", spewing a lot of unclear errors (see below), going through a bunch of CPU cycles and not responding to HTTP requests or even SIGTERMs.

Reading the log files, I can set that a lot of the errors happen during a call to a method of Safepoint, which seems to be a way for the native image to pause so that the garbage collection can occur. Looks like the garbage collection phase is taking a lot of time.

Is your server running low on memory by any chance? Or are you restricting CommaFeed to a maximum amount of memory? If so, the threshold may be too low.

flisk commented 2 months ago

I'm not restricting CommaFeed's memory as such, but the VM does have just 1 GB of RAM, which I see being mostly consumed during these events. I'll double that later and see if the issue resurfaces.

I'd be a little surprised if this solved the problem though – I thought the switch to native builds was supposed to bring down RAM usage. That same VM's 1 GB of RAM was fine when I ran CommaFeed under Java.

Athou commented 2 months ago

I'd be a little surprised if this solved the problem though

Me too, the native executable for commafeed.com oscillates between 150 and 300Mb for several hundreds of concurrent users. It must be something else but I don't know what.

Athou commented 2 months ago

CommaFeed is the only app running in the VM and the host's CPUs are not saturated, right? Does this happen with the jvm version of CommaFeed too?

flisk commented 2 months ago

Yes, CommaFeed is pretty much alone in that VM. The CPU isn't saturated.

I haven't tried a JVM build since upgrading to 5.x.x. I'll do that once I've reproduced the issue with a bigger RAM allocation to the VM (or when I've failed to reproduce it, depending).

Results will take a little while, since occurence is infrequent.

flisk commented 2 months ago

Just a quick update: nothing so far with the native binary on 2 GB of RAM. I'll probably switch it out for the "vanilla" Java version this weekend and see how that behaves with both 1 GB and 2 GB.

flisk commented 2 months ago

Just saw Commafeed die of memory exhaustion running off the jar file with 2 GB of RAM:

2024-09-17 18:20:59,153 ERROR [com.com.bac.fee.FeedRefreshEngine] (pool-9-thread-4) error while processing feed http://speedboatdope.com/rss/: java.util.concurrent.CompletionException: java.lang.O>
        at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
        at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
        at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1770)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.OutOfMemoryError: Java heap space

It seems like the problem isn't isolated to the native version, it just presents with different log output. The process also isn't spinning in place like the native binary did, it just sits there with barely any CPU usage and doesn't respond to HTTP requests.

Athou commented 2 months ago

Really strange. Could it be a memory leak? Do you have a way to monitor the process memory usage, or check at regular intervals to see if it increases steadily?

Do you mind sharing a little bit information about your setup?

Which database is used for CommaFeed
What OS is used in the VM
Non-default CommaFeed configuration properties

It could also be a specific feed that causes this issue. Could you eventually give me an export of your OPML file with the feeds you're following so I could try to reproduce the issue? You can send it to me in private if you're not comfortable sharing it publicly.

flisk commented 2 months ago

I'm pretty sure it's a memory leak. I don't have nicely visualized data to share, but I've been watching the memory usage on the VM tick up as time goes on.

I'm using PostgreSQL 15.8, the VM is running Debian 12, and my CommaFeed config sans secrets is:

commafeed.image-proxy-enabled=true
commafeed.announcement=Welcome to CommaFeed 5.0.1! Look 'ma, all native!
commafeed.feed-refresh.interval=1h

quarkus.datasource.jdbc.url=jdbc:postgresql://localhost:5432/commafeed
quarkus.datasource.username=commafeed
quarkus.datasource.password=[snip]

quarkus.http.auth.session.encryption-key=[snip]

I'd be happy to share an OPML export privately. I'll send you an email with the OPML attached shortly.

Athou commented 2 months ago

Thank you, I'll try to reproduce the issue.

In the mean time, could you please disable commafeed.image-proxy-enabled just to make sure this setting is not causing the memory leak? It's not enabled on commafeed.com and I don't have memory issues on that instance. Thanks!

Athou commented 2 months ago

I have an instance running the jvm package for 8h with postgres, your OPML file and commafeed.image-proxy-enabled enabled and the memory seems stable at 440Mb without any memory setting to reclaim unused space on the command line :/

What is the command-line command you're using to start CommaFeed? Have you tried setting a hard limit or using dynamic sizing as mentionned in the README?

flisk commented 1 month ago

That's strange, but not entirely unexpected, I think. I've been watching the memory usage on and off over a period of days, and it takes over a week to run out of memory on my 2 GB system.

I'm doing some unrelated experimentation with monitoring systems right now, which means I can provide a graph as memory usage goes up. Here's what it looks like starting last night to now: Screenshot 2024-09-27 at 01 00 58

I'll provide another screenshot when that graph starts to look more definite.

The command line invocation isn't anything fancy. I do have some extra sandboxing stuff in my unit file, which I doubt is related, but I can rip all of that out as well to see if it's related:

[Unit]
Description=CommaFeed app server
StartLimitIntervalSec=0

[Service]
ExecStart=/srv/commafeed/commafeed-5.3.0-postgresql-linux-x86_64-runner

Restart=on-failure
RestartSec=5
SuccessExitStatus=143

User=commafeed
WorkingDirectory=/srv/commafeed
SyslogIdentifier=%N

# Hardening
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
PrivateDevices=yes
ProtectKernelTunables=yes
ProtectKernelModules=yes
ProtectControlGroups=yes
RestrictAddressFamilies=AF_INET AF_INET6
RestrictNamespaces=yes
LockPersonality=yes
RestrictRealtime=yes
RemoveIPC=yes
ProtectHostname=yes
RestrictSUIDSGID=yes

# Hardening / System Call Filtering
SystemCallFilter=@system-service
SystemCallErrorNumber=EPERM
SystemCallArchitectures=native

[Install]
WantedBy=multi-user.target

I have not tried changing any of the memory-related options yet, but I'm back to the native binary for now either way, so that shouldn't apply.

Athou commented 1 month ago

I have not tried changing any of the memory-related options yet, but I'm back to the native binary for now either way, so that shouldn't apply.

In my experience, the memory flags are not needed with the native binary, but it might help in your case. Could you try this?

ExecStart=/srv/commafeed/commafeed-5.3.0-postgresql-linux-x86_64-runner -Xmx256m

flisk commented 1 month ago

Yeah, that seems to have done the trick. There's something else going on with an unrelated SQL state error that I see from time to time, but I'll file a separate issue for that at a later date.

Screenshot 2024-10-09 at 23 02 51 (Context: the restart between 06 and 07 is when I applied the memory limit launch arg)

If that's enough for you to consider this fixed, feel free to close :)

Athou commented 1 month ago

Awesome, I'll tweak the README a little bit to indicate that the flag also works for the native image then I'll close this 👍

Athou / commafeed

Native binary becomes unresponsive after some time and burns CPU cycles #1539