dunglas / frankenphp

🧟 The modern PHP app server
https://frankenphp.dev
MIT License

Memory leak #366

Closed ovrflo closed 8 months ago

ovrflo commented 9 months ago

FrankenPHP seems to be leaking memory (at least in my specific configuration). I first noticed it when benchmarking: memory would staircase to heaven with each request passing through it, eventually getting killed by the OOM killer. Just now I noticed that its memory usage keeps growing while idle as well (well, almost idle; it still handles healthchecks). Below you'll find a screenshot showing an idle FrankenPHP pod growing in both memory and CPU usage while receiving 1 request every 5 seconds from an ELB. Whatever is leaking is also adding to the CPU workload.

Screenshot: https://imgur.com/a/a6wbI2A

Setup:

[PHP Modules]
amqp
apcu
bcmath
calendar
Core
ctype
curl
date
dom
exif
FFI
fileinfo
filter
ftp
gd
gettext
hash
iconv
imagick
intl
json
ldap
libxml
mbstring
mysqlnd
openssl
pcntl
pcre
PDO
pdo_mysql
pdo_sqlite
Phar
posix
random
readline
redis
Reflection
session
SimpleXML
soap
sockets
sodium
SPL
sqlite3
standard
tokenizer
xml
xmlreader
xmlwriter
xsl
Zend OPcache
zlib

[Zend Modules]
Zend OPcache

Let me know how I can help get to the bottom of this. I'm not really familiar with the development tools involved in golang/caddy/frankenphp, but I'm a quick learner.

dunglas commented 9 months ago

Thanks for the report. Could it be a duplicate of https://github.com/dunglas/frankenphp/issues/280?

Could you try without the worker mode? Could you also try to check where the memory leak is by following the steps described in the issue mentioned above?

It would help a lot to understand what's going on.

Thank you.

ovrflo commented 9 months ago

Hi!

Thanks for the quick response!

Here's what I did.

Profiles -- https://www.dropbox.com/scl/fi/nmecz28yep1vurcc6n7hh/frankenphp_pprof_1.tar.gz?rlkey=zsyb6k8kka59noic4wkz9yv7r&dl=0
Memory usage screenshot -- https://imgur.com/a/OIBBX7t

I can see memory going up both with and without worker mode. Without worker mode it's a slower leak, but it feels similar: it looks like some memory leaks with each request, regardless of worker mode. A critical difference is the request rate -- worker mode allows a lot more requests in the same timeframe (normal = 15k reqs/3 min; worker = 60k reqs/3 min).

Also, do you get OOM errors?

About that. I didn't get them in the current test, but I'm pretty sure I did get OOM when I ran longer benchmarks (30 minutes, 300 connections). K8s would keep killing the pods when reaching the configured limit (1 GB).

I've just noticed some other tools mentioned in the other thread. Let me know if you want me to use those too.

withinboredom commented 9 months ago

when reaching the configured limit (1 GB).

1 GB is too low a limit with the default GC settings of Go (Go is designed for running on servers, which usually have gobs of memory, fast networks, etc.). I usually see ~3-5 GB of usage before GC starts to kick in with the default settings. You may need to tune Go's GC for such low memory allocations using the GOMEMLIMIT environment variable.

Give that a try and see if it does any better.

See: https://tip.golang.org/doc/gc-guide#Memory_limit

ovrflo commented 9 months ago

Thanks for the tip! This may mean we need a performance tuning guide.

Running in k8s, I'd rather have a lot of smaller pods instead of few huge pods. There's also the issue of staging environment (where I'm currently testing in) where I'd want to use as few resources as possible.

I've read most of the linked documentation. I'm going to keep the hard 1GiB limit on the container, set GOMEMLIMIT to 1GiB as well and set GOGC to 70%. As I understand it, when usage hits GOGC (=70%), it runs the GC.

Another thing that I gleaned from the doc is that Go's GC is aware of all the memory usage (that includes Cgo/php), but obviously can't do anything about it. So, when I set the memory limit to 1 GiB, that limit will also include whatever php is using, which is exactly what I would want/expect (e.g: when php starts hogging memory, Go should try to free its own, to respect the limit). Can you guys confirm this is correct?

... a while later...

I've run some stress testing... Setting the memory limit to 1GiB didn't seem to have an (obvious) impact, so I lowered it to 512MiB. The memory usage still grew, but at ~384MiB (70% * 512) it would start growing more slowly, thanks to the GC. I disabled the worker mode and it still grew -- a lot slower, but it did. I noticed CPU throttling kicking in at some point, correlated with hitting the 384 MB mark (GC in full swing, eating precious CPU cycles from the workload). And then some very odd stuff happened. The pods kind of froze and stopped processing requests (timeouts). They were running and the memory usage kept growing, but no responses. I tried redeploying, and the new pods (while the older ones were still active) would not start up, failing with "PHP Fatal Error: Could not create timer" (something like that; I'm sorry, I didn't copy the exact message).

I'll need to take a break now. I'll get back to it later today and try to capture the profiles using the same methodology as yesterday so that we can compare them. Speaking of pprof, I don't need to run pprof twice, do I? I was running it once for heap and a second time for allocs, but I have a feeling the exports are actually the same.

withinboredom commented 9 months ago

Running in k8s, I'd rather have a lot of smaller pods instead of few huge pods.

A little off-topic, but generally the only things (IRL) running out there with that little memory are IoT devices; most machines these days have at least 4GB of memory (I don't even think you can buy a 1GB stick of RAM anymore).

Remember that software is built and optimized on developer machines, which usually have at least 4GB of memory. Go's philosophy tends to be: memory is cheap, CPU is not (adding more memory is a simple hardware swap in a physical machine, up to a limit, while a new CPU often requires swapping the entire mainboard too).

Back on topic: you'll need to leave enough memory for PHP to operate as well (maximum number of workers × (baseline worker memory + per-request memory)). It appears you might be "deadlocking" because Go is unable to give up more memory while PHP is hogging it.

I have a test harness that runs your code in worker mode via frankenphp-cli and sends it fake requests to test for memory leaks. I'll clean it up tonight and make it public. I use it to hunt down memory leaks in my app and to produce some metrics/suggestions (baseline worker memory, optimum number of workers, per-request memory) to use as tuning variables.

EvanEdwards commented 8 months ago

A little off-topic, but generally the only things (IRL) running out there with that little memory are IoT devices; most machines these days have at least 4GB of memory (I don't even think you can buy a 1GB stick of RAM anymore).

To continue this off-topic point (which intrigues me for personal reasons): a Raspberry Pi Zero 2 W would be a low-end target, and it has 512MB of memory in total. I'm not sure a production-oriented server project should aim quite that low, but it is a bottom end to consider. I suspect it is probably below a reasonable threshold and that 4GB is likely the next "category" step up.

wizhippo commented 8 months ago

A little off-topic, but generally the only things (IRL) running out there with that little memory are IoT devices; most machines these days have at least 4GB of memory (I don't even think you can buy a 1GB stick of RAM anymore).

To add: we are not all running on bare hardware; in a containerized environment, 4GB can be a pretty large container for smaller services.

Perhaps I misread and you are specifically directing this at development environments and not production. Your comment kind of states this but not clearly as I read it.

withinboredom commented 8 months ago

Ok, cleaned it up and put the script here. Basically, it's a self-contained PHP file that can be copied into a running container (such as a production one) to measure some basic things.

Perhaps I misread and you are specifically directing this at development environments and not production. Your comment kind of states this but not clearly as I read it.

Sorry, I get annoyed at Go sometimes. I was ranting a bit. It's a good language, but some of the very low-level defaults are ... interesting.

Anyway, that small bit of code can help you narrow down whether your code is leaking memory (I usually run it once for 100 requests, then again for 1000; if it isn't leaking memory, the numbers should come out almost -- if not exactly -- the same).
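
The general shape of that check is easy to sketch. The following is only a rough illustration of the approach (not the actual script, which sends proper fake requests and reports more stats); the $handler closure is a stand-in for booting your app:

<?php
// Rough sketch only: run N requests through the worker loop and watch how memory grows.
// If nothing leaks, the per-request average should be roughly the same for 100 and 1000 iterations.
$iterations = (int) ($argv[1] ?? 100);

$handler = static function (): void {
    // Placeholder: a real worker script would boot the framework and handle the request here.
    echo 'ok';
};

$baseline = memory_get_usage(true);

for ($i = 0; $i < $iterations; $i++) {
    frankenphp_handle_request($handler);
    gc_collect_cycles();
}

$grown = memory_get_usage(true) - $baseline;
printf(
    "grew by %.2f KB total, %.2f KB per request over %d requests\n",
    $grown / 1024,
    $grown / 1024 / $iterations,
    $iterations
);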

ovrflo commented 8 months ago

I'm happy to see such a lively debate and, before I go any further, I'd like to thank you all for your input and @withinboredom for the nice script. I've only read it, didn't use it yet.

I was going to respond later today, after I got the chance to gather more data, but the off-topic branch of this thread does interest me. I'm going to defend the "a hello world app should be able to run on a 512 MiB rPi" side of the conversation. If I need more than 512 MiB of RAM, it should be because the app is a huge beast and/or gets a lot of traffic, not because the HTTP parser needs 1 GB of RAM to handle some TCP streams.

What I expect from caddy is to consume a small-ish amount of RAM per connection and/or request. We can't pretend webservers need gigs of RAM to work when a few days ago we were happily running nginx with thousands of rps on an rPI. Maybe caddy has more features. We can explain some more memory usage that way, but we can't explain gigs worth of RAM... Caddy is written in Go, it's garbage collected. We all know how GC impacts memory usage, but we also know we can tune that and have it work fairly predictably. GC means more memory usage, but it doesn't account for that much.

What I expect from the frankenphp side of things is:

The last 2 points, maybe 3, are my concern. If my app is poorly written, it's going to leak memory in the worker. If the framework I'm using is not optimised for long-running processes, it may leak memory over time. If I'm using some poorly written PHP extension, again, it may leak memory. Those things are my responsibility and I can control them. I expect whatever memory leaks happen to be from my codebase, not from the harness (frankenphp, or caddy).

Assuming every part of this system is leak proof, I should be able to run my new API platform app in a predictable manner with even 512 MiB of RAM, right? It may not be able to process more than 10-20 rps, but it should be able to run without getting killed by the kernel.

I'm running my app in k8s. I'd rather have HPA spawn more pods to handle the demand instead of having a few large pods happily idling with a few gigs of reserved RAM. My staging environment reserves about 2 GB of RAM for mercure+redis+rabbit+mysql. I'm not willing to reserve another 2 GB on just the API platform pod.

A fresh API platform can definitely run on an rPI using nginx and php-fpm. Why should it need more running under caddy?

That being said, let's figure out where the memory leak is (if any) and let this topic be a starting point for a future "frankenphp production tuning guide" :)

dunglas commented 8 months ago

Sorry, I was very busy today and didn't have the time to reproduce, but I'm pretty sure that the "create timer" error you got is the symptom of a leak.

Basically, it means that the timer created by PHP to handle timeouts isn't released, while it should be at the end of the request. The default timer limit is quite high for most environments, so hitting it means that we don't finish the (PHP, and so maybe Go) request, or don't finish it properly.

withinboredom commented 8 months ago

A fresh API platform can definitely run on an rPI using nginx and php-fpm. Why should it need more running under caddy?

It probably doesn't, but worker mode is a totally different beast than cgi-mode. It does use more memory because there is more accounting, things held in memory between requests, etc. This is a trade-off (as everything is) for more speed. If you want more speed, you need to give up something, and in this case, it's memory.

A fresh (example) API Platform app uses ~150 MB per worker (~50 MB per in-flight request on top of a ~100 MB worker baseline), which means you can realistically expect about 5 workers using 750 MB under load (which is about 5 concurrent requests). In CGI mode, you can handle quite a few more concurrent requests than that, but you give up quite a bit of speed.
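
As a purely illustrative back-of-the-envelope (every number below is an assumption in line with the rough estimates above, not a measurement):

<?php
// Illustrative capacity math only -- all values are assumptions.
$containerLimitMb = 1024; // hard memory limit on the container
$goCaddyBudgetMb  = 250;  // headroom left for Go/Caddy (roughly GOMEMLIMIT territory)
$workerBaselineMb = 100;  // memory a booted worker holds between requests
$perRequestMb     = 50;   // extra memory while a request is in flight

$perWorkerMb = $workerBaselineMb + $perRequestMb; // ~150 MB
$maxWorkers  = intdiv($containerLimitMb - $goCaddyBudgetMb, $perWorkerMb);

echo "roughly {$maxWorkers} workers fit in {$containerLimitMb} MB\n"; // ~5 with these numbers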

ovrflo commented 8 months ago

@withinboredom yeah, I totally get your point, while I also stand by mine: we need to think critically about the resource usage, not just throw RAM at the problem. Thinking critically, you can find performance optimization opportunities, or you can find the assurance that the system is as good as it gets. I'm okay with throwing RAM only after deciding the system is as good as it gets, or any remaining optimization opportunities are not worth it.

I may be biased by "the good ol' days" of php 5.4 and symfony 2.x, when apps reported just a few MBs of memory usage. I may need an "update" on the memory requirements of today's apps.

That being said, I'm going to use your script to figure out how much memory I need per worker, per request and figure out how to tune this.

ovrflo commented 8 months ago

I had to change the script a bit, but after running it, things look fairly stable.

Running it with 100 requests yields:

Starting with 2.00MB memory
Running request....................................................................................................Done
Stats:
Initial bootstrap: 6.00MB
Average memory usage per request: 184.32KB
Max memory usage per request: 16.00MB
Min memory usage per request: 0.00B

I had it run 10k loops and then watched ps -o pid,user,rss,vsz,comm, which yielded:

PID   USER     RSS  VSZ  COMMAND
  183 root      61m 162m php

So, the actual memory usage while working is 61 MiB. I then tried to measure the memory usage while sleeping, before doing any work -- the baseline memory usage -- and found 29m. So that's PHP's minimal memory usage. The actual app is thus using 32 MiB.

withinboredom commented 8 months ago

That's awesome! So, 10 workers should be using a steady ~300mb in the perfect world (and scaling up linearly), which we are (clearly) not seeing.

I wonder if we could make some kind of status page or API call (similar to fpm's, maybe) that shows per-request memory usage, post-bootstrap memory usage, current memory usage (since the end of the last request), number of requests processed, etc. Once we instrument it, it would be a lot easier to actually locate where in the code to start searching...

ovrflo commented 8 months ago

A status page would certainly be useful. Earlier today I was wondering if it'd be possible to weave frankenphp metrics into caddy's prometheus metrics endpoint (which I'm already happily scraping). I tried 2-3 searches and couldn't find much on the topic of a module exposing its own metrics (that doesn't mean it's impossible, it just means 3 minutes of superficial searching didn't reveal a path for this). When that didn't pan out, I was thinking maybe I could change the frankenphp runtime to capture some metrics (I already kinda forked it). It wouldn't be pretty, or accurate, but it may be doable and good-enough™.

I've seen a Symfony Prometheus implementation that cleverly used Redis for collecting the data (you update Redis any time you want to store a metric; Prometheus later scrapes /metrics and you serve it the data from Redis). That could work.
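
Something like this minimal sketch is what I have in mind (metric and key names are made up, and it assumes ext-redis with a Redis host reachable as "redis"; in practice the two halves would live in separate scripts):

<?php
// Minimal sketch of the Redis-backed idea -- names and endpoints are hypothetical.
$redis = new Redis();
$redis->connect('redis', 6379);

// In the worker, after each handled request, push a couple of counters into Redis.
$redis->incr('metrics:requests_total');
$redis->incrBy('metrics:peak_memory_bytes_total', memory_get_peak_usage(true));

// A separate /metrics script then renders them in the Prometheus text format
// so the existing scraper can pick them up.
header('Content-Type: text/plain; version=0.0.4');
printf("app_requests_total %d\n", (int) $redis->get('metrics:requests_total'));
printf("app_peak_memory_bytes_total %d\n", (int) $redis->get('metrics:peak_memory_bytes_total'));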

TLDR; Frankenphp could really use some instrumentation.

Again, thank you so much for the tool :) Maybe you should upstream it into frankenphp's repo? I may be the first (well, second) to need it, but I can't possibly be the last. Or maybe upstream it into the runtime repo, as a tool attached to the runtime instead of the webserver.

dunglas commented 8 months ago

It's possible to add metrics to the existing Prometheus endpoint.

Here is how I've done that for Mercure:

dunglas commented 8 months ago

I created a tiny load test using K6 to try to reproduce the issue: https://github.com/dunglas/frankenphp/pull/392

So far, with a very minimal Caddyfile (included in the PR), I didn't manage to reproduce the issue (on Mac). Maybe it will appear by adding extra options like compression.

Would you mind trying this load test on your machine/config to see if you're able to reproduce the problem?

withinboredom commented 8 months ago

heaptrack.frankenphp.24362.gz

You can use heaptrack_gui (available on linux, probably Mac, too) to view this. This shows where the memory is leaking in cgo/go (it looks like there may be a minor leak in worker.go since a "dummy request" is created each time a worker restarts).

Gist of leaks detected:

  1. Mostly in frankenphp_update_server_context, specifically with regard to ts_resource (it leaks somewhere in zend_ini_refresh_caches and zend_hash_copy -- so it might be a PHP leak).
  2. If I'm reading correctly, allocations made in Go for C data structures that are never freed (such as the root directory to chdir to).

Unfortunately, it doesn't appear to capture all the proper threads, so it's missing request handling. I'll see if I can work that out if someone doesn't beat me to it. Hmmm. I managed to get some of the request threads, but they are too big to upload here. I'll take some smaller ones tomorrow or later this week.

dunglas commented 8 months ago

Thanks @withinboredom, it's super helpful!

For 1, the PHP executor is known to be complex and it's hard to track how it manages allocations. I'm not sure if it's a real leak or not. Don't hesitate to report it to the PHP project so they can figure it out. For 2, have you identified which data structures?

withinboredom commented 8 months ago

Actually. Yes.

For (2), it's a red herring: I wrote up something that panics if anything isn't freed by the end of the request. It makes the code rather messy because each pointer has to be "registered" with it, but I'd be happy to push it if it might be helpful. In any case, it was all freed.

I managed to find out where all those gigabytes of memory are; they are in Caddy request contexts. So, I suspect we may be reusing request contexts from other requests, but I'm still chasing that down. I'll make sure to report back when I find something interesting.

withinboredom commented 8 months ago

So, there are indeed some small memory leaks, but I'm still trying to narrow them down (like 100kb-ish per request using echo "Hello world" as a test script and the executor script for worker mode). For some good news, generally restarting a worker every ~200 or so requests seems to help:

// random jitter to prevent all workers from restarting at the same time during load tests
$jitter = random_int(190, 220);
for ($counter = 0; $counter < $jitter; $counter++) {
    frankenphp_handle_request($fn);
}
exit(0);

It's possible to get it to fit into 2Gi, but it is quite a challenge because Go is designed for something other than this use case. Once the available memory is used up, the GC thrashes, dropping throughput to about 2% of the maximum.

So, why is nginx "so much more performant" in low-memory situations? It's likely because nginx has complete control over its memory. It is usually more efficient to reuse memory than to free and re-acquire the same memory repeatedly (the free-and-reacquire pattern is what Go does, and it makes sense for general-purpose applications). However, memory is pretty cheap (unless you are in the cloud), so we can forego doing anything clever with it and sacrifice memory for speed.

Here's a docker command to forcibly fit Frankenphp in a 2Gi container at an (extreme) expense of performance:

# GOMEMLIMIT=1000MiB sets a soft 1Gi memory limit for the Go runtime
# GOGC=1 sacrifices all performance for more memory
# MADV_DONTNEED=1 gives any freed memory back to the OS instead of reusing it
# --memory 2G is the memory limit of the container
docker run \
  --rm \
  -e GOMEMLIMIT=1000MiB \
  -e GOGC=1 \
  -e MADV_DONTNEED=1 \
  --memory 2G \
  -p 80:80 \
  -p 443:443 \
  -v $(pwd):/app \
  -v $(pwd)/caddy/frankenphp/Caddyfile:/etc/caddy/Caddyfile \
  --pull always \
  dunglas/frankenphp

Also, the following kernel parameters are set:

echo "defer" > /sys/kernel/mm/transparent_hugepage/defrag
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

withinboredom commented 8 months ago

Merry Christmas! I'm back with more news!

During the investigation into the elusive memory leak in FrankenPHP, I discovered two underlying issues (the Create Timer crash in #440 and a fun one in #439), yet the mystery wasn't entirely solved. I've been slamming FrankenPHP with load tests to discover all the different ways it can break, mostly in an attempt to find the conditions in which it becomes the OOM Killer's best friend.

It appears (for lack of other evidence) that there is potentially a memory leak in Go's http2 implementation (or Caddy isn't handling something correctly) that causes http2 frames to hang around far too long. This heap profile was taken after a few GCs, on an instance that had previously handled ~100k requests and was idle at the time.

With only 2GB of memory available, using http1 (SERVER_NAME=":80"), worker mode, disabling the docker healthchecks, applying the GC settings above, and #442, you can handle nearly infinite requests without an issue (with an elementary PHP file). (Before #442, the same setup already handled over half a million requests without issue.)

FrankenPHP and Caddy are using the rest of the memory via request contexts, as each time a request context is modified, an entirely new copy is created.

I'm no Go genius, nor am I entirely familiar with how to hunt down memory leaks in Go; what I do know are things I've learned over the past week. So, potentially, these graphs just show us something that maybe, someday, hopefully, will be GC'd, yet as far as I can tell, it never actually is. However, I'm reasonably sure one of two things is going on:

  1. Some objects hold onto http2 request frames/contexts that are no longer useful. These could be logging objects, original request variables (I know Caddy puts a reference to the original request in the context, which could create a circular reference, but I only just thought of this and didn't investigate it), etc.
  2. During cgi-mode requests, far less memory is used and no memory leak seems apparent (though until #440 is fixed, we can't run it long enough to tell for sure). Since worker mode processes a request inside a CGO callback, the memory being used (aka, the "request object") may become "tainted," and Go won't ever GC it. Only once #440 is fixed will it be easier to tell.

I'm leaning more towards (2), and this is why:

Go does some shenanigans via syscalls. Thus, I greatly suspect that all requests received by the FrankenPHP handler cannot be freed until the worker fully dies. I cannot find any documentation on that though...

While writing this, I had a eureka moment: we don't need Caddy's context, so I've opened PR #442 with a straightforward fix. It doesn't remove the memory leak entirely, but it gets rid of a large chunk of it.

bitbirddev commented 7 months ago

@ovrflo I do have a question. It's not related to this issue, but maybe you can help me anyway: I've seen you were able to build your FrankenPHP image with the amqp extension - can you give me a hint on how to do that?

withinboredom commented 7 months ago

It's in the docs @bitbirddev:

bitbirddev commented 7 months ago

@withinboredom I've already tried it with

export PHP_EXTENSIONS=amqp
./build-static.sh

but it gives me an error

[09:56:10] [CRIT] ext [php-amqp] is not supported yet

amqp is also not listed here in the available extensions list; that's why I was wondering...

can you maybe give me another hint?

withinboredom commented 7 months ago

is not supported yet

It's pretty clear that it isn't supported yet.

can you maybe give me another hint?

On the documentation page you linked to, it says to open a feature request if you want a new extension. I'd recommend opening a feature request or searching the issues there to see if it has been requested by anyone else.

In the future, please don't use closed issues for technical support. We monitor them in case of regressions, and this just adds to the noise. There is a Discussions tab at the top where members of the community help each other, and if you find a bug or missing feature in frankenphp, don't hesitate to open an issue.

bitbirddev commented 7 months ago

@withinboredom got your point, I'm sorry. I was just curious because in the first post @ovrflo lists his PHP modules and amqp is in the list - so I thought there had to be a way to do it.

ovrflo commented 7 months ago

Hi guys! Mad props to @dunglas and @withinboredom for debugging and fixing those crazy leaks. The memory usage is predictably smooth at idle, and it seems to successfully avoid the OOM reaper under a limited stress test.

@bitbirddev I built my docker image with amqp, but it wasn't a static build (or the self-contained one).

ARG PHP_VERSION=8.3
ARG FRANKENPHP_VERSION=latest

FROM --platform=$TARGETPLATFORM dunglas/frankenphp:${FRANKENPHP_VERSION}-php${PHP_VERSION}-alpine AS build

ENV APP_ENV=prod

WORKDIR /srv/app

# persistent / runtime deps
RUN apk add --no-cache \
        acl \
        file \
        ghostscript \
        openssl \
        git \
    ;

RUN set -eux; \
    install-php-extensions \
        amqp \
        apcu \
        bcmath \
        calendar \
        exif \
        ffi \
        gd \
        gettext \
        imagick \
        intl \
        ldap \
        opcache \
        pcntl \
        pdo_mysql \
        redis \
        soap \
        sockets \
        xml \
        xsl \
        zip \
    ; \
    \
    echo "ffi.enable = 1" >> /usr/local/etc/php/conf.d/docker-php-ext-ffi.ini ;

RUN mv "$PHP_INI_DIR/php.ini-production" "$PHP_INI_DIR/php.ini"
COPY --link files/conf.d/app.ini $PHP_INI_DIR/conf.d/
COPY --link files/conf.d/app.prod.ini $PHP_INI_DIR/conf.d/

The result is a docker image with frankenphp and a bunch of php extensions installed as dynamic libraries.