KGOH opened this issue 4 months ago
Is it possible that the backend isn't closing connections? What is `myapp:8080` serving? How can we reproduce it?
If you could please collect a memory and goroutine profile, that would help answer some questions: https://caddyserver.com/docs/profiling
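For example, assuming the admin API is listening on its default address, something like:

```
# admin endpoint assumed at the default localhost:2019
curl -o goroutines.txt 'http://localhost:2019/debug/pprof/goroutine?debug=1'
curl -o heap.pprof 'http://localhost:2019/debug/pprof/heap'
```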
Thank you
@mholt `myapp` is a REST API server. Currently I'm not sure how exactly you can reproduce it without me sharing the whole app (which I can't do).
Here are memory and goroutine profiles: goroutines.txt, heap_debug=1.gz
Here's an example of the memory graph on the machine I got these profiles from:
Thanks -- interesting... what kind of traffic is your server getting?
The top heaps and goroutines are simply growing buffers to read from TLS connections, and serving HTTP/2, respectively.
Goroutines:
694 @ 0x43e44e 0x44e905 0xb234cf 0xb217cc 0xb20bbc 0x769b93 0x471901
# 0xb234ce golang.org/x/net/http2.(*serverConn).serve+0x88e golang.org/x/net@v0.17.0/http2/server.go:940
# 0xb217cb golang.org/x/net/http2.(*Server).ServeConn+0xbcb golang.org/x/net@v0.17.0/http2/server.go:531
# 0xb20bbb golang.org/x/net/http2.ConfigureServer.func1+0xfb golang.org/x/net@v0.17.0/http2/server.go:321
# 0x769b92 net/http.(*conn).serve+0x1212 net/http/server.go:1917
690 @ 0x43e44e 0x436c97 0x46bce5 0x4e11c7 0x4e24ba 0x4e24a8 0x545825 0x558ea5 0x69841b 0x50dc38 0x6985fe 0x695bd0 0x69be98 0x69bea1 0x4b8350 0xb16fe5 0xb16fb4 0xb17725 0xb22667 0x471901
# 0x46bce4 internal/poll.runtime_pollWait+0x84 runtime/netpoll.go:343
# 0x4e11c6 internal/poll.(*pollDesc).wait+0x26 internal/poll/fd_poll_runtime.go:84
# 0x4e24b9 internal/poll.(*pollDesc).waitRead+0x279 internal/poll/fd_poll_runtime.go:89
# 0x4e24a7 internal/poll.(*FD).Read+0x267 internal/poll/fd_unix.go:164
# 0x545824 net.(*netFD).Read+0x24 net/fd_posix.go:55
# 0x558ea4 net.(*conn).Read+0x44 net/net.go:179
# 0x69841a crypto/tls.(*atLeastReader).Read+0x3a crypto/tls/conn.go:805
# 0x50dc37 bytes.(*Buffer).ReadFrom+0x97 bytes/buffer.go:211
# 0x6985fd crypto/tls.(*Conn).readFromUntil+0xdd crypto/tls/conn.go:827
# 0x695bcf crypto/tls.(*Conn).readRecordOrCCS+0x24f crypto/tls/conn.go:625
# 0x69be97 crypto/tls.(*Conn).readRecord+0x157 crypto/tls/conn.go:587
# 0x69bea0 crypto/tls.(*Conn).Read+0x160 crypto/tls/conn.go:1369
# 0x4b834f io.ReadAtLeast+0x8f io/io.go:335
# 0xb16fe4 io.ReadFull+0x64 io/io.go:354
# 0xb16fb3 golang.org/x/net/http2.readFrameHeader+0x33 golang.org/x/net@v0.17.0/http2/frame.go:237
# 0xb17724 golang.org/x/net/http2.(*Framer).ReadFrame+0x84 golang.org/x/net@v0.17.0/http2/frame.go:498
# 0xb22666 golang.org/x/net/http2.(*serverConn).readFrames+0x86 golang.org/x/net@v0.17.0/http2/server.go:820
Heaps:
261: 10690560 [3380: 138444800] @ 0x50ddce 0x50d7fd 0x698594 0x698557 0x6964ff 0x69be98 0x69bea1 0x76322b 0x56d763 0x4b89a3 0x7773bb 0x777298 0x4b89a3 0x4b8806 0x55a386 0x55a343 0x5611ab 0x55fcf0 0x4b87a7 0x781d55 0x781d0f 0x56f74b 0x4b87a7 0x7759a8 0x77598f 0x775525 0x75ccad 0x7849ef 0x471901
# 0x50ddcd bytes.growSlice+0x8d bytes/buffer.go:249
# 0x50d7fc bytes.(*Buffer).grow+0x13c bytes/buffer.go:151
# 0x698593 bytes.(*Buffer).Grow+0x73 bytes/buffer.go:168
# 0x698556 crypto/tls.(*Conn).readFromUntil+0x36 crypto/tls/conn.go:826
# 0x6964fe crypto/tls.(*Conn).readRecordOrCCS+0xb7e crypto/tls/conn.go:676
# 0x69be97 crypto/tls.(*Conn).readRecord+0x157 crypto/tls/conn.go:587
# 0x69bea0 crypto/tls.(*Conn).Read+0x160 crypto/tls/conn.go:1369
# 0x76322a net/http.(*connReader).Read+0x14a net/http/server.go:791
# 0x56d762 bufio.(*Reader).Read+0xe2 bufio/bufio.go:230
# 0x4b89a2 io.(*LimitedReader).Read+0x42 io/io.go:480
# 0x7773ba net/http.(*body).readLocked+0x3a net/http/transfer.go:839
# 0x777297 net/http.(*body).Read+0x117 net/http/transfer.go:831
# 0x4b89a2 io.(*LimitedReader).Read+0x42 io/io.go:480
# 0x4b8805 io.copyBuffer+0x1a5 io/io.go:430
# 0x55a385 io.Copy+0x65 io/io.go:389
# 0x55a342 net.genericReadFrom+0x22 net/net.go:671
# 0x5611aa net.(*TCPConn).readFrom+0x6a net/tcpsock_posix.go:54
# 0x55fcef net.(*TCPConn).ReadFrom+0x2f net/tcpsock.go:130
# 0x4b87a6 io.copyBuffer+0x146 io/io.go:416
# 0x781d54 io.Copy+0x54 io/io.go:389
# 0x781d0e net/http.persistConnWriter.ReadFrom+0xe net/http/transport.go:1801
# 0x56f74a bufio.(*Writer).ReadFrom+0x18a bufio/bufio.go:797
# 0x4b87a6 io.copyBuffer+0x146 io/io.go:416
# 0x7759a7 io.Copy+0x47 io/io.go:389
# 0x77598e net/http.(*transferWriter).doBodyCopy+0x2e net/http/transfer.go:412
# 0x775524 net/http.(*transferWriter).writeBody+0x3c4 net/http/transfer.go:370
# 0x75ccac net/http.(*Request).write+0xbac net/http/request.go:738
# 0x7849ee net/http.(*persistConn).writeLoop+0x18e net/http/transport.go:2424
14: 573440 [7879: 322723840] @ 0x50ddce 0x50d7fd 0x698594 0x698557 0x6964ff 0x69be98 0x69bea1 0x4b8350 0xb16fe5 0xb16fb4 0xb17725 0xb22667 0x471901
# 0x50ddcd bytes.growSlice+0x8d bytes/buffer.go:249
# 0x50d7fc bytes.(*Buffer).grow+0x13c bytes/buffer.go:151
# 0x698593 bytes.(*Buffer).Grow+0x73 bytes/buffer.go:168
# 0x698556 crypto/tls.(*Conn).readFromUntil+0x36 crypto/tls/conn.go:826
# 0x6964fe crypto/tls.(*Conn).readRecordOrCCS+0xb7e crypto/tls/conn.go:676
# 0x69be97 crypto/tls.(*Conn).readRecord+0x157 crypto/tls/conn.go:587
# 0x69bea0 crypto/tls.(*Conn).Read+0x160 crypto/tls/conn.go:1369
# 0x4b834f io.ReadAtLeast+0x8f io/io.go:335
# 0xb16fe4 io.ReadFull+0x64 io/io.go:354
# 0xb16fb3 golang.org/x/net/http2.readFrameHeader+0x33 golang.org/x/net@v0.17.0/http2/frame.go:237
# 0xb17724 golang.org/x/net/http2.(*Framer).ReadFrame+0x84 golang.org/x/net@v0.17.0/http2/frame.go:498
# 0xb22666 golang.org/x/net/http2.(*serverConn).readFrames+0x86 golang.org/x/net@v0.17.0/http2/server.go:820
...
There's no memory leak here... just normal traffic. Not even a lot of traffic.
What does `top` or `htop` on the machine show?
I'd also be interested if you can reproduce this without the API backend.
@mholt can this imply that my app doesn't close connections?
I'll get back to you with `htop` later
Quite possibly. It's a bit unfortunate that the graphs only show a single process. That's why I'm curious what the output of `top` is.
@mholt, hello. The app is a java jetty server that has an idle timeout of 200000ms (about 3.3 minutes) (doc)
Here's a graph of the java app, grafana, and caddy processes for the last month. As you can see, there was a restart after a certain memory threshold. The java app doesn't seem to grow in memory constantly. The java app receives mobile traffic, which can probably drop due to unstable internet connections.
As far as I can see in the caddy docs, caddy's default request idle timeout is 5 minutes (https://caddyserver.com/docs/caddyfile/options#timeouts). This doesn't explain why we see the growing memory consumption.
What are your clients / traffic patterns like?
Try setting some timeouts: https://caddyserver.com/docs/caddyfile/options#timeouts
The majority of the (few) goroutines that show up in the profiles are waiting for reads.
Timeouts didn't help (the blue line that ends abruptly is caddy before I added timeouts; the red line is caddy after I restarted it with the timeouts config added)
I added the following lines to my Caddyfile:

```
{
	servers {
		timeouts {
			read_body 5m
			read_header 5m
			write 5m
			idle 5m
		}
	}
}
```
I'm not sure how to describe the traffic pattern. In short, there are a lot of mobile connections. I guess they can drop off without properly closing connections. At the same time I observe a big number of such exceptions, ~15000 in 14 days: `org.eclipse.jetty.io.EofException: Early EOF`
I'm not sure whether this can be related. On my memory usage graphs you can also notice that the java memory sometimes grows somewhat similarly (not here though!)
Right now I'm also observing a memory leak within java
For the sake of context, here's another issue I created in the jetty server repo https://github.com/jetty/jetty.project/issues/11858
Interesting... one other thought I have is, what if you disable HTTP/3?
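For reference, a minimal sketch of how that can be done via the `protocols` option in the Caddyfile global options (assuming HTTP/3 is otherwise enabled by default):

```
{
	servers {
		protocols h1 h2
	}
}
```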
Has anyone been able to replicate this problem? Does it happen with 2.8.4?
I also just had a server memory swap detected due to Caddy after upgrading to 2.8.4. It seems the issue is still there (and on my side, it's only after upgrading from 2.7.6 to 2.8.4 that I observe it). Don't know if it can also be related, but I also use the Coraza WAF module (v2.0.0-rc.3)
> Don't know if it can also be related, but I also use the Coraza WAF module (v2.0.0-rc.3)
Very much could be related. We can't really take builds with plugins as any kind of evidence because the problem could be in the plugin.
@zguig52 A profile would help if you could get one. https://caddyserver.com/docs/profiling
Since my java app also has growing memory, I replaced caddy with nginx to test whether the issue is in caddy or in java. I've been observing the memory consumption for a week now, and I see that the java app's memory growth rate has decreased. Interesting.
We have a default `keepalive` config while nginx doesn't. Can you try setting the `keepalive*` options in `reverse_proxy` (docs) to something suitable for your app workload and monitor the memory consumption? You did mention that your upstream app doesn't close connections.
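A minimal sketch of what that could look like (the `myapp:8080` upstream and the specific values here are assumptions to adapt, not recommendations):

```
reverse_proxy myapp:8080 {
	transport http {
		keepalive 30s
		keepalive_idle_conns 16
	}
}
```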
Jetty follows the HTTP/1.0 (RFC1945) and HTTP/1.1 (RFC9112) specs.

- If the client says HTTP/1.0 in its request, the connection is not persistent unless the client also sends a `Connection: keep-alive` request header. Sending `Connection: close` with a HTTP/1.0 request is a no-op, as the value `close` is not recognized by the HTTP/1.0 spec.
- If the client says HTTP/1.1 in its request, the connection is persistent unless the client also sends a `Connection: close` request header. Sending `Connection: keep-alive` with a HTTP/1.1 request is a no-op, as the value `keep-alive` is not recognized by the HTTP/1.1 spec.

If you want to see a more comprehensive set of tests related to the `Connection` header in Jetty (not limited to just the `close` and `keep-alive` values), see the testcases.

That shows the 4 pieces of information Jetty uses to determine connection behavior: the HTTP spec version of the user-agent's request, the `Connection` header value of the user-agent's request, the application-specified response `Connection` header value, and the response `Connection` header value seen by the user-agent.
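As a quick illustration (against a hypothetical local server, not part of the Jetty test suite), this negotiation can be observed in curl's verbose output:

```
# HTTP/1.0: not persistent unless keep-alive is explicitly requested
curl -v --http1.0 -H 'Connection: keep-alive' http://localhost:8080/

# HTTP/1.1 (curl's default): persistent unless close is explicitly requested
curl -v -H 'Connection: close' http://localhost:8080/
```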
I have a minimal config, and I constantly observe leaking memory. I used version 2.5.0 and later upgraded to 2.7, but this didn't help.
Caddy is deployed via docker compose; here's a snippet: