ivansenic closed this issue 11 months ago.
We also observed a slow memory leak, which in our case results in process-compose
getting killed by the Linux kernel after 8-9 days.
I've collected profiling data using pprof.
Here is the memory usage right after starting:
```
Showing top 10 nodes out of 57
      flat  flat%   sum%        cum   cum%
 9733.45kB 69.77% 69.77%  9733.45kB 69.77%  golang.org/x/net/webdav.(*memFile).Write
 1133.89kB  8.13% 77.90%  1133.89kB  8.13%  github.com/gdamore/tcell/v2.(*CellBuffer).Resize
  518.02kB  3.71% 81.62%   518.02kB  3.71%  crypto/x509.(*CertPool).addCertFunc
     514kB  3.68% 85.30%      514kB  3.68%  bufio.(*Scanner).Scan
  513.50kB  3.68% 88.98%   513.50kB  3.68%  github.com/gdamore/tcell/v2/terminfo/v/vt220.init.0
  512.69kB  3.68% 92.66%   512.69kB  3.68%  github.com/gdamore/tcell/v2.map.init.3
  512.31kB  3.67% 96.33%   512.31kB  3.67%  regexp/syntax.(*compiler).inst
  512.02kB  3.67%   100%   512.02kB  3.67%  crypto/internal/nistec.NewP384Point (inline)
         0     0%   100%   512.02kB  3.67%  crypto/ecdsa.VerifyASN1
         0     0%   100%   512.02kB  3.67%  crypto/ecdsa.verifyNISTEC[go.shape.*uint8]
```
After running process-compose for a couple of hours, this changes to:
```
      flat  flat%   sum%        cum   cum%
11268.47kB 37.98% 37.98% 11268.47kB 37.98%  runtime.malg
 9733.45kB 32.81% 70.79%  9733.45kB 32.81%  golang.org/x/net/webdav.(*memFile).Write
 1702.26kB  5.74% 76.52%  1702.26kB  5.74%  github.com/gdamore/tcell/v2.(*CellBuffer).Resize
 1024.09kB  3.45% 79.98%  1024.09kB  3.45%  github.com/rivo/tview.(*Application).QueueUpdate
 1024.09kB  3.45% 83.43%  1024.09kB  3.45%  net/http.(*persistConn).roundTrip
  809.97kB  2.73% 86.16%   809.97kB  2.73%  bytes.growSlice
  518.02kB  1.75% 87.90%   518.02kB  1.75%  crypto/x509.(*CertPool).addCertFunc
     514kB  1.73% 89.64%      514kB  1.73%  bufio.(*Scanner).Scan
  513.50kB  1.73% 91.37%   513.50kB  1.73%  github.com/gdamore/tcell/v2/terminfo/v/vt220.init.0
  512.69kB  1.73% 93.10%   512.69kB  1.73%  github.com/gdamore/tcell/v2.map.init.3
```
Notice that runtime.malg now takes up 38% of the memory, while it was not even in the top 10 in the first profile. During execution, the memory attributed to runtime.malg grows slowly but steadily. Since runtime.malg allocates goroutine descriptors, according to this article this indicates that goroutines are leaking, typically because they block on channels that are never closed.
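For anyone who wants to confirm this on their own instance, the goroutine count makes the leak more explicit than the heap profile. The snippet below is a generic Go diagnostic, not process-compose code, and the one-minute interval is an arbitrary choice:

```go
// Generic Go diagnostic (not process-compose code): runtime.malg allocates
// goroutine descriptors, so if it keeps growing, the live goroutine count
// should be growing too. Dumping the goroutine profile with debug=1 groups
// identical stacks, which usually points straight at where the leaked
// goroutines are blocked.
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	for {
		fmt.Println("live goroutines:", runtime.NumGoroutine())
		pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)
		time.Sleep(time.Minute)
	}
}
```

If an HTTP pprof endpoint is available, `go tool pprof http://<host>/debug/pprof/goroutine` gives the same view remotely.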
Thank you for the detailed analysis @wilfried-huss I will look into that.
Just to report: it seems our issue was that the Windows Task Scheduler was killing the process once the machine left the idle state. We start process-compose from a scheduled task, and by default Windows sets <StopOnIdle>true</StopOnIdle>.
We are monitoring whether changing this flag fixes it. So far it has.
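For anyone hitting the same thing: the flag lives in the scheduled task's XML definition under Settings/IdleSettings; the fragment below is trimmed to just the relevant part.

```xml
<Task xmlns="http://schemas.microsoft.com/windows/2004/02/mit/task">
  <Settings>
    <IdleSettings>
      <!-- default is true, which stops the task when the machine leaves the idle state -->
      <StopOnIdle>false</StopOnIdle>
    </IdleSettings>
  </Settings>
</Task>
```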
Thank you for the update @ivansenic. As an anecdote, one of my running instances of process-compose has been running nonstop for more than 170 days on an old 1GB Raspberry Pi 😃
@wilfried-huss,
I've been running process-compose with pprof for a few hours and couldn't reproduce either a memory leak or a channel leak.
That said, a few factors could affect this:
Would you be able to fill in those gaps? It would help me replicate your scenario more closely.
Ideally, sharing your compose.yaml would be best.
Thanks for trying to reproduce the issue. I feared that it might depend on the configuration.
Here is a graph showing the memory usage for one of our process-compose instances:
Here is also the process-compose.yml file we use.
It runs Redis, three Python webservers (uvicorn + FastAPI) and two more Python processes that make HTTP and RPyC requests to one of the webservers.
```yaml
version: "0.5"

log_level: debug
log_length: 1000

processes:

  redis:
    command: redis-server
    readiness_probe:
      exec:
        command: "redis-cli GET test"
      initial_delay_seconds: 1
      period_seconds: 1
      timeout_seconds: 1
      success_threshold: 1
      failure_threshold: 20

  webserver-a:
    command: webserver-a
    readiness_probe:
      http_get:
        host: localhost
        scheme: http
        path: "/a/health"
        port: 8000
      initial_delay_seconds: 1
      period_seconds: 1
      timeout_seconds: 1
      success_threshold: 1
      failure_threshold: 20
    depends_on:
      redis:
        condition: process_healthy

  webserver-b:
    command: webserver-b
    readiness_probe:
      http_get:
        host: localhost
        scheme: http
        path: "/b/health"
        port: 3001
      initial_delay_seconds: 1
      period_seconds: 1
      timeout_seconds: 1
      success_threshold: 1
      failure_threshold: 20
    depends_on:
      redis:
        condition: process_started
    shutdown:
      parent_only: true

  rpyc-client:
    command: rpyc-client
    depends_on:
      webserver-b:
        condition: process_healthy
    shutdown:
      signal: 1 # SIGHUP
      timeout_seconds: 5
    availability:
      restart: on_failure
      backoff_seconds: 10
      max_restarts: 3

  webserver-c:
    command: webserver-c
    readiness_probe:
      http_get:
        host: localhost
        scheme: http
        path: "/c/health"
        port: 3002
      initial_delay_seconds: 20
      period_seconds: 30
      timeout_seconds: 10
      success_threshold: 1
      failure_threshold: 4
    depends_on:
      webserver-b:
        condition: process_healthy
    availability:
      restart: on_failure
      backoff_seconds: 10
      max_restarts: 3

  httpclient:
    command: httpclient
    depends-on:
      webserver-b:
        condition: process_healthy
```
In production the setup produces quite a lot of logs, but I could also reproduce the memory leak when the system was idle and barely produced any log output.
I will try to come up with a self-contained example to make the issue easier to reproduce.
Thanks again for looking into the problem!
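A stripped-down config along these lines is roughly what I have in mind; `python3 -m http.server` is just a stand-in for the real services, so treat it as a sketch rather than the exact reproducer:

```yaml
version: "0.5"

processes:
  dummy-server:
    command: python3 -m http.server 8000
    readiness_probe:
      http_get:
        host: localhost
        scheme: http
        path: "/"
        port: 8000
      initial_delay_seconds: 1
      period_seconds: 1
      timeout_seconds: 1
      success_threshold: 1
      failure_threshold: 20
```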
Updates:
pprof top after 4 hours:
I also added a remote command to monitor the heap size and some other parameters. I will leave it running on my RPi for a few days of monitoring.
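(The actual remote command is part of process-compose itself; the snippet below is only a sketch of the kind of numbers worth tracking for this issue, and the field selection is my own assumption.)

```go
// Sketch of a heap/goroutine snapshot; the reported fields are an assumption,
// not the actual process-compose monitoring command.
package main

import (
	"fmt"
	"runtime"
)

func snapshot() string {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return fmt.Sprintf("heap_alloc=%dKiB heap_sys=%dKiB num_gc=%d goroutines=%d",
		m.HeapAlloc/1024, m.HeapSys/1024, m.NumGC, runtime.NumGoroutine())
}

func main() {
	fmt.Println(snapshot())
}
```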
A few more questions @wilfried-huss:
BTW, there is a small typo in the httpclient process:
```yaml
httpclient:
  command: httpclient
  depends-on: # <-- should be depends_on
    webserver-b:
      condition: process_healthy
```
The problem seems to be connected to the readiness probes. If I remove them, the memory leak goes away.
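That would be consistent with the earlier runtime.malg observation. For illustration only (this is not process-compose's actual probe code), the classic shape of such a leak is one worker goroutine per probe tick, sending its result on an unbuffered channel that the caller stops reading after a timeout:

```go
// Deliberately leaky sketch, NOT process-compose code. Each tick starts a
// probe goroutine that reports its result on an unbuffered channel. When the
// probe outlives the timeout, the select below has already moved on, so the
// goroutine blocks forever on the send and its descriptor (runtime.malg
// memory) is never reclaimed.
package main

import (
	"fmt"
	"net/http"
	"runtime"
	"time"
)

// probeOnce starts an HTTP probe in its own goroutine and reports the result
// on an unbuffered channel.
func probeOnce(url string) <-chan error {
	ch := make(chan error) // unbuffered: the send below can block forever
	go func() {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
		}
		ch <- err // leaks this goroutine if the receiver already gave up
	}()
	return ch
}

func main() {
	ticker := time.NewTicker(time.Second) // period_seconds: 1
	defer ticker.Stop()
	for range ticker.C {
		select {
		case err := <-probeOnce("http://localhost:8000/a/health"):
			fmt.Println("probe result:", err)
		case <-time.After(time.Second): // timeout_seconds: 1
			// Timed out: the probe goroutine is abandoned while still
			// blocked on `ch <- err`.
			fmt.Println("probe timed out")
		}
		fmt.Println("live goroutines:", runtime.NumGoroutine())
	}
}
```

In this shape of code, buffering the result channel (make(chan error, 1)) or cancelling the request via a context deadline removes the leak; whether that matches the actual fix in process-compose is for the maintainer to say.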
@F1bonacc1 to answer your questions:
Thank you @wilfried-huss, that narrowed it down for me. The fix will be part of the next release.
Fixed in v0.77.4
I am not sure whether this has anything to do with the machine itself or with process-compose, but I would like to explain what is happening, and maybe somebody can help us.

We run process-compose on several Windows Server 2022 machines. It's very strange, but we noticed that at some point the process-compose process dies, and with it all of the processes it created are killed as well. There is nothing we can find in the logs or anywhere else, except this `Caught terminated` message:

We are sure that no restart of the machine is happening.

What's really odd is that, since we automated stopping and starting process-compose on all of the machines we use, this `Caught terminated` happens on all machines at approximately the same time. Furthermore, I was able to confirm that the last time this happened, it took exactly 10 days from the start of the process to the end:
~ Nov 9th 3.00PM
~ Nov 19th 3.00PM

Our monitoring graph below shows this (the non-blurred parts):

What could be the cause of this? It seems to be happening regularly. We will continue to monitor; if the 10-day theory is correct, the next shutdown will occur around Nov 30th 1AM.

Other details:
process-compose version: v0.69.0
Start command: `$env:PC_LOG_FILE="C:\Users\..\logs\process-compose.log"; C:\Users\..\process-compose\process-compose.exe up -f C:\Users\..\process-compose\process-compose.yaml`