F1bonacc1 / process-compose

Process Compose is a simple and flexible scheduler and orchestrator to manage non-containerized applications.
https://f1bonacc1.github.io/process-compose/
Apache License 2.0

Process compose killed on Windows Server 2022 #107

Closed: ivansenic closed this issue 11 months ago

ivansenic commented 11 months ago

I am not sure if this has anything to do with the machine itself or with process-compose, but I would like to explain what's happening, and maybe somebody can help us.

So we run process-compose on several Windows Server 2022 machines. It's very strange, but we noticed that at some point the process-compose process dies, and thus all processes created by it are killed as well. There is nothing we can find in the logs or in other resources, except this Caught terminated message:

23-11-07 23:08:40.551 INF Caught terminated - Shutting down the running processes...

Now we are sure that there is no restart of the machine happening.

What's really funny is that, since we have automated the whole process of stopping and starting process-compose on all the machines we use, this Caught terminated happens on all the machines at approximately the same time. Furthermore, I was able to confirm that the last time this happened, it took exactly 10 days from the start of the process to the end:

Our monitoring graph below showcases this (non-blur parts):

[screenshot: Google Cloud Monitoring graph]

Now, what can be the cause of this? It seems to be happening regularly. We will continue to monitor; if the 10-day theory is correct, the next shutdown will occur on ~Nov 30th at 1 AM.

Other details:

wilfried-huss commented 11 months ago

We also observed a slow memory leak, which in our case results in process-compose getting killed by the Linux kernel after 8-9 days.

I've collected profiling data using pprof.
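
For context, heap profiles like the ones below are typically captured and inspected with the standard Go pprof tooling; the endpoint address in this sketch is just a placeholder and depends on how pprof is exposed in the build:

```shell
# Grab a heap profile from a pprof HTTP endpoint (placeholder address),
# then print the top allocators -- the same view as the tables below.
curl -o heap.out http://localhost:6060/debug/pprof/heap
go tool pprof -top heap.out
```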

Here is the memory usage right after starting:

```shell
Showing top 10 nodes out of 57
      flat  flat%   sum%        cum   cum%
 9733.45kB 69.77% 69.77%  9733.45kB 69.77%  golang.org/x/net/webdav.(*memFile).Write
 1133.89kB  8.13% 77.90%  1133.89kB  8.13%  github.com/gdamore/tcell/v2.(*CellBuffer).Resize
  518.02kB  3.71% 81.62%   518.02kB  3.71%  crypto/x509.(*CertPool).addCertFunc
     514kB  3.68% 85.30%      514kB  3.68%  bufio.(*Scanner).Scan
  513.50kB  3.68% 88.98%   513.50kB  3.68%  github.com/gdamore/tcell/v2/terminfo/v/vt220.init.0
  512.69kB  3.68% 92.66%   512.69kB  3.68%  github.com/gdamore/tcell/v2.map.init.3
  512.31kB  3.67% 96.33%   512.31kB  3.67%  regexp/syntax.(*compiler).inst
  512.02kB  3.67%   100%   512.02kB  3.67%  crypto/internal/nistec.NewP384Point (inline)
         0     0%   100%   512.02kB  3.67%  crypto/ecdsa.VerifyASN1
         0     0%   100%   512.02kB  3.67%  crypto/ecdsa.verifyNISTEC[go.shape.*uint8]
```

After running process-compose for a couple of hours, this changes to:

```shell
      flat  flat%   sum%        cum   cum%
11268.47kB 37.98% 37.98% 11268.47kB 37.98%  runtime.malg
 9733.45kB 32.81% 70.79%  9733.45kB 32.81%  golang.org/x/net/webdav.(*memFile).Write
 1702.26kB  5.74% 76.52%  1702.26kB  5.74%  github.com/gdamore/tcell/v2.(*CellBuffer).Resize
 1024.09kB  3.45% 79.98%  1024.09kB  3.45%  github.com/rivo/tview.(*Application).QueueUpdate
 1024.09kB  3.45% 83.43%  1024.09kB  3.45%  net/http.(*persistConn).roundTrip
  809.97kB  2.73% 86.16%   809.97kB  2.73%  bytes.growSlice
  518.02kB  1.75% 87.90%   518.02kB  1.75%  crypto/x509.(*CertPool).addCertFunc
     514kB  1.73% 89.64%      514kB  1.73%  bufio.(*Scanner).Scan
  513.50kB  1.73% 91.37%   513.50kB  1.73%  github.com/gdamore/tcell/v2/terminfo/v/vt220.init.0
  512.69kB  1.73% 93.10%   512.69kB  1.73%  github.com/gdamore/tcell/v2.map.init.3
```

Notice that runtime.malg now takes up 38% of memory, while it was not in the top 10 in the first profile. During execution, it can be observed that the memory attributed to runtime.malg grows slowly but steadily.

According to this article, this indicates that channels are not being closed properly.
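
Since runtime.malg is what allocates goroutine descriptors, steady growth there usually means goroutines are piling up. A minimal, purely illustrative Go sketch (not process-compose code) of that pattern, where each tick spawns a goroutine that blocks on a channel nobody ever closes:

```go
package main

import "time"

// leakyProbe spawns one goroutine per tick that waits on a channel which is
// never written to or closed, so every goroutine (and the descriptor
// allocated for it by runtime.malg) stays alive for the process lifetime.
func leakyProbe(interval time.Duration) {
	for range time.Tick(interval) {
		done := make(chan struct{})
		go func() {
			<-done // blocks forever: nothing ever sends or closes
		}()
		// Correct code would close(done) (or cancel via a context) once the
		// probe run finishes; without that, goroutines accumulate every tick.
	}
}

func main() {
	go leakyProbe(10 * time.Millisecond)
	time.Sleep(time.Minute) // observe the goroutine count grow, e.g. via pprof
}
```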

F1bonacc1 commented 11 months ago

Thank you for the detailed analysis, @wilfried-huss. I will look into that.

ivansenic commented 11 months ago

Just to report: it seems our issue was that the Windows Task Scheduler was killing the process after the machine exited the idle state. We start the process via a scheduled task, and by default Windows sets <StopOnIdle>true</StopOnIdle>.

We are monitoring whether changing this flag fixes it. So far it does.
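
For reference, that flag lives in the idle settings of the exported task definition; a minimal sketch of the block we changed (element names follow the Task Scheduler XML schema, the rest of the task definition is elided):

```xml
<!-- Inside the scheduled task's <Settings> element -->
<IdleSettings>
  <!-- With true (the default we hit), Task Scheduler terminates the task
       as soon as the machine leaves the idle state. -->
  <StopOnIdle>false</StopOnIdle>
</IdleSettings>
```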

F1bonacc1 commented 11 months ago

Thank you for the update @ivansenic. As an anecdote, one of my running instances of process-compose has been running for more than 170 days nonstop on an old 1GB Raspberry Pi 😃

F1bonacc1 commented 11 months ago

@wilfried-huss, I've been running process-compose with pprof for a few hours and couldn't replicate either a memory leak or a channel leak.

Having said that, a few factors can affect this:

  1. Number of processes.
  2. Number of health probes.
  3. Number of processes in daemon mode.
  4. Amount of logs written by each process.
  5. Number of restarts (state changes) of the running processes.

Will you be able to fill in those details? It would help me replicate your scenario better. Ideally, if you could share your compose.yaml, that would be best.

wilfried-huss commented 11 months ago

Thanks for trying to reproduce the issue. I feared that it might depend on the configuration.

Here is a graph showing the memory usage for one of our process-compose instances:

[graph: process-compose memory usage over time]

Here is also the process-compose.yml file we use.

It runs Redis, three Python webservers (uvicorn + FastAPI) and two more Python processes that make HTTP and RPyC requests to one of the webservers.

```yaml
version: "0.5"

log_level: debug
log_length: 1000

processes:
  redis:
    command: redis-server
    readiness_probe:
      exec:
        command: "redis-cli GET test"
      initial_delay_seconds: 1
      period_seconds: 1
      timeout_seconds: 1
      success_threshold: 1
      failure_threshold: 20

  webserver-a:
    command: webserver-a
    readiness_probe:
      http_get:
        host: localhost
        scheme: http
        path: "/a/health"
        port: 8000
      initial_delay_seconds: 1
      period_seconds: 1
      timeout_seconds: 1
      success_threshold: 1
      failure_threshold: 20
    depends_on:
      redis:
        condition: process_healthy

  webserver-b:
    command: webserver-b
    readiness_probe:
      http_get:
        host: localhost
        scheme: http
        path: "/b/health"
        port: 3001
      initial_delay_seconds: 1
      period_seconds: 1
      timeout_seconds: 1
      success_threshold: 1
      failure_threshold: 20
    depends_on:
      redis:
        condition: process_started
    shutdown:
      parent_only: true

  rpyc-client:
    command: rpyc-client
    depends_on:
      webserver-b:
        condition: process_healthy
    shutdown:
      signal: 1 # SIGHUP
      timeout_seconds: 5
    availability:
      restart: on_failure
      backoff_seconds: 10
      max_restarts: 3

  webserver-c:
    command: webserver-c
    readiness_probe:
      http_get:
        host: localhost
        scheme: http
        path: "/c/health"
        port: 3002
      initial_delay_seconds: 20
      period_seconds: 30
      timeout_seconds: 10
      success_threshold: 1
      failure_threshold: 4
    depends_on:
      webserver-b:
        condition: process_healthy
    availability:
      restart: on_failure
      backoff_seconds: 10
      max_restarts: 3

  httpclient:
    command: httpclient
    depends-on:
      webserver-b:
        condition: process_healthy
```

In production the setup produces quite a lot of logs, but I could also reproduce the memory leak when the system was idle and barely produced any log output.

I will try to come up with a self-contained example to make the issue easier to reproduce.

Thanks again for looking into the problem!

F1bonacc1 commented 11 months ago

Updates.

Testing configuration (close enough to what you posted):

```yaml
version: "0.5"

log_level: debug
log_length: 1000

processes:
  redis:
    command: ./test_loop.bash
    readiness_probe:
      exec:
        command: "pidof process-compose"
      initial_delay_seconds: 1
      period_seconds: 1
      timeout_seconds: 1
      success_threshold: 1
      failure_threshold: 20

  webserver-a:
    command: ./test_loop.bash
    readiness_probe:
      http_get:
        host: localhost
        scheme: http
        path: "/live"
        port: 8080
      initial_delay_seconds: 1
      period_seconds: 1
      timeout_seconds: 1
      success_threshold: 1
      failure_threshold: 20
    depends_on:
      redis:
        condition: process_healthy

  webserver-b:
    command: ./test_loop.bash
    readiness_probe:
      http_get:
        host: localhost
        scheme: http
        path: "/live"
        port: 8080
      initial_delay_seconds: 1
      period_seconds: 1
      timeout_seconds: 1
      success_threshold: 1
      failure_threshold: 20
    depends_on:
      redis:
        condition: process_started
    shutdown:
      parent_only: true

  rpyc-client:
    command: ./test_loop.bash
    depends_on:
      webserver-b:
        condition: process_healthy
    shutdown:
      signal: 1 # SIGHUP
      timeout_seconds: 5
    availability:
      restart: on_failure
      backoff_seconds: 10
      max_restarts: 3

  webserver-c:
    command: ./test_loop.bash
    readiness_probe:
      http_get:
        host: localhost
        scheme: http
        path: "/live"
        port: 8080
      initial_delay_seconds: 20
      period_seconds: 30
      timeout_seconds: 10
      success_threshold: 1
      failure_threshold: 4
    depends_on:
      webserver-b:
        condition: process_healthy
    availability:
      restart: on_failure
      backoff_seconds: 60
      max_restarts: 3

  httpclient:
    command: ./test_loop.bash
    depends_on:
      webserver-b:
        condition: process_healthy

  pc_log:
    command: "tail -f -n100 process-compose-${USER}.log"
    working_dir: "/tmp"
    namespace: debug

  memory:
    command: "./bin/process-compose project state --with-memory"
    availability:
      restart: always
      backoff_seconds: 10
    namespace: debug
```

pprof top after 4 hours

Results:

```shell
> go tool pprof -alloc_space heap.out
File: process-compose
Type: alloc_space
Time: Dec 2, 2023 at 2:32am (IST)
Duration: 30.01s, Total samples = 206.33MB
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 199.44MB, 96.66% of 206.33MB total
Dropped 72 nodes (cum <= 1.03MB)
Showing top 10 nodes out of 26
      flat  flat%   sum%        cum   cum%
  187.04MB 90.65% 90.65%   187.04MB 90.65%  github.com/rivo/tview.(*TextView).parseAhead
    5.90MB  2.86% 93.51%     5.90MB  2.86%  strings.(*Builder).WriteString (inline)
       3MB  1.45% 94.97%     3.50MB  1.70%  github.com/rivo/tview.step
       3MB  1.45% 96.42%   196.44MB 95.21%  github.com/rivo/tview.(*TextView).Draw
    0.50MB  0.24% 96.66%     3.50MB  1.70%  github.com/rivo/tview.(*Table).Draw
         0     0% 96.66%        2MB  0.97%  github.com/InVisionApp/go-health/v2.(*Health).startRunner.func1
         0     0% 96.66%        2MB  0.97%  github.com/InVisionApp/go-health/v2.(*Health).startRunner.func2
         0     0% 96.66%   200.44MB 97.15%  github.com/f1bonacc1/process-compose/src/cmd.startTui
         0     0% 96.66%   140.93MB 68.30%  github.com/f1bonacc1/process-compose/src/tui.(*pcView).updateLogs.(*Application).QueueUpdateDraw.func4
         0     0% 96.66%    59.51MB 28.84%  github.com/f1bonacc1/process-compose/src/tui.(*pcView).updateTable.(*Application).QueueUpdateDraw.func4
(pprof)
```

I also added a remote command to monitor the heap size and some other parameters.

After 6m:

[screenshot]

After 4h:

[screenshot]

I will leave it running on my RPI for a few days for monitoring.

A few more questions @wilfried-huss:

  1. Which version did you use for the analysis?
  2. Does the graph that you posted show Resident or Virtual memory?
  3. The readiness probes that you posted use the HTTP scheme. Are they HTTPS in the original file?

BTW there is a small typo in the httpclient:

```yaml
  httpclient:
    command: httpclient
    depends-on:   # <-- should be depends_on
      webserver-b:
        condition: process_healthy
```

wilfried-huss commented 11 months ago

The problem seems to be connected with the readiness probes. If I remove them the memory leak goes away.
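
In other words, an entry like the webserver-a one above, but with its readiness_probe block removed entirely, roughly:

```yaml
  # Sketch only: same entry as in the file above, minus the probe.
  # With the probes gone, the slow memory growth no longer shows up.
  webserver-a:
    command: webserver-a
    depends_on:
      redis:
        condition: process_healthy
```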

@F1bonacc1 to answer your questions:

  1. I checked with version v0.69.0 and v0.75.1. They both show the same behavior.
  2. The graph shows virtual memory.
  3. The readiness probes are just HTTP on localhost.

F1bonacc1 commented 11 months ago

Thank you @wilfried-huss, that narrowed it down for me. The fix will be part of the next release.

F1bonacc1 commented 11 months ago

Fixed in v0.77.4