acouvreur / sablier

Start your containers on demand, shut them down automatically when there's no activity. Docker, Docker Swarm Mode and Kubernetes compatible.
https://acouvreur.github.io/sablier/
GNU Affero General Public License v3.0
1.31k stars · 46 forks

high cpu usage with only two services managed #272

Open cyrinux opened 6 months ago

cyrinux commented 6 months ago

First thanks for this cool toy 😃

Describe the bug
High CPU usage

Context

Expected behavior
Low CPU usage

Additional context
I manage 2 services with Sablier in a big docker-compose file, and when I run top I can see the sablier process hitting my tiny CPU very hard. When I restart it, the process calms down, until some unknown event makes it hammer the CPU again.

Is it a known issue? I can see there is a full rewrite PR in progress; could that help? (I haven't actually dug into the code.)

kraoc commented 3 months ago

I'll add a comment here.

I have several VMs with Sablier. I notice an increase in its cpu consumption without apparent reason.

When I restart Sablier, its cpu consumption returns to normal.

acouvreur commented 3 months ago

Hello @kraoc, do you have more input on that?

For example CPU usage graphs, or specific events after which you noticed it?

I will investigate this issue for sure when I have time, but more input to narrow down the issue will help!

kraoc commented 3 months ago

I'll try to add more useful debug tips :p (but I need to dig a bit)

My first thought points to a potential leak on first start (e.g. on a VM boot), because after a restart of Sablier its CPU consumption is OK. Dunno if I'm making myself clear :p

To be precise: I'm not talking about the Traefik plugin, but about acouvreur/sablier:latest.

luked34 commented 3 months ago

I have the same issue. I am running in Docker Desktop on Ubuntu, with the CPU limit under "Resources" set to 4.

After restarting the container the CPU usage is almost zero. But about 4 minutes later it ramps up to just over 100% (the maximum available is 400%). This happens whether or not I send any requests through Sablier. The usage is coming from the main Sablier container, not the proxy (I use Caddy).

It seems to work fine even when the CPU usage is high. I can't see anything useful in the logs, but let me know if there's anything else I can provide.

cyrinux commented 3 months ago

I still have the bug too, even with the latest version. Always the same behavior: after a restart it's cool, and when I hit one of the endpoints managed by Sablier, the process consumes a lot of CPU. I tried to tweak some Sablier params but that doesn't really help.

acouvreur commented 3 months ago

I'd be curious to have more information to try to reproduce the issue. What I'd need would be the following:

I've tried to reproduce the issue but even after multiple calls, I cannot go over 0.3% CPU usage.

I'm currently running a version of the program with CPU, memory, and trace profiling enabled.

If I can better reproduce your use cases, I'll be able to better understand why it behaves this way.

Thanks!

cyrinux commented 3 months ago

Hi @acouvreur,

For me it's still the same as https://github.com/acouvreur/sablier/issues/272#issue-2225340281, but with the latest Sablier version.

Some self-hosted services on a Synology DS918+ with an Intel(R) Celeron(R) CPU J3455 @ 1.50GHz.

I have around 60 containers running, if the Docker events flow matters.

(version=1.7.0, branch=HEAD, revision=0f58751390f2439f03c851f76ef9b9a410346b58)

I have config in dynamic-config.toml and some labels in my docker-compose.yaml.

Here are a few extracts. I have 4 endpoints managed by Sablier. I tried playing with the timeout and session time; sometimes the problem doesn't trigger instantly, but often it does.

dynamic-config.toml

...
[http.routers.gethomepage]
  rule = "Host(`foo.bar`)"
  service = "gethomepage"
  entryPoints = ["https"]
  middlewares = ["middleflare@docker", "sablier-gethomepage@docker", "authelia@docker", "compress@docker", "security-headers@docker"]
  [http.routers.gethomepage.tls]
    certresolver = "default"
    options = "default"
...

docker-compose.yml

...
  sablier:
    image: acouvreur/sablier:1.7.0
    container_name: sablier
    mem_limit: 256m
    cpu_shares: 1024
    networks:
    - traefik
    environment:
    - PROVIDER_NAME=docker
    labels:
      com.centurylinklabs.watchtower.enable: "true"
      deunhealth.restart.on.unhealthy: "true"
      traefik.enable: "true"
      # Group Middleware aur
      traefik.http.middlewares.sablier-aur.plugin.sablier.group: aur
      traefik.http.middlewares.sablier-aur.plugin.sablier.sablierUrl: http://sablier:10000
      traefik.http.middlewares.sablier-aur.plugin.sablier.sessionDuration: 720m
      traefik.http.middlewares.sablier-aur.plugin.sablier.blocking.timeout: 30s
      # Group Middleware gethomepage
      traefik.http.middlewares.sablier-gethomepage.plugin.sablier.group: gethomepage
      traefik.http.middlewares.sablier-gethomepage.plugin.sablier.sablierUrl: http://sablier:10000
      traefik.http.middlewares.sablier-gethomepage.plugin.sablier.sessionDuration: 120m
      traefik.http.middlewares.sablier-gethomepage.plugin.sablier.dynamic.theme: ghost
      traefik.http.middlewares.sablier-gethomepage.plugin.sablier.dynamic.showDetails: "false"

  gethomepage:
    image: ghcr.io/gethomepage/homepage
    container_name: gethomepage
    volumes:
    - /volume2/docker/homepage/config:/app/config:rw
    networks:
    - traefik
    depends_on:
    - traefik
    restart: unless-stopped
    labels:
      com.centurylinklabs.watchtower.enable: "true"
      deunhealth.restart.on.unhealthy: "false"
      sablier.enable: true
      sablier.group: gethomepage
      traefik.enable: "true"
      traefik.docker.network: traefik
...
acouvreur commented 3 months ago

I've done the following experiment:

  1. Create a container to be scaled:

    docker run -d -l sablier.enable=true --name whoami containous/whoami
  2. Run Sablier

    docker run -v '/var/run/docker.sock:/var/run/docker.sock' --cpu-shares=1024 -m 256m --name sablier -p 10000:10000 acouvreur/sablier:1.7.0 start --logging.level=trace
  3. Observe stats

    docker stats
  4. Generate load

    for i in {1..500}; do 
    curl "http://localhost:10000/api/strategies/dynamic?names=whoami&session_duration=30s" &
    curl "http://localhost:10000/api/strategies/blocking?names=whoami&session_duration=30s" &
    curl "http://localhost:10000/api/strategies/blocking?group=default&session_duration=30s" &
    curl "http://localhost:10000/api/strategies/dynamic?group=default&session_duration=30s" &
    done

This will do 2k requests with names and group matching with both strategies.

https://github.com/acouvreur/sablier/assets/22034450/815f1198-ad31-4be6-ba92-85f1d00c3372

Still I cannot reproduce the CPU usage.

Note that I've set the same resource restrictions that you have.

luked34 commented 3 months ago

I've done some testing on my setup, and I can only reproduce it when I use a docker socket proxy. Looks like there's another issue open in the same area: https://github.com/acouvreur/sablier/issues/287

cyrinux commented 3 months ago

I've done some testing on my setup, and I can only reproduce it when I use a docker socket proxy. Looks like there's another issue open in the same area: #287

Ok, I confess I cleaned up my config a little before pasting it in the previous post, and yes, I use a cetusguard proxy too! Good catch!

cyrinux commented 3 months ago

(quoting acouvreur's experiment from the earlier comment)

No doubt it works well in your lab, so I understand I'm facing the docker socket proxy issue too.

acouvreur commented 3 months ago

I can try with the docker socket proxy. It might be related to Sablier listening for incoming events. Can you share your setup with a docker socket proxy?

cyrinux commented 3 months ago
  cetusguard-sablier:
    container_name: cetusguard-sablier
    image: docker.io/hectorm/cetusguard
    read_only: true
    security_opt:
    - no-new-privileges:true
    volumes:
    - "/var/run/docker.sock:/var/run/docker.sock:ro"
    labels:
      com.centurylinklabs.watchtower.enable: "true"
      deunhealth.restart.on.unhealthy: "true"
    networks:
    - cetusguard-sablier
    environment:
      CETUSGUARD_BACKEND_ADDR: "unix:///var/run/docker.sock"
      CETUSGUARD_FRONTEND_ADDR: "tcp://:2375"
      CETUSGUARD_RULES: |
        ! Monitor events
        GET %API_PREFIX_EVENTS%
        ! Get system information
        GET %API_PREFIX_INFO%
        ! Get data usage information
        GET %API_PREFIX_SYSTEM%
        ! List containers
        GET %API_PREFIX_CONTAINERS%/json
        ! Inspect a container
        GET %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/json
        ! Start stop container
        POST %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/start
        POST %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/stop
        GET %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/stats
      CETUSGUARD_LOG_LEVEL: "2"
    restart: always

  sablier:
    image: acouvreur/sablier:1.7.0
    container_name: sablier
    mem_limit: 256m
    cpu_shares: 1024
    networks:
    - traefik
    - cetusguard-sablier
    environment:
    - PROVIDER_NAME=docker
    - DOCKER_HOST=tcp://cetusguard-sablier:2375
    labels:
      com.centurylinklabs.watchtower.enable: "true"
      deunhealth.restart.on.unhealthy: "true"
      traefik.enable: "true"
      # Group Middleware aur
      traefik.http.middlewares.sablier-aur.plugin.sablier.group: aur
      traefik.http.middlewares.sablier-aur.plugin.sablier.sablierUrl: http://sablier:10000
      traefik.http.middlewares.sablier-aur.plugin.sablier.sessionDuration: 720m
      traefik.http.middlewares.sablier-aur.plugin.sablier.blocking.timeout: 30s
      # Group Middleware gethomepage
      traefik.http.middlewares.sablier-gethomepage.plugin.sablier.group: gethomepage
      traefik.http.middlewares.sablier-gethomepage.plugin.sablier.sablierUrl: http://sablier:10000
      traefik.http.middlewares.sablier-gethomepage.plugin.sablier.sessionDuration: 120m
      traefik.http.middlewares.sablier-gethomepage.plugin.sablier.dynamic.theme: ghost
      traefik.http.middlewares.sablier-gethomepage.plugin.sablier.dynamic.showDetails: "false"
acouvreur commented 3 months ago

So, I've tried with the docker socket proxy with the following setup:

version: "3.7"

services:
  cetusguard-sablier:
    container_name: cetusguard-sablier
    image: docker.io/hectorm/cetusguard
    read_only: true
    security_opt:
    - no-new-privileges:true
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
    environment:
      CETUSGUARD_BACKEND_ADDR: "unix:///var/run/docker.sock"
      CETUSGUARD_FRONTEND_ADDR: "tcp://:2375"
      CETUSGUARD_RULES: |
        ! Monitor events
        GET %API_PREFIX_EVENTS%
        ! Get system information
        GET %API_PREFIX_INFO%
        ! Get data usage information
        GET %API_PREFIX_SYSTEM%
        ! List containers
        GET %API_PREFIX_CONTAINERS%/json
        ! Inspect a container
        GET %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/json
        ! Start stop container
        POST %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/start
        POST %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/stop
        GET %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/stats
      CETUSGUARD_LOG_LEVEL: "2"

  sablier:
    image: acouvreur/sablier:1.7.0
    ports:
      - 10000:10000
    environment:
      - PROVIDER_NAME=docker
      - DOCKER_HOST=tcp://cetusguard-sablier:2375

  whoami:
    image: containous/whoami:v1.5.0
    labels:
      - sablier.enable=true

And still after a lot of http calls:

d64a6122dad7   cetusguard-sablier   0.00%     31.59MiB / 7.744GiB   0.40%     789kB / 2.1MB   0B / 0B         9
9bd01d6419c6   sablier-whoami-1     0.00%     2.629MiB / 7.744GiB   0.03%     1.31kB / 0B     0B / 0B         4
eb3ee597b947   sablier-sablier-1    0.00%     50.02MiB / 7.744GiB   0.63%     4.49MB / 12MB   455kB / 4.1kB   12
luked34 commented 3 months ago

Docker sock config and proxy config are both commented out - uncomment one at a time. Also, docker.sock.raw is for Docker Desktop; use docker.sock otherwise.

name: sablier2
services:
  proxy2:
    restart: always
    image: caddy:2.8.4-with-sablier # built as per sablier docs
    networks:
      - network2
    ports:
      - "8000:80" # whoami
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro

  whoami2:
    restart: "no"
    image: traefik/whoami
    networks:
      - network2
    labels:
      - sablier.enable=true
      - sablier.group=demo

  sablier2:
    restart: always
    image: acouvreur/sablier
    networks:
      - network2
      - sablier-socket-network2
#    volumes:
#      - '/var/run/docker.sock.raw:/var/run/docker.sock'
#    environment:
#      - DOCKER_HOST=tcp://socket-proxy-sablier2:2375

  socket-proxy-sablier2:
    image: lscr.io/linuxserver/socket-proxy:latest
    environment:
      - ALLOW_START=1 
      - ALLOW_STOP=1 
      - ALLOW_RESTARTS=1 
      - CONTAINERS=1 
      - EVENTS=1 
      - INFO=1 
      - NETWORKS=1 
      - PING=1 
      - POST=1 
      - SERVICES=1 
      - VERSION=1 
    volumes:
      - /var/run/docker.sock.raw:/var/run/docker.sock:ro
    restart: always
    read_only: true
    tmpfs:
      - /run
    networks:
      - sablier-socket-network2

networks:
  network2:
    name: sablier-network2
  sablier-socket-network2:
    name: sablier-socket-network2
acouvreur commented 3 months ago

I can create a discord server for the project so we can actually debug this together, what do you think ?

I need something with voice channels, do you have any suggestions ?

acouvreur commented 3 months ago

(quoting @luked34's compose setup from the previous comment)

Ok, thanks @luked34, I can actually reproduce the issue now!

89509d81c1ec   sablier-sablier-1                100.17%   11.55MiB / 7.744GiB   0.15%     26.7kB / 18.6kB   889kB / 4.1kB   8
luked34 commented 3 months ago

I can create a discord server for the project so we can actually debug this together, what do you think ?

I need something with voice channels, do you have any suggestions ?

I would have suggested discord, might be worth setting one up even if we don't use it for this issue

acouvreur commented 3 months ago

I think I found the issue: it seems to happen when the event stream returns a nil channel.

In that case the event stream cannot be properly initialized and the select statement reads nil values indefinitely.

I can make a fix to properly break out of the faulty event stream. However, this means there is something wrong with your setup, since we can't properly listen to the event stream.

I'll publish a fix in the beta branch. Please test it when it's released and look for errors in the logs for details on your specific cases.
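A minimal Go sketch of the break-out fix described above (hypothetical code, not Sablier's actual implementation): a receive from a closed channel returns immediately with `ok == false`, so without checking `ok` the select loop spins forever at full CPU. Checking it lets the loop exit cleanly.

```go
package main

import "fmt"

// consumeEvents drains an event channel. Without the ok check, a closed
// channel makes the select fire instantly on every iteration (the
// receive never blocks), which is the busy loop that burned the CPU.
func consumeEvents(events <-chan string) (processed int) {
	for {
		select {
		case e, ok := <-events:
			if !ok {
				// Channel closed: the faulty stream is dead.
				// Returning here is the "break out" fix.
				return processed
			}
			processed++
			_ = e // a real consumer would handle the event here
		}
	}
}

func main() {
	events := make(chan string, 2)
	events <- "container start"
	events <- "container stop"
	close(events) // simulates an event stream that ended or never initialized

	fmt.Println("processed", consumeEvents(events), "events, then exited cleanly")
}
```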

acouvreur commented 3 months ago

So basically what happens is that the client is registered without making sure the Docker connection is established. It tries to listen for events, but gets back closed channels and spins the CPU like a busy for loop.

But it works after a restart because, by then, the connection has been established.

I've changed this behavior so Sablier will actually exit if the connection cannot be established.

You've all probably hit some kind of race in which the connection to the Docker host was unavailable at the time of registering the event stream.

This ended up burning the CPU, for which I've also added a fail-safe.

acouvreur commented 3 months ago

The fix is available in v1.7.1-beta.1, let me know if still happens.

kraoc commented 3 months ago

server:
  port: 10000
  base-path: /

sessions:
  default-duration: 5m
  expiration-interval: 30s

logging:
  level: warn

strategy:
  dynamic:
    show-details-by-default: false
    default-theme: zogg
    default-refresh-frequency: 5s
  blocking:
    default-timeout: 5m


* endpoint? none... simple Proxmox/Dozzle monitoring...
* Usage: Sablier container + Sablier Traefik plugin

Maybe important points...
* It seems to consume CPU only after a vm/docker/container start (e.g. in case of a VM restart)
* usage of wollomatic/socket-proxy:1

So, maybe in a VM restart scenario Sablier (which depends on the socket proxy) is started before the socket proxy.
This may explain why it works fine when I restart Sablier manually, as the proxy is already started...

_Sorry for the late reply :p_
luked34 commented 3 months ago

Thanks for the quick fix, it seems to be working.

The log now shows two extra messages:

I've tried adding a depends_on to the compose file, restarting Sablier while leaving the socket proxy running, and changing the socket proxy permissions to allow everything, but the messages still appear.

CPU usage is consistently below 1% now, so I'm happy to ignore the messages :)

kraoc commented 3 months ago

Maybe using willfarrell/autoheal, with a setting to mark the Sablier container unhealthy when no socket proxy is available, would do the trick? Relying on autoheal's ability to restart containers? Just a suggestion... :p

cyrinux commented 3 months ago

The fix is available in v1.7.1-beta.1, let me know if still happens.

I will check again later, but at first glance it's far better! Thank you!

acouvreur commented 3 months ago

Maybe using willfarrell/autoheal, with a setting to mark the Sablier container unhealthy when no socket proxy is available, would do the trick? Relying on autoheal's ability to restart containers? Just a suggestion... :p

When Sablier cannot connect to the docker host, it should fail and exit.

If you simply add the option restart: always you will have a default kind of autoheal.
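A minimal compose sketch of that "default autoheal" (service and network names are assumptions borrowed from the snippets earlier in the thread):

```yaml
services:
  sablier:
    image: acouvreur/sablier:1.7.1-beta.1
    restart: always            # with the fix, Sablier exits on a dead
                               # connection and Docker restarts it
    depends_on:
      - socket-proxy-sablier2  # start ordering only; does not wait for readiness
    environment:
      - PROVIDER_NAME=docker
      - DOCKER_HOST=tcp://socket-proxy-sablier2:2375
```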

kraoc commented 3 months ago

I already have restart: always :p But all my compose files use inheritance (multi-level extends), so my setup is somewhat unusual I think ^^