coollabsio / coolify

An open-source & self-hostable Heroku / Netlify / Vercel alternative.
https://coolify.io
Apache License 2.0

[Bug]: High cpu usage /app/sentinel #2771

Open mutonby opened 1 month ago

mutonby commented 1 month ago

Description

CPU usage spikes to 99% on all cores (8 cores in our case), caused by the process /app/sentinel.

Minimal Reproduction (if possible, example repository)

We have 4 static frontends, 1 MongoDB, and 1 backend deployed with Docker Compose.

Exception or Error

Error logs seen in Sentinel:

[GIN] 2024/07/08 - 11:34:00 | 200 | 59.17µs | 127.0.0.1 | GET "/api/health"
Failed to get usage for partition /boot: no such file or directory
Error getting container metrics: json: unsupported value: NaN
Error getting container metrics: json: unsupported value: NaN
[GIN] 2024/07/08 - 11:34:10 | 200 | 46.316µs | 127.0.0.1 | GET "/api/health"
Error getting container metrics: json: unsupported value: NaN
Error getting container metrics: json: unsupported value: NaN

Version

v4.0.0-beta.306

mutonby commented 1 month ago

Any idea what it might be or if I can do something to reduce the spikes?

mutonby commented 1 month ago

image

victorcavero14 commented 1 month ago

Same bug here

ksaitor commented 1 month ago

Same here! This is really bad. CPU went to max. Instance became unresponsive.

ksaitor commented 1 month ago

@andrasbacsai just suggest that people use Netdata for their server monitoring and get them to sponsor Coolify. IMHO, adding these additional dependencies sabotages the stability of Coolify.

mejiasd3v commented 1 month ago

+1

After selecting 1 week in the period dropdown, the server became unresponsive.

Rhiz3K commented 1 month ago

The same thing (also with a RAM spike) happened today after updating to 307: image

People on Discord helped me disable metrics (restarting Sentinel didn't help), and it went back to normal: image

PovarovDenis commented 1 month ago
image

Same here, on a Hetzner ARM server.

nickneustroev commented 1 month ago

Same. Had to disable it. image

maietta commented 1 month ago

I found my servers unresponsive this morning.

One of them was caused by Sentinel doing something strange with disk I/O and ramping up CPU usage.

The others were because of a crippled network caused by a hurricane.

ViX3L commented 1 month ago

Both the RAM and CPU spikes were intolerable. I had to restart the Sentinel container, and then operations were smooth again.

devdjdjdj commented 1 month ago

Use Netdata in the meantime:

version: '3'
services:
  netdata:
    image: netdata/netdata
    container_name: netdata
    pid: host
    network_mode: host
    restart: unless-stopped
    cap_add:
      - SYS_PTRACE
      - SYS_ADMIN
    security_opt:
      - 'apparmor:unconfined'
    volumes:
      - 'netdataconfig:/etc/netdata'
      - 'netdatalib:/var/lib/netdata'
      - 'netdatacache:/var/cache/netdata'
      - '/:/host/root:ro,rslave'
      - '/etc/passwd:/host/etc/passwd:ro'
      - '/etc/group:/host/etc/group:ro'
      - '/etc/localtime:/etc/localtime:ro'
      - '/proc:/host/proc:ro'
      - '/sys:/host/sys:ro'
      - '/etc/os-release:/host/etc/os-release:ro'
      - '/var/log:/host/var/log:ro'
      - '/var/run/docker.sock:/var/run/docker.sock:ro'
    labels:
      # Traefik basic-auth middleware: user "test" with a bcrypt-hashed password
      - 'traefik.http.middlewares.http-0-o0w4g8k-netdata.basicauth.users=test:$2y$12$ci.4U63YX83CwkyUrjqxAucnmi2xXOIlEF6T/KdP9824f1Rf1iyNG'
volumes:
  netdataconfig: null
  netdatalib: null
  netdatacache: null

Refer to this for basic auth: https://coolify.io/docs/knowledge-base/traefik/basic-auth