coollabsio / coolify

An open-source & self-hostable Heroku / Netlify / Vercel alternative.
https://coolify.io
Apache License 2.0
31.94k stars, 1.65k forks

[Bug]: Unexplained High CPU Usage Spike in Coolify Following Deployment Attempt #2110

Open galacoder opened 4 months ago

galacoder commented 4 months ago

Description

I have been running Coolify for 8 days with various services, encountering no prior issues. However, on the night of April 30th EST, I experienced a significant CPU usage spike starting around 11 PM, shortly after an unsuccessful attempt to deploy a React application. It is unclear whether this issue was directly related to the deployment failure, a potential attack, or another problem.

Expected Behavior

CPU usage should remain stable, without significant spikes, particularly when no active deployments or heavy tasks are underway.

Actual Behavior

CPU usage unexpectedly spiked to over 300% and remained high throughout the night, which was unusual and concerning, given the context.

Environment

Additional Context

The sudden surge in CPU usage occurred after the deployment failure, but it is uncertain whether the spike was a direct result of that event, a security issue, or another underlying problem. This incident warrants further investigation to prevent future occurrences.

I have already restarted my VPS twice, but the problem persisted.

I would appreciate any insights or troubleshooting steps you could recommend to help identify and resolve the root cause of this spike. Thank you for your assistance.

Minimal Reproduction (if possible, example repository)

Steps to Reproduce

  1. Set up and run Coolify with multiple services for over a week.
  2. Attempt to deploy a React app, which fails.
  3. Monitor CPU usage and observe an unexpected spike starting around 11 PM EST, continuing without additional user actions.

Exception or Error

Screenshots

Screenshot 2024-05-01 at 23 11 05 Screenshot 2024-05-01 at 23 11 33

Version

4.0.0-beta.271

marwie commented 4 months ago

Hello, I don't have many details to share right now. We installed Coolify on some small Hetzner servers yesterday. It was working like a breeze then, but today every deployment takes over all resources for several minutes, with CPU usage just above 200% and very high I/O on the hard drive.

I'm currently trying to figure out what might be causing this, and whether a change in my Docker container is responsible, but I'm waiting for the server to become available again.

image Last 24 hours

image Last hour

The Hetzner console tells me it's out of memory. The server is one of the smallest available, with just 4 GB, but the same process worked fine yesterday.

image

UPDATE: Restarting the Hetzner server fixed the issue for me (for now). I hope it doesn't happen again.

AspireOne commented 3 months ago

Second this. As of today, I have the same issue. Fresh new Coolify installation, fresh new Contabo server (4vCPU, 6GB RAM). Takes up 100% of CPU. A small nodejs backend, which takes 5s to build on my machine, has been building for 10 minutes and counting.

v4.0.0-beta.294

root@vmi1916516:~# mpstat -u | awk '/all/ {printf "%.2f%%\n", 100-$12}'
100.00%
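
A note for anyone reusing this one-liner: `mpstat` prints `%idle` as the last column, and a fixed `$12` only points at it for some sysstat versions and timestamp formats. With a 12-hour clock, for example, the extra AM/PM field shifts every column by one. Also, `mpstat` without an interval reports the average since boot rather than the current load. Below is a minimal Python sketch that reads the last field of the `all` row instead. The two sample rows are made up for illustration and are not output from this server.

```python
def busy_percent_from_mpstat(output: str) -> float:
    """Return 100 - %idle from the first 'all' row of `mpstat -u` output.

    Assumes %idle is the last column of the row; adjust if your
    sysstat version lays the columns out differently.
    """
    for line in output.splitlines():
        fields = line.split()
        # The 'all' row aggregates every CPU; header and blank lines lack it.
        if "all" in fields:
            idle = float(fields[-1])
            return round(100.0 - idle, 2)
    raise ValueError("no 'all' row found in mpstat output")


# Hypothetical sample rows with 24-hour and 12-hour timestamps.
SAMPLE_24H = "14:02:11  all  62.10  0.00  30.40  5.00  0.00  0.50  0.00  0.00  0.00  2.00"
SAMPLE_12H = "02:02:11 PM  all  62.10  0.00  30.40  5.00  0.00  0.50  0.00  0.00  0.00  2.00"

if __name__ == "__main__":
    print(busy_percent_from_mpstat(SAMPLE_24H))
    print(busy_percent_from_mpstat(SAMPLE_12H))
```

To sample the current load rather than the since-boot average, run something like `mpstat -u 1 5` and pass its output to the function; it returns the first `all` row it finds.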

Don't know if it's normal, but php 8.2 and php-fpm often take up to 70% of my CPU when navigating Coolify. And not just for a split second, but steadily.

image


The usage jumps from one thing to the other, all while building a nodejs server with 200 lines of code...

image

image


I was planning to build my backend on Coolify. Do I just ditch it now or what?

marke-dev commented 3 months ago

Similar issue: Coolify takes up large amounts of CPU, and the usage fluctuates on the /init command.

Screenshot 2024-06-05 at 9 06 06 PM Screenshot 2024-06-05 at 10 12 29 PM

swissbyte commented 3 months ago

same here.

BTW: it would be awesome if we could get a stats overview of the running containers directly in Coolify, showing their CPU and memory usage.

jamesryancooper commented 3 months ago

Seeing a similar issue on v4.0.0-beta.297.

mpanibrat commented 3 months ago

Same issue. Worked fine for a few days and now started to hang unresponsive on NextJS deployment.

Coolify: v4.0.0-beta.297
Hetzner: CPX11 | x86 | 40 GB | us-west

image

swissbyte commented 3 months ago

Sometimes it helps to restart Coolify or the host server. Then it's OK again for about 1-2 hours.

Nedi11 commented 3 months ago

Check if you have enough RAM. In my case kswapd (the kernel thread that swaps memory pages out to disk) would take all the CPU, and adding more RAM fixed it.
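
One way to run that check is to read `/proc/meminfo` on the affected Linux host. The sketch below is a minimal example, not a diagnostic tool: it only parses the standard `MemTotal`, `MemAvailable`, `SwapTotal` and `SwapFree` fields and turns them into percentages. What counts as "too low" is left to the reader.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}.

    Most fields are reported in kB; a few (such as HugePages_Total)
    are plain counts and are kept as-is.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_pressure(info: dict[str, int]) -> dict[str, float]:
    """Summarise available RAM and used swap as percentages."""
    mem_total = info["MemTotal"]
    swap_total = info.get("SwapTotal", 0)
    swap_used = swap_total - info.get("SwapFree", 0)
    return {
        "mem_available_pct": round(100 * info["MemAvailable"] / mem_total, 1),
        # A host without swap reports SwapTotal 0; avoid dividing by zero.
        "swap_used_pct": round(100 * swap_used / swap_total, 1) if swap_total else 0.0,
    }
```

On the server itself this would be called as `memory_pressure(parse_meminfo(open("/proc/meminfo").read()))`. A low `mem_available_pct` together with a high `swap_used_pct` during a spike would point towards memory pressure rather than Coolify's own jobs.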

swissbyte commented 3 months ago

I do have enough free RAM. It's also not the kswapd process that eats up all the CPU. Thanks for the suggestion.

ck-euan commented 3 months ago

I am having the same issue. I have three servers running, a Coolify server, a build server and a server just to run the containers (2 nextjs apps). It's always the server running the containers that goes down.

image

This is the chart from the most recent crash. It's particularly weird because no deployments were happening at the time and I can't see any traffic spikes either; it just seemed to be random.

swissbyte commented 3 months ago

@andrasbacsai is it safe to roll back from 297 to 4.0.0-beta.296, for example? The high CPU usage is making my prod environment nearly unusable... Yes, I know, I learned my lesson the hard way: never enable auto-updates on prod systems...

atilladeniz commented 3 months ago

same issue!!

CleanShot-Server-Nutzung  Hostinger-Google Chrome-2024-06-21 at 00 50 18@2x

swissbyte commented 3 months ago

My coolify Docker container also shows as "unhealthy".

johnpccd commented 3 months ago

Same here v4.0.0-beta.297 image

image

Last night I had a failed Next.js deployment, but the high CPU usage only started about 10 hours later.

Edit: I tried to open a bash shell in the coolify container. I can get in, but once I'm inside, any command hangs forever, even pwd. The same happens with the coolify-db container.

I restarted the coolify container. Let's see if the problem appears again.

swissbyte commented 3 months ago

Deployment of a service or also redeployment takes around 10-15 minutes. The same service was redeployed within 15-50 seconds before…

atilladeniz commented 3 months ago

I reinstalled it with Ubuntu 20.04 and now it works fine. Ubuntu 22.04 and 24.04 are not working for me! On another server I use Coolify with Debian 11, and it's better!

swissbyte commented 3 months ago

Interesting. Did you try 22.04, see high CPU usage, and then try 24.04 as well?

Or in other words… is it reproducible?

atilladeniz commented 3 months ago

Yes, I tried both 22.04 and 24.04, and both show the same 100% CPU usage issue! Only Ubuntu 20.04 and Debian 11 are good!

marke-dev commented 3 months ago

I have Ubuntu 20.04 and still have this issue. Every week coolify will fail and I have to restart the server. It's just coolify.

swissbyte commented 3 months ago

What happens if we limit the cpu usage of the coolify container?
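
For reference, Docker can cap a running container's CPU with `docker update --cpus`. The sketch below only builds and runs that command from Python. The container name `coolify` is the one mentioned in this thread, and the limit of 1.5 CPUs in the usage note is an arbitrary example. Whether a cap actually helps here is untested: it would keep Coolify from starving other containers, but it would not remove the underlying load.

```python
import subprocess


def cpu_limit_command(container: str, cpus: float) -> list[str]:
    """Build a `docker update` call that caps a running container's CPU."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    # ':g' drops a trailing '.0', so 2.0 becomes '2' and 1.5 stays '1.5'.
    return ["docker", "update", "--cpus", f"{cpus:g}", container]


def apply_cpu_limit(container: str, cpus: float) -> None:
    """Run the command; raises CalledProcessError if Docker rejects it."""
    subprocess.run(cpu_limit_command(container, cpus), check=True)
```

Calling `apply_cpu_limit("coolify", 1.5)` is equivalent to running `docker update --cpus 1.5 coolify` on the host. The limit belongs to that container instance, so it is lost whenever the container is recreated, for example by an upgrade.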

swissbyte commented 3 months ago

Hey Guys.... v. 298 is out now :) https://github.com/coollabsio/coolify/releases/tag/v4.0.0-beta.298 At least on my side, it seems to not really change the CPU behaviour dramatically... How about you?

atilladeniz commented 3 months ago

For all of you with the 100% CPU issue: is anybody using Supabase? I reinstalled everything except Supabase and have no issues. When I install Supabase, the 100% usage happens again.

marke-dev commented 3 months ago

I have had nothing installed or running at one point, other than coolify and still had spikes

atilladeniz commented 3 months ago

Which VPS provider do you use?

johnpccd commented 3 months ago

> For all of you with the 100% CPU issue: is anybody using Supabase? I reinstalled everything except Supabase and have no issues. When I install Supabase, the 100% usage happens again.

I don't have Supabase, and I saw it once.

marwie commented 3 months ago

Not using Supabase. Hosting on Hetzner, @atilladeniz.

atilladeniz commented 3 months ago

Very strange! This problem is spooky. I haven't slept well for a few days, because at any second the server can go to 100%, which slows my connections and raises latency on the VPS.

cpulimit doesn't work for me! The usage always goes back up!

atilladeniz commented 3 months ago

@marke-dev can you send the logs of the coolify Docker container? I want to look through them. Maybe we can find the bug fix together!

johnpccd commented 3 months ago

@atilladeniz can you check if the cpu spike coincided with a database backup?

marke-dev commented 3 months ago

> Which VPS provider do you use?

I'm not on a VPS but a dedicated server with IONOS

marke-dev commented 3 months ago

> @marke-dev can you send the logs of the coolify Docker container? I want to look through them. Maybe we can find the bug fix together!

Yes definitely, I'll send that later today

madebylydia commented 3 months ago

Hello, I am experiencing a similar issue on my side too, using v4.0.0-beta.306.

System spec: this is running on an Oracle Always Free Tier ARM machine.
![System specification](https://github.com/coollabsio/coolify/assets/61093863/be25abaa-0d1d-42ee-b798-35572974dcfb)
![lscpu report](https://github.com/coollabsio/coolify/assets/61093863/ced4be0b-7917-48d0-9a05-3c9592972b59)

btop analysis:
![image](https://github.com/coollabsio/coolify/assets/61093863/bedda9de-fd6a-4776-b135-06c3447216d4)

On my side, however, it doesn't really seem to be related to any failed deployment. I would love to help by providing any information about this issue, so feel free to request anything from me.

swissbyte commented 3 months ago

I have deployed beta.260 on the same server configuration, but on a new VPS.

260 works perfectly fine. I think I will stay with 260 for the moment, with auto-update disabled.

What's your plan?

rohit-32 commented 3 months ago

> Same issue. Worked fine for a few days and now started to hang unresponsive on NextJS deployment.
>
> Coolify: v4.0.0-beta.297
> Hetzner: CPX11 | x86 | 40 GB | us-west
>
> image

This is happening because next build uses all CPUs. I am trying to figure out how to restrict that.

Nedi11 commented 3 months ago

> > Same issue. Worked fine for a few days and now started to hang unresponsive on NextJS deployment.
> >
> > Coolify: v4.0.0-beta.297
> > Hetzner: CPX11 | x86 | 40 GB | us-west
> >
> > image
>
> This is happening because next build uses all CPUs. I am trying to figure out how to restrict that.

You can set resource limits on an app here: image

Also try this to limit the CPUs of just next build: https://github.com/vercel/next.js/discussions/65983

ck-euan commented 2 months ago

Just had another outage, roughly a week after switching from EC2 to DigitalOcean. Once again it was the server running the apps that crashed, not the Coolify or build servers. No build was running either; it was just spontaneous. I did have to reboot both the Coolify server and the apps server, though. Rebooting just the apps server didn't work.

image

swissbyte commented 2 months ago

One question... why do we all run production or at least "should be always up" stuff on a server with automatic updates on? 😅

johnpccd commented 2 months ago

> One question... why do we all run production or at least "should be always up" stuff on a server with automatic updates on? 😅

I don't have automatic updates enabled. The issue happened on a version that was working fine for some time.

swissbyte commented 2 months ago

> > One question... why do we all run production or at least "should be always up" stuff on a server with automatic updates on? 😅
>
> I don't have automatic updates enabled. The issue happened on a version that was working fine for some time.

OK, I see. I started with Coolify 230 or so, and after it updated to 297 the problems started, I guess... That's why I assumed we were all updating.

ghsteff commented 2 months ago

> > > Same issue. Worked fine for a few days and now started to hang unresponsive on NextJS deployment.
> > >
> > > Coolify: v4.0.0-beta.297
> > > Hetzner: CPX11 | x86 | 40 GB | us-west
> > >
> > > image
> >
> > This is happening because next build uses all CPUs. I am trying to figure out how to restrict that.
>
> You can set resource limits on an app here: image
>
> Also try this to limit the CPUs of just next build: vercel/next.js#65983

Adding swap space to my little 2 GB Hetzner server fixed this for me.

From this thread https://github.com/coollabsio/coolify/issues/2088#issuecomment-2082327051
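
For readers who have not added swap before, the conventional Linux procedure is to create a swap file and enable it. The sketch below only lists those commands as argument vectors. It is not taken from the linked comment, and the path `/swapfile` and the 2 GiB size are assumptions chosen for illustration.

```python
def swapfile_commands(path: str = "/swapfile", size_gib: int = 2) -> list[list[str]]:
    """Return the usual commands to create and enable a swap file.

    Each inner list is one command, to be run as root and in order.
    """
    if size_gib < 1:
        raise ValueError("size_gib must be at least 1")
    return [
        ["fallocate", "-l", f"{size_gib}G", path],  # reserve the space
        ["chmod", "600", path],                     # restrict access to root
        ["mkswap", path],                           # write the swap signature
        ["swapon", path],                           # enable it immediately
    ]


if __name__ == "__main__":
    for cmd in swapfile_commands():
        print(" ".join(cmd))
```

Swap enabled this way does not survive a reboot unless an entry for the file is also added to `/etc/fstab`.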

rohit-32 commented 2 months ago

> 2gb hetzner s

@ghsteff let me give it a try

tiotdev commented 2 months ago

Had the same problem on my 8 GB RAM Hetzner server. Adding 8 GB swap space solved it for me.

Edit: Didn't solve it, the spikes are just less common

tiotdev commented 2 months ago

The CPU spikes kept taking my app down for several minutes every few days. Today, there was a downtime of several hours. This time I got this log alert:

PullTemplatesAndVersions failed with: cURL error 28: Operation timed out after 30001 milliseconds with 0 out of 372726 bytes received (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for https://raw.githubusercontent.com/coollabsio/coolify/main/templates/service-templates.json.

I don't think the issue lies with Hetzner, since restarting the server fixed it. Hope this will be fixed soon. Coolify is amazing to use, but uptime is critical for me. I went back to my previous, manual Docker setup, which has proved to be reliable. I will follow the progress of Coolify and give it another try once it's out of beta.

Aft1n commented 2 months ago

For a week my app was working fine, but over the last couple of days I noticed a degradation in responsiveness. I checked htop and docker stats, and saw tremendous spikes in the coolify container's CPU usage, though I don't have many apps: one Bun app with pm2, Umami stats, and Uptime Kuma.

I also have a subserver controlled by Coolify, where it deploys a couple of dockerized apps. Nothing overwhelming from my perspective.

Screenshot 2024-07-22 at 12 12 23

swissbyte commented 2 months ago

Hi @Aft1n, yes, I experienced the same... I reinstalled/downgraded to Coolify 260 and now it's working perfectly fine again.

Aft1n commented 2 months ago

I shut down the server and turned it back on after a minute, which fixed it for now. But the question is how long this effect will last.

marke-dev commented 2 months ago

I have revisited this:

Taking a deeper look into the coolify container, I noticed that php-fpm seems to be the cause of this issue, or at least part of it. All my containers are stopped through Coolify; literally only Coolify is running.

hh:mm:ss
00:35:03 php-fpm: pool www
00:09:37 php /var/www/html/artisan start:horizon
00:11:29 php /var/www/html/artisan start:scheduler
00:35:08 php-fpm: pool www
00:12:19 /usr/bin/php8.2 artisan horizon:supervisor ff5b1471ba88-xDox:s6 redis

P.S. yes that is 35 mins


Mon 22 Jul 2024 11:24:02 PM EDT: Container coolify is using 107% CPU
Mon 22 Jul 2024 11:25:03 PM EDT: Container coolify is using 146% CPU
Mon 22 Jul 2024 11:25:03 PM EDT: Container coolify is using 146% CPU
Mon 22 Jul 2024 11:26:04 PM EDT: Container coolify is using 109% CPU
Mon 22 Jul 2024 11:26:04 PM EDT: Container coolify is using 138% CPU
Mon 22 Jul 2024 11:32:03 PM EDT: Container coolify is using 147% CPU
Mon 22 Jul 2024 11:32:03 PM EDT: Container coolify is using 147% CPU
Mon 22 Jul 2024 11:33:04 PM EDT: Container coolify is using 154% CPU
Mon 22 Jul 2024 11:33:04 PM EDT: Container coolify is using 154% CPU
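
Alert lines in this format are easy to summarise when comparing runs or Coolify versions. The sketch below is a minimal example that assumes exactly the wording shown above ("Container NAME is using N% CPU"). It does not know how the log was produced, and it counts repeated entries as separate samples.

```python
import re
from collections import defaultdict

# Matches the alert wording used in the log above.
LINE_RE = re.compile(r"Container (\S+) is using (\d+(?:\.\d+)?)% CPU")


def summarize(log_text: str) -> dict[str, dict[str, float]]:
    """Per-container sample count, mean and peak CPU% from alert lines."""
    samples = defaultdict(list)
    # findall works whether entries are one per line or run together.
    for name, pct in LINE_RE.findall(log_text):
        samples[name].append(float(pct))
    return {
        name: {
            "samples": len(vals),
            "mean": round(sum(vals) / len(vals), 1),
            "peak": max(vals),
        }
        for name, vals in samples.items()
    }
```

Feeding it the whole log file gives one summary entry per container, which makes it easier to see whether a change such as a downgrade or a CPU cap actually lowered the peaks.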

andrasbacsai commented 2 months ago

Sorry for the super late answer.

The SSH connections (especially those from the background jobs) are causing the high CPU usage. SSH, even with multiplexing (mux), needs a lot of CPU.

I have started optimizing the jobs so that they fetch all their data over a single SSH connection.
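
To illustrate the general idea of collecting several pieces of data over one SSH connection, here is a minimal Python sketch. It is not Coolify's implementation (Coolify itself is a PHP application); the marker string, host and commands are hypothetical. The commands are joined into one remote script with a marker echoed between them, and the combined output is split on that marker afterwards.

```python
MARKER = "__CMD_BOUNDARY__"  # hypothetical separator, unlikely to occur in real output


def batched_ssh_argv(host: str, commands: list[str]) -> list[str]:
    """Build one ssh invocation that runs every command in sequence.

    The remote shell receives a single script in which each command is
    followed by an echo of MARKER, so the outputs can be told apart.
    """
    script = f"; echo {MARKER}; ".join(commands)
    return ["ssh", host, script]


def split_batched_output(stdout: str) -> list[str]:
    """Split the combined stdout back into one chunk per command."""
    return [chunk.strip() for chunk in stdout.split(MARKER)]
```

With `subprocess.run(batched_ssh_argv(host, cmds), capture_output=True, text=True)`, three status checks cost one connection and handshake instead of three. The trade-off is that a command whose own output contains the marker would break the split, so the marker has to be chosen with care.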

BThero commented 1 month ago

Experiencing the same issue with beta.319 on a DigitalOcean VPS. Do I understand it right that the solution for now is to roll back to beta.260 and wait for an official resolution?

madebylydia commented 1 month ago

Considering @andrasbacsai started working on it, I'd just wait patiently until the problem is resolved, so as not to lose all the new features that were introduced (there are a few). It all depends on your needs. Ultimately, you're the one in charge of your server. Your call.