coollabsio / coolify

An open-source & self-hostable Heroku / Netlify / Vercel alternative.
https://coolify.io
Apache License 2.0
34.3k stars 1.86k forks

[Bug]: Unexplained High CPU Usage Spike in Coolify Following Deployment Attempt #2110

Open galacoder opened 6 months ago

galacoder commented 6 months ago

Description

I have been running Coolify for 8 days with various services, encountering no prior issues. However, on the night of April 30th EST, I experienced a significant CPU usage spike starting around 11 PM, shortly after an unsuccessful attempt to deploy a React application. It is unclear whether this issue was directly related to the deployment failure, a potential attack, or another problem.

Expected Behavior

CPU usage should remain stable, without significant spikes, particularly when no active deployments or heavy tasks are underway.

Actual Behavior

CPU usage unexpectedly spiked to over 300% and remained high throughout the night, which was unusual and concerning, given the context.

Environment

Additional Context

The sudden surge in CPU usage occurred after the deployment failure, but it is uncertain whether the spike was a direct result of that event, a security issue, or another underlying problem. This incident warrants further investigation to prevent future occurrences.

I have already restarted my VPS twice, but the problem persisted.

I would appreciate any insights or troubleshooting steps you could recommend to help identify and resolve the root cause of this spike. Thank you for your assistance.
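
In the meantime, a minimal way to capture the culprit during the next spike (the log path and interval here are arbitrary examples, not anything Coolify-specific) is to record the top CPU consumers once a minute:

    # log the 10 most CPU-hungry processes every minute (run from tmux/screen or cron)
    while true; do
      date >> /var/log/cpu-top.log
      ps -eo pid,pcpu,pmem,etime,comm --sort=-pcpu | head -n 11 >> /var/log/cpu-top.log
      sleep 60
    done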

Minimal Reproduction (if possible, example repository)

Steps to Reproduce

  1. Set up and run Coolify with multiple services for over a week.
  2. Attempt to deploy a React app, which fails.
  3. Monitor CPU usage and observe an unexpected spike starting around 11 PM EST, continuing without additional user actions.

Exception or Error

Screenshots

Screenshot 2024-05-01 at 23 11 05
Screenshot 2024-05-01 at 23 11 33

Version

4.0.0-beta.271

swissbyte commented 3 months ago

@BThero I'm staying on 260 for now.

Nedi11 commented 3 months ago

Still experiencing this issue: out of the blue, high CPU usage and disk throughput, and Coolify and everything hosted on it crashes.

Screenshot: image

andrasbacsai commented 3 months ago

Still experiencing this issue: out of the blue, high CPU usage and disk throughput, and Coolify and everything hosted on it crashes.

image

Can you check if your disk is not full?

Nedi11 commented 3 months ago

Can you check if your disk is not full?

It is 89% full
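
For reference, a quick way to see whether Docker data (images, containers, volumes, build cache) is what is filling the disk, using standard Docker CLI commands:

    df -h /             # overall disk usage
    docker system df    # breakdown of images, containers, volumes, build cache
    # if old images or the build cache dominate, "docker image prune" or
    # "docker builder prune" can reclaim space; review what they would remove first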

tschuehly commented 3 months ago

I'm also experiencing 100% CPU usage on a new install on a 4-CPU Contabo server. It seems like artisan is taking up all the CPU.

Screenshot 2024-08-08 at 18 49 14

Nedi11 commented 3 months ago

In my case it could be a storage issue; it's now 98% used and it crashed.

marke-dev commented 3 months ago

Definitely not a storage issue for me. I can have no projects running and the Coolify dashboard still becomes inaccessible. It takes about 4-7 days, then I have to either restart my server or spin Coolify down and try again.

cblberlin commented 2 months ago

Same here. I have a 4 GB, 4-core VPS on OVH, and recently some deployments crash the whole VPS; I have to reboot it to get it working again.

boyhax commented 2 months ago

Please, we need a fix for this problem. It destroyed my projects on Coolify: if I start the Coolify container, it overloads the CPU and crashes the server.

ShowtimeProp commented 2 months ago

I do have enough free RAM. It's also not the swapd process that eats up all the CPU. Thanks for the proposal.

It does the same on my server: Hetzner, 2 vCPUs, 8 GB RAM, 80 GB disk. I don't know what's going on with Coolify.

marke-dev commented 2 months ago

There have been updates recently, from about a week ago, and my Coolify has been working great. I haven't reviewed the recent changes, but I have to say improvements have been made; my server has now been up and running for about a week.

cblberlin commented 2 months ago

Yes, most of the deployments don't have problems, but I just don't know why some of them crash the VPS.

ck-euan commented 2 months ago

I haven't had an outage since June 27th; the latest versions of Coolify seem to have had a positive impact on stability.

cblberlin commented 2 months ago

I haven't had an outage since June 27th; the latest versions of Coolify seem to have had a positive impact on stability.

Can I ask about your VPS configuration?

ck-euan commented 2 months ago

I haven't had an outage since June 27th; the latest versions of Coolify seem to have had a positive impact on stability.

Can I ask about your VPS configuration?

This is a work setup so might not be suitable for many use cases, but I'm currently running 3 servers:

  • Build server: 2 vCPUs, 4 GB RAM, 80 GB disk (DigitalOcean Premium Intel)
  • Apps server: 2 vCPUs, 4 GB RAM, 80 GB disk (DigitalOcean Regular Intel)
  • Coolify server: 2 vCPUs, 2 GB RAM, 25 GB disk

I also allocated swap space to the servers, 2GB on the larger servers and 1GB on the Coolify one.

We used to have crashes weekly with this setup, but haven't for around 2 months now, even while pushing 15-20 deployments out a day.

jurgen-siegel commented 2 months ago

My website is extremely slow. Not sure if it has anything to do with this.

swissbyte commented 2 months ago

It does! Mine had a response time of about 2-7 seconds per page load. After going back to version 260, everything was fine again.

jurgen-siegel commented 2 months ago

I am on 323; I will try 260.

cblberlin commented 2 months ago

It does! Mine had a response time of about 2-7 seconds per page load. After going back to version 260, everything was fine again.

Mine was pretty fast once the deployment was done, maybe because mine is a simple website? You could try deploying on Vercel to see whether it's a Coolify problem, or maybe your backend is not that quick, or your bundle size is too large?

jurgen-siegel commented 2 months ago

Which version of Coolify are you guys using?

cblberlin commented 2 months ago

Which version of Coolify are you guys using?

I'm using 323.

jurgen-siegel commented 2 months ago

Which version of Coolify are you guys using?

I'm using 323.

Is it slow for you? Page load on my website is 2-7 seconds, just like @swissbyte mentioned.

cblberlin commented 2 months ago

Which version of Coolify are you guys using?

I'm using 323.

Is it slow for you? Page load on my website is 2-7 seconds, just like @swissbyte mentioned.

For me, I run some backend systems (dashboards with different use cases). The website uses Next.js and React, fully front-end/back-end separated; the backend is written in FastAPI, so I guess it's not that heavy. I put all pagination, sorting, and filtering in the backend so it doesn't send all the data at once. My VPS is 4 GB, 4-core, 80 GB with OVH, so once it's deployed it's pretty quick. I don't know what kind of website you deployed, but if you could provide a screenshot of the browser's Network tab, that would help.

madebylydia commented 2 months ago

Screenshot_20240828-170030.png

Thank you for absolutely flooding my email. I think now is a good time to remind everyone that some people are following this issue to know when it's fixed, and that it is being worked on. See what @andrasbacsai answered weeks ago.

Please only comment if you have something useful to add, such as details of your specific case, not general discussion. This is just spam at this point.

andrasbacsai commented 2 months ago

I am prioritizing this issue.

andrasbacsai commented 2 months ago

If anyone still has a server where this occurs and would like to help me debug it, do not hesitate to write to me on Discord, or via https://coolify.io/docs/contact

Until then, I am doing my best to reproduce it and find the bottleneck. I have already fixed a few things, but I am still unsure what exactly is causing it.

Fadil3 commented 2 months ago

Hello, @andrasbacsai,

I have the same problem. This is the screenshot from htop: image

I've tried to strace the PID and here are the results:

epoll_pwait(4, [], 1024, 289, NULL, 8) = 0

And:

epoll_pwait(4, [{events=EPOLLIN, data={u32=16, u64=16}}], 1024, 500, NULL, 8) = 1
read(16, "{\"jsonrpc\":\"2.0\",\"method\":\"job\","..., 65536) = 354
mmap(NULL, 69632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa3323ec000
munmap(0x7fa3323ec000, 69632)

I've tried to kill the process, but it keeps starting over and over again. This suggests that the process is likely being respawned or restarted automatically, perhaps as part of a system service or a background task.
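
For reference, a minimal way to confirm what is respawning it, assuming the PID from htop is still current (<PID> is a placeholder), is to inspect the parent chain instead of killing the child:

    pstree -ps <PID>                          # ancestry of the busy process
    ps -o ppid= -p <PID>                      # just the parent PID
    tr '\0' ' ' < /proc/<PID>/cmdline; echo   # full command line of the process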

Could you please investigate this issue and provide a fix or a workaround? The high CPU and RAM usage is causing significant performance problems on my system.

Please let me know if you need any additional information or if there's anything else I can do to assist with the investigation.

Thank you for your attention to this matter.

andrasbacsai commented 2 months ago

Thank you! That contains some useful information!

Could you maybe grab what exactly is executed in that ssh process? I know it is one of the jobs, but I could not identify which one.

Are you on the latest version (325)?

Fadil3 commented 2 months ago

I got this in the SSH log:

Sep 03 22:06:41 vmi1685278.contaboserver.net sshd[17039]: Failed password for root from 146.71.50.198 port 56530 ssh2
Sep 03 22:06:44 vmi1685278.contaboserver.net sshd[17039]: Connection closed by authenticating user root 146.71.50.198 port 56530 [preauth]
Sep 03 22:06:57 vmi1685278.contaboserver.net sshd[18474]: Connection from 61.177.172.181 port 61353 on [REDACTED] port 22 rdomain ""
Sep 03 22:07:06 vmi1685278.contaboserver.net sshd[19191]: Connection from 172.18.0.10 port 55642 on 172.17.0.1 port 22 rdomain ""
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: Accepted key [REDACTED] found at /root/.ssh/authorized_keys:2
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: Postponed publickey for root from 172.18.0.10 port 55642 ssh2 [preauth]
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: Accepted key [REDACTED] found at /root/.ssh/authorized_keys:2
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: Accepted publickey for root from 172.18.0.10 port 55642 ssh2: [REDACTED]
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[18474]: Received disconnect from 61.177.172.181 port 61353:11:  [preauth]
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[18474]: Disconnected from authenticating user root 61.177.172.181 port 61353 [preauth]
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: Starting session: command for root from 172.18.0.10 port 55642 id 0
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: Close session: user root from 172.18.0.10 port 55642 id 0
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: Received disconnect from 172.18.0.10 port 55642:11: disconnected by user
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: Disconnected from user root 172.18.0.10 port 55642
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: pam_unix(sshd:session): session closed for user root
Sep 03 22:07:08 vmi1685278.contaboserver.net sshd[19280]: Connection from 172.18.0.10 port 55648 on 172.17.0.1 port 22 rdomain ""
Sep 03 22:07:08 vmi1685278.contaboserver.net sshd[19280]: Accepted key [REDACTED] found at /root/.ssh/authorized_keys:2
Sep 03 22:07:09 vmi1685278.contaboserver.net sshd[19280]: Postponed publickey for root from 172.18.0.10 port 55648 ssh2 [preauth]
Sep 03 22:07:09 vmi1685278.contaboserver.net sshd[19280]: Accepted key [REDACTED] found at /root/.ssh/authorized_keys:2
Sep 03 22:07:09 vmi1685278.contaboserver.net sshd[19280]: Accepted publickey for root from 172.18.0.10 port 55648 ssh2: [REDACTED]
Sep 03 22:07:09 vmi1685278.contaboserver.net sshd[19280]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Sep 03 22:07:12 vmi1685278.contaboserver.net sshd[19604]: Connection from 172.18.0.10 port 55664 on 172.17.0.1 port 22 rdomain ""
Sep 03 22:07:12 vmi1685278.contaboserver.net sshd[19604]: Accepted key [REDACTED] found at /root/.ssh/authorized_keys:2
Sep 03 22:07:12 vmi1685278.contaboserver.net sshd[19604]: Postponed publickey for root from 172.18.0.10 port 55664 ssh2 [preauth]
Sep 03 22:07:12 vmi1685278.contaboserver.net sshd[19604]: Accepted key [REDACTED] found at /root/.ssh/authorized_keys:2
Sep 03 22:07:12 vmi1685278.contaboserver.net sshd[19604]: Accepted publickey for root from 172.18.0.10 port 55664 ssh2: [REDACTED]
Sep 03 22:07:12 vmi1685278.contaboserver.net sshd[19604]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Sep 03 22:07:12 vmi1685278.contaboserver.net sshd[19604]: Starting session: command for root from 172.18.0.10 port 55664 id 0
Sep 03 22:07:23 vmi1685278.contaboserver.net sshd[20586]: Connection from 61.177.172.181 port 21578 on [REDACTED] port 22 rdomain ""

Is this a brute-force SSH attack?

Yes, I am using the latest version.

peaklabs-dev commented 2 months ago

@Fadil3 No, this is most likely not a brute-force attack, as there is only 1 failed login attempt. The SSH connections are authenticated via public key and come from an internal IP, 172.18.0.10, which is most likely a Docker container. It is a bit too frequent for my taste, though.

If you self-host, could you maybe check https://COLLIFY_IP:8000/horizon so we can see what jobs are being run?

Dev note: @andrasbacsai we believe this is caused by a job, right? Opening so many SSH connections in so few seconds has to be a script or something CI/CD-related (automated). I will investigate more.
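
For anyone wanting to quantify this from the host side, a rough sketch, assuming sshd logs to /var/log/auth.log as on Debian/Ubuntu (field positions may differ with other log formats):

    # sessions opened per minute (month, day, HH:MM)
    grep 'sshd.*session opened' /var/log/auth.log | awk '{print $1, $2, substr($3, 1, 5)}' | sort | uniq -c | tail -n 20
    # connecting source addresses, most frequent first
    grep 'sshd.*Connection from' /var/log/auth.log | awk '{print $8}' | sort | uniq -c | sort -rn | head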

Fadil3 commented 2 months ago

https://COLLIFY_IP:8000/horizon

This is a screenshot from Horizon, @peaklabs-dev @andrasbacsai:

image

image

peaklabs-dev commented 2 months ago

Edit: This was just me not thinking while tired; it is the normal compose healthcheck.

After some digging I found something strange. Not sure if this is related, but when I deploy ClassicPress with MariaDB and check the container logs of the ClassicPress container, there is a call from localhost every 2 seconds:

This snippet is a small part; it goes on forever like this.

127.0.0.1 - - [04/Sep/2024:11:54:17 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:20 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:22 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:24 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:26 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:28 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:30 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:32 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:34 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:36 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:38 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:40 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:42 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:44 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:46 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:48 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:50 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:52 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:54 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:56 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:58 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:00 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:02 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:04 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:06 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:08 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:10 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:13 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:15 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:17 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:19 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:21 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:23 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:25 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:27 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:29 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:31 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:33 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:35 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:37 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:39 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:41 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:43 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:45 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:47 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:49 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:51 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:53 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:55 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"

-> This could make the site really slow and a health check should not be that frequent, right?
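
One way to confirm that these requests come from the container's own healthcheck rather than from Coolify itself (the container name below is a placeholder):

    # healthcheck configured on the container: test command, interval, retries
    docker inspect --format '{{json .Config.Healthcheck}}' <container_name>
    # recent healthcheck runs and their output
    docker inspect --format '{{json .State.Health}}' <container_name>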

Vahor commented 2 months ago

@peaklabs-dev That's the default healthcheck for the container https://github.com/coollabsio/coolify/blob/3f34df251e01170e8716e8ee2eef3eaa432f5591/templates/compose/classicpress-with-mariadb.yaml#L19-L23

programad commented 2 months ago

Just had this bug. Hetzner graphs skyrocketed, up to 200% CPU. The top command shows information that may help:

image

The high CPU on kswapd0 indicates that I am using too much swap, and if you look closely, there is high memory usage on npm ci. I think this started when I committed to a GitHub repo and it triggered a deploy on Coolify, and for some reason it is stuck on that command. Another possibility is the Coolify upgrade. Before the crash, Coolify was showing the Upgrade button (I have auto-update turned on). Maybe it tried to upgrade and conflicted with something, maybe my GitHub deploy? Both at the same time? I don't know.

I tried freeing some swap and it didn't work. I'll try to kill the npm ci.
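
A minimal way to find and stop the stuck build process (the <PID> below is a placeholder taken from pgrep's output):

    pgrep -af 'npm ci'   # list matching PIDs with their full command line
    kill <PID>           # try a graceful stop first
    # only if it is still running after a few seconds:
    kill -9 <PID>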

programad commented 2 months ago

I fixed my situation following this procedure suggested by an LLM:

The system has 0 MiB of swap available, which means swapping is disabled, yet kswapd0 is still active and consuming CPU. This behavior can occur due to misconfigured virtual memory settings or low available RAM.

Steps to Fix:

  1. Ensure Enough RAM: Your system has around 3.8 GB of RAM, and only 8.9 MB is available. It seems the system is under memory pressure, causing kswapd0 to trigger swapping behavior despite swap being unavailable.

    • You might need to upgrade the RAM or optimize memory-heavy applications.
    • Reduce the number of processes running, especially those with significant memory usage like npm ci, vector, docker, beam.smp, and node processes.
  2. Re-enable Swap: Since swap space is currently 0 MiB (disabled), re-enabling swap can help reduce the burden on the physical memory and lower CPU usage by kswapd0.

    • To check if a swap file exists:

      sudo swapon --show
    • If there’s no swap space, you can create and enable a swap file:

      sudo fallocate -l 4G /swapfile  # Adjust size as needed
      sudo chmod 600 /swapfile
      sudo mkswap /swapfile
      sudo swapon /swapfile
    • To make this change permanent, add the swap file to /etc/fstab:

      echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
  3. Adjust swappiness: Once swap is enabled, you can adjust the swappiness value to control how aggressively the kernel swaps out memory.

    Set swappiness to a lower value (e.g., 10) to reduce swap usage:

    sudo sysctl vm.swappiness=10

    Make it permanent by adding to /etc/sysctl.conf:

    echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
  4. Investigate Memory-Intensive Processes: From the output, npm ci, docker, beam.smp, and node processes are consuming a significant amount of resources. You might want to:

    • Restart or optimize these processes.
    • Kill non-essential processes to free up memory:

      sudo kill <PID>

      (Replace <PID> with the process ID of the resource-hungry process.)

By enabling swap and optimizing resource usage, you should see a significant reduction in CPU usage by kswapd0.
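
As a quick sanity check after applying the above (expected values assume the 4G swap file and swappiness of 10 from the steps):

    free -h                # swap total should now show ~4 GiB
    swapon --show          # the /swapfile entry should be listed
    sysctl vm.swappiness   # should report vm.swappiness = 10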

My graphs now are going back to normal:

image

Bonus: Coolify is up again, and I confirmed it was not the Coolify update causing this; it is still showing the Upgrade button.

Vahor commented 2 months ago

@Fadil3 can you check that in your bashrc there's something like

# If not running interactively, don't do anything
case $- in
    *i*) ;;
      *) return;;
esac

or a shorter version

# If not running interactively, don't do anything
[ -z "$PS1" ] && return

As Coolify uses a lot of SSH connections, if you have a bashrc that does something on user login (even if no TTY is requested, it is still sourced), this might have an impact. Depending on how Coolify handles this (large compose files, multiple deployments, etc.), this can be an issue.

Note: I did a test on one of my servers. Every minute, Coolify runs 5 commands:

  • ls /
  • docker container ls -q
  • docker inspect [id1] [id2] .. --format '{{json .}}'
  • ls / (twice, not a typo)
  • docker inspect --format '{{json .}}' coolify-proxy (on that server the proxy is voluntarily stopped, so this shouldn't happen)

This was found by doing a tail -f on /var/log/auth.log to see SSH connections/commands.

An improvement might be to reuse the same SSH session. Looking at the code, I can see there is a mux system, but I'm not sure it works; I added prints in the code and it does not seem to be enabled (maybe only locally, I haven't really checked).
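
For illustration, this is roughly what connection reuse looks like at the OpenSSH level (the host and socket path below are placeholders): one master connection stays open in the background and subsequent commands ride on it instead of performing a full handshake each time.

    # open a background master connection
    ssh -fNM -o ControlMaster=auto -o ControlPath=/tmp/mux-%r@%h:%p -o ControlPersist=1h root@SERVER_IP
    # later commands reuse that connection instead of a new login each time
    ssh -o ControlPath=/tmp/mux-%r@%h:%p root@SERVER_IP 'docker container ls -q'
    # check whether the master is alive, or shut it down
    ssh -O check -o ControlPath=/tmp/mux-%r@%h:%p root@SERVER_IP
    ssh -O exit -o ControlPath=/tmp/mux-%r@%h:%p root@SERVER_IP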

reactivesilicon commented 2 months ago

Any update on this? I'm facing the same issue on Hetzner.

Fadil3 commented 2 months ago

Yes, I have that in my .bashrc.

RebootGG commented 2 months ago

Same issue here with a Scaleway Instance with 3 cores, 4 GB RAM.

I have Coolify + 3 Next.js deployments on this instance.

CPU-wise, everything was working "fine" until I configured health checks for my apps yesterday.

Overnight, CPU usage went crazy and made the instance completely unresponsive; I have to restart it each time just to connect via SSH. The apps are down as well, obviously.

image.

I am running v330. Is there a stable version I can downgrade to?

jurgen-siegel commented 2 months ago

I'd love to know as well. I tried 260 and it was the same thing for me. I dropped Coolify completely. If there's a fix for this, I'll go back.

Geczy commented 2 months ago

@RebootGG v323 has been stable for me!

TimKochDev commented 1 month ago

Coolify maxed out my Hetzner server at 200% when I tried to deploy a midsize NodeJS app. Coolify, builds, and app are all on the same server. It had worked before. I restarted the server, upgraded Coolify to v4.0.0-beta.335, and started the deployment again. Now it worked and the deployment never maxed out. Of course, it spiked way above 100% but it always fluctuated and never got stuck at the maximum. So for me, the problem appears to be solved. Thank you!!

TimKochDev commented 1 month ago

Nope, today the server maxed out again while deploying/building. Coolify and the projects hosted on the service were not available from the internet anymore and instead showed a 502 Bad Gateway. What logs can I provide you with?

RebootGG commented 1 month ago

I'll edit this message if a problem occurs, but so far running Coolify in a dedicated instance seems to fix this issue.

I'm deploying my apps on a separate server and both instances seem to be running fine.

(This is not an ideal workaround in many cases obviously)

wolf-code-de commented 1 month ago

I had the same problem and had to hard reset the service when the build process started. Configuring the swap file as programad suggested fixed the problem for me. My server has 4 GB of RAM. Should I order one with more RAM instead?

m1daz commented 1 month ago

Bump for this. I gave the server 70 GB of RAM and Coolify will use as much as is available; it does not matter. These spikes typically ONLY follow deployments. Coolify will NOT spike in memory by itself, only after deploying/redeploying/restarting.

Fadil3 commented 1 month ago

After the latest update (340), my server can breathe freely. It's no longer at 100% usage on sshd or ssh. Thank you @peaklabs-dev @Vahor @andrasbacsai for resolving this issue. image

Fadil3 commented 1 month ago

After one night, the problem came back :( image

I tried to track it with ps -ef | grep -w ssh | grep -v grep | grep -v pts and here's the result:

9999        8887    6543  0 13:42 ?        00:00:00 ssh -fNM -o ControlMaster=auto -o ControlPath=/var/www/html/storage/app/ssh/mux/j8o4ss0 -o ControlPersist=1h -i /var/www/html/storage/app/ssh/keys/id.root@j8o4ss0 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o PasswordAuthentication=no -o ConnectTimeout=10 -o ServerAliveInterval=20 -o RequestTTY=no -o LogLevel=ERROR -p 22 root@host.docker.internal
9999        8975    6543  0 13:42 ?        00:00:00 ssh: /var/www/html/storage/app/ssh/mux/j8o4ss0 [mux]

Geczy commented 1 month ago

This is the last 7 days; FWIW, I don't have this issue of high CPU usage.

But I'm also not deploying, so just to confirm: do your issues occur only after/during deploying?

CleanShot 2024-09-19 at 07 44 01@2x

RedFr4me commented 1 month ago

Same issue happening out of nowhere today for me as well; I was forced to upgrade my VPS.