Open galacoder opened 6 months ago
@BThero I'll stay on 260 for now.
Still experiencing this issue: out of the blue, high CPU usage and disk throughput, and Coolify and everything hosted on it crashes.
Can you check if your disk is not full?
It is 89% full
I'm also experiencing 100% CPU usage on a new install on a 4-CPU Contabo server. Seems like artisan is taking up all the CPU.
In my case it could be a storage issue; it's now 98% used and crashed.
Definitely not a storage issue. I can have no projects running and the Coolify dashboard will still become unreachable. It takes about 4-7 days, then I have to either restart my server or spin Coolify down and try again.
Same here. I have a 4 GB, 4-core VPS on OVH, and recently some deployments crash the whole VPS; I have to reboot it to get it working again.
Please, we need a fix for this problem. It destroyed my projects on Coolify: if I start the Coolify container, it overloads the CPU and crashes the server.
I do have enough free RAM. It's also not the kswapd process that eats up all the CPU. Thanks for the proposal.
It does the same on my Hetzner server. I have 2 vCPU, 8 GB RAM, 80 GB disk. I don't know what's going on with Coolify.
There have been updates within the last week, and my Coolify has been working great. I haven't reviewed the recent changes, but I have to say improvements have been made; my server has been up and running for about a week.
Yes, most deployments don't have problems, but I just don't know why some of them crash the VPS.
I haven't had an outage now since June 27th, the latest versions of Coolify seem to have had a positive impact on the stability
Can I know your VPS configuration?
This is a work setup, so it might not be suitable for many use cases, but I'm currently running 3 servers:
Build server: 2 vCPUs, 4 GB RAM, 80 GB disk (Digital Ocean Premium Intel)
Apps server: 2 vCPUs, 4 GB RAM, 80 GB disk (Digital Ocean Regular Intel)
Coolify server: 2 vCPUs, 2 GB RAM, 25 GB disk
I also allocated swap space to the servers, 2GB on the larger servers and 1GB on the Coolify one.
We used to have crashes weekly with this setup, but haven't for around 2 months now, even while pushing 15-20 deployments out a day.
My website is extremely slow; not sure if it has anything to do with this.
It has! Mine had a response time of about 2-7 seconds per page load. After going back to version 260, everything was fine again.
I am on 323; I will try 260.
Mine was pretty fast once the deployment was done, maybe because mine is a simple website? You could try deploying on Vercel to see whether it's a Coolify problem, or maybe your backend isn't that quick, or your bundle size is too large.
Which version of Coolify are you guys using?
I'm using 323.
Is it slow for you? Page load on my website is 2-7 seconds, just like @swissbyte mentioned.
For me, I run a backend system (a dashboard with different use cases) built with Next.js and React, with the frontend and backend separated; the backend is written in FastAPI, so I guess it's not that heavy. I also put all the pagination, sorting, and filtering in the backend so it doesn't return all the data at once. My VPS is 4 GB, 4 cores, 80 GB with OVH, so once it's deployed, it's pretty quick. I don't know what kind of website you deployed, but if you can provide a screenshot of the browser's Network tab, that would help.
Thank you for absolutely flooding my email. I think now is a good time to remind everyone that some people are following this issue to know when it's fixed, and it's being worked on. See what @andrasbacsai answered weeks ago.
Please, use the comments only for something useful, such as your specific case, not general discussion. This is just spam at this point.
I prioritize this issue.
If anyone still has a server where it occurs and would like to help me debug, do not hesitate to write me on Discord, or https://coolify.io/docs/contact
Until then, I am doing my best to reproduce it and find the bottleneck. I have already fixed a few things, but I am still unsure what exactly is causing it.
Hello, @andrasbacsai,
I have the same problem. This is the screenshot from htop. I've tried to strace the PID and here are the results:
epoll_pwait(4, [], 1024, 289, NULL, 8) = 0
And:
epoll_pwait(4, [{events=EPOLLIN, data={u32=16, u64=16}}], 1024, 500, NULL, 8) = 1
read(16, "{\"jsonrpc\":\"2.0\",\"method\":\"job\","..., 65536) = 354
mmap(NULL, 69632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa3323ec000
munmap(0x7fa3323ec000, 69632)
I've tried to kill the process, but it keeps starting over and over again. This suggests that the process is likely being respawned or restarted automatically, perhaps as part of a system service or a background task.
Could you please investigate this issue and provide a fix or a workaround? The high CPU and RAM usage is causing significant performance problems on my system.
Please let me know if you need any additional information or if there's anything else I can do to assist with the investigation.
Thank you for your attention to this matter.
Thank you! It contains some useful information!
Can you maybe grab what exactly is executed in the ssh process? I know it is one of the jobs, but I could not identify which one.
Are you on the latest version (325)?
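If it helps with capturing that: one rough approach (just a sketch, not Coolify-specific; replace <PID> with whichever process is busy in htop) is to trace every command the process and its children execute:
sudo strace -f -e trace=execve -s 256 -o /tmp/ssh-exec.log -p <PID>
grep execve /tmp/ssh-exec.log | head
-f follows forked children, so the actual commands run over the SSH sessions should show up as execve calls in /tmp/ssh-exec.log.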
I got this in the ssh log:
Sep 03 22:06:41 vmi1685278.contaboserver.net sshd[17039]: Failed password for root from 146.71.50.198 port 56530 ssh2
Sep 03 22:06:44 vmi1685278.contaboserver.net sshd[17039]: Connection closed by authenticating user root 146.71.50.198 port 56530 [preauth]
Sep 03 22:06:57 vmi1685278.contaboserver.net sshd[18474]: Connection from 61.177.172.181 port 61353 on [REDACTED] port 22 rdomain ""
Sep 03 22:07:06 vmi1685278.contaboserver.net sshd[19191]: Connection from 172.18.0.10 port 55642 on 172.17.0.1 port 22 rdomain ""
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: Accepted key [REDACTED] found at /root/.ssh/authorized_keys:2
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: Postponed publickey for root from 172.18.0.10 port 55642 ssh2 [preauth]
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: Accepted key [REDACTED] found at /root/.ssh/authorized_keys:2
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: Accepted publickey for root from 172.18.0.10 port 55642 ssh2: [REDACTED]
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[18474]: Received disconnect from 61.177.172.181 port 61353:11: [preauth]
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[18474]: Disconnected from authenticating user root 61.177.172.181 port 61353 [preauth]
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: Starting session: command for root from 172.18.0.10 port 55642 id 0
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: Close session: user root from 172.18.0.10 port 55642 id 0
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: Received disconnect from 172.18.0.10 port 55642:11: disconnected by user
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: Disconnected from user root 172.18.0.10 port 55642
Sep 03 22:07:07 vmi1685278.contaboserver.net sshd[19191]: pam_unix(sshd:session): session closed for user root
Sep 03 22:07:08 vmi1685278.contaboserver.net sshd[19280]: Connection from 172.18.0.10 port 55648 on 172.17.0.1 port 22 rdomain ""
Sep 03 22:07:08 vmi1685278.contaboserver.net sshd[19280]: Accepted key [REDACTED] found at /root/.ssh/authorized_keys:2
Sep 03 22:07:09 vmi1685278.contaboserver.net sshd[19280]: Postponed publickey for root from 172.18.0.10 port 55648 ssh2 [preauth]
Sep 03 22:07:09 vmi1685278.contaboserver.net sshd[19280]: Accepted key [REDACTED] found at /root/.ssh/authorized_keys:2
Sep 03 22:07:09 vmi1685278.contaboserver.net sshd[19280]: Accepted publickey for root from 172.18.0.10 port 55648 ssh2: [REDACTED]
Sep 03 22:07:09 vmi1685278.contaboserver.net sshd[19280]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Sep 03 22:07:12 vmi1685278.contaboserver.net sshd[19604]: Connection from 172.18.0.10 port 55664 on 172.17.0.1 port 22 rdomain ""
Sep 03 22:07:12 vmi1685278.contaboserver.net sshd[19604]: Accepted key [REDACTED] found at /root/.ssh/authorized_keys:2
Sep 03 22:07:12 vmi1685278.contaboserver.net sshd[19604]: Postponed publickey for root from 172.18.0.10 port 55664 ssh2 [preauth]
Sep 03 22:07:12 vmi1685278.contaboserver.net sshd[19604]: Accepted key [REDACTED] found at /root/.ssh/authorized_keys:2
Sep 03 22:07:12 vmi1685278.contaboserver.net sshd[19604]: Accepted publickey for root from 172.18.0.10 port 55664 ssh2: [REDACTED]
Sep 03 22:07:12 vmi1685278.contaboserver.net sshd[19604]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Sep 03 22:07:12 vmi1685278.contaboserver.net sshd[19604]: Starting session: command for root from 172.18.0.10 port 55664 id 0
Sep 03 22:07:23 vmi1685278.contaboserver.net sshd[20586]: Connection from 61.177.172.181 port 21578 on [REDACTED] port 22 rdomain ""
Is this a brute-force SSH attack?
Yes, I use the latest version.
@Fadil3 No, this is most likely not a brute-force attack, as there is only 1 failed login attempt. The SSH connections are authenticated via public key and come from an internal IP, 172.18.0.10; this is most likely a Docker container. It is a bit too frequent for my taste, though.
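If anyone wants to double-check whether their log is actually a brute-force pattern, a rough way (assuming sshd logs to /var/log/auth.log, as on Debian/Ubuntu) is to count failed attempts per source IP:
sudo grep 'Failed password' /var/log/auth.log | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn | head
A real brute force shows hundreds of attempts from the same external IP; the key-authenticated connections from 172.18.0.10 are Coolify's own sessions over the Docker network.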
If you self host could you maybe check https://COLLIFY_IP:8000/horizon so we can see what jobs are run?
Dev note: @andrasbacsai we believe this is caused by a job, right? Opening so many SSH connections in so few seconds has to be a script or something CI/CD-related (automated). I will investigate more.
This is a screenshot from Horizon @peaklabs-dev @andrasbacsai
Edit: This was just me not thinking while tired, it is the normal compose healthcheck.
After some digging I found something strange. Not sure if this is related, but when I deploy ClassicPress with MariaDB and check the container logs of the ClassicPress container, there is a call from localhost every 2 seconds:
This snippet is a small part; it goes on like this forever.
127.0.0.1 - - [04/Sep/2024:11:54:17 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:20 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:22 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:24 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:26 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:28 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:30 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:32 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:34 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:36 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:38 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:40 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:42 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:44 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:46 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:48 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:50 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:52 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:54 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:56 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:54:58 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:00 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:02 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:04 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:06 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:08 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:10 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:13 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:15 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:17 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:19 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:21 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:23 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:25 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:27 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:29 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:31 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:33 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:35 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:37 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:39 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:41 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:43 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:45 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:47 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:49 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:51 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:53 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
127.0.0.1 - - [04/Sep/2024:11:55:55 +0000] "GET / HTTP/1.1" 302 318 "-" "curl/7.74.0"
-> This could make the site really slow and a health check should not be that frequent, right?
@peaklabs-dev That's the default healthcheck for the container https://github.com/coollabsio/coolify/blob/3f34df251e01170e8716e8ee2eef3eaa432f5591/templates/compose/classicpress-with-mariadb.yaml#L19-L23
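If you want to confirm which healthcheck a running container actually uses (and how often it fires), it can be read back from Docker; the container name here is just a placeholder:
docker inspect --format '{{json .Config.Healthcheck}}' <container-name>
docker inspect --format '{{json .State.Health}}' <container-name>
The first shows the configured test, interval and retries; the second shows the recent probe results. If the 2-second interval turns out to be the problem, the interval in the linked compose template could presumably be raised.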
Just had this bug. Hetzner graphs skyrocketed, up to 200% CPU. The top command shows information that may help:
The high CPU on kswapd0 indicates that I am using too much swap, and if you look closely, there is high memory usage on npm ci. I think this started when I committed to a GitHub repo and it triggered a deploy on Coolify, and for some reason it is stuck on that command.
Another possibility is the Coolify upgrade. Before the crash, Coolify was showing the Upgrade button (I have auto-update turned on). Maybe it tried to upgrade and conflicted with something, maybe my GitHub deploy? Both at the same time? I don't know.
Tried freeing some swap and it didn't work. I'll try to kill the npm ci.
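Before killing anything, it may help to confirm which processes are actually holding the memory when kswapd0 spikes; this is generic, nothing Coolify-specific:
ps aux --sort=-%mem | head -n 10
If npm ci really is the culprit, it should show up near the top of that list.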
I fixed my situation following this procedure suggested by an LLM:
The system has 0 MiB of swap available, which means swapping is disabled, yet kswapd0 is still active and consuming CPU. This behavior can occur due to misconfigured virtual memory settings or low available RAM.
Ensure Enough RAM:
Your system has around 3.8 GB of RAM, and only 8.9 MB is available. It seems the system is under memory pressure, causing kswapd0 to trigger swapping behavior despite swap being unavailable. The main consumers are the npm ci, vector, docker, beam.smp, and node processes.
Re-enable Swap:
Since swap space is currently 0 MiB (disabled), re-enabling swap can help reduce the burden on physical memory and lower CPU usage by kswapd0.
To check if a swap file exists:
sudo swapon --show
If there's no swap space, you can create and enable a swap file:
sudo fallocate -l 4G /swapfile   # Adjust size as needed
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
To make this change permanent, add the swap file to /etc/fstab:
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
Adjust swappiness:
Once swap is enabled, you can adjust the swappiness value to control how aggressively the kernel swaps out memory. Set swappiness to a lower value (e.g., 10) to reduce swap usage:
sudo sysctl vm.swappiness=10
Make it permanent by adding to /etc/sysctl.conf:
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
Investigate Memory-Intensive Processes:
From the output, the npm ci, docker, beam.smp, and node processes are consuming a significant amount of resources. You might want to kill non-essential processes to free up memory:
sudo kill <PID>
(Replace <PID> with the process ID of the resource-hungry process.)
By enabling swap and optimizing resource usage, you should see a significant reduction in CPU usage by kswapd0.
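To verify the steps above actually took effect, a quick check could be:
sudo swapon --show
free -h
cat /proc/sys/vm/swappiness
swapon --show should now list /swapfile, free -h should show the new swap line, and the swappiness value should read 10.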
My graphs are now going back to normal:
Bonus: Coolify is up again now, and I confirmed it was not Coolify updating; it is still showing the Upgrade button.
@Fadil3 can you check that your .bashrc has something like
# If not running interactively, don't do anything
case $- in
*i*) ;;
*) return;;
esac
or a shorter version
# If not running interactively, don't do anything
[ -z "$PS1" ] && return
As Coolify uses a lot of SSH connections, if you have a .bashrc that does something on user login (even if we request no TTY, it's still used), this might have an impact. Depending on how Coolify handles this (on large compose files, multiple deployments, etc.), this can be an issue.
Note: I did a test on one of my servers. Every minute, Coolify runs 5 commands:
1. ls /
2. docker container ls -q
3. docker inspect [id1] [id2] .. --format '{{json .}}'
4. ls / (twice, not a typo)
5. docker inspect --format '{{json .}}' coolify-proxy (on that server the proxy is voluntarily stopped, so this shouldn't happen)
Measured by doing a tail -f on /var/log/auth.log to see the SSH connections/commands.
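For anyone who wants to reproduce that measurement without watching the log live, a rough way (assuming sshd logs to /var/log/auth.log with the traditional syslog timestamp format) is to group the session-open lines per minute:
sudo grep 'session opened for user root' /var/log/auth.log | awk '{print $1, $2, substr($3, 1, 5)}' | uniq -c | tail -n 20
Each output line is then the number of root SSH sessions opened in that minute (on a Coolify host, most of these come from Coolify).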
An improvement might be to reuse the same SSH session. Looking at the code, I can see there's a mux system, but I'm not sure it works; I added prints in the code and it does not seem to be enabled (maybe only locally, I haven't really checked).
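For reference, OpenSSH connection reuse would look roughly like this minimal sketch (paths and host are placeholders, not what Coolify necessarily does):
ssh -fNM -o ControlMaster=auto -o ControlPath=/tmp/mux-%r@%h:%p -o ControlPersist=1h root@<server-ip>
ssh -o ControlPath=/tmp/mux-%r@%h:%p root@<server-ip> 'docker container ls -q'
ssh -O check -o ControlPath=/tmp/mux-%r@%h:%p root@<server-ip>
The first command opens a background master connection and keeps it alive for an hour; the second reuses it without a new handshake; ssh -O check reports whether the master is actually alive, which is one way to verify whether multiplexing is really being used.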
Any update on this? I face the same issue using Hetzner.
Yes, I have that in my .bashrc.
Same issue here with a Scaleway Instance with 3 cores, 4 GB RAM.
I have Coolify + 3 Next.js deployments on this instance.
CPU-wise, everything was working "fine" until I configured health checks for my apps yesterday.
Overnight, CPU usage went crazy and made the instance completely unresponsive; I have to restart it each time just to be able to connect via SSH. The apps are down as well, obviously.
I am running v330. Is there a stable version I can downgrade to?
I'd love to know as well. I tried 260 and it was the same thing for me. I dropped Coolify completely. If there's a fix for this, I'll go back.
@RebootGG v323 has been stable for me!
Coolify maxed out my Hetzner server at 200% when I tried to deploy a midsize NodeJS app. Coolify, builds, and app are all on the same server. It had worked before. I restarted the server, upgraded Coolify to v4.0.0-beta.335, and started the deployment again. Now it worked and the deployment never maxed out. Of course, it spiked way above 100% but it always fluctuated and never got stuck at the maximum. So for me, the problem appears to be solved. Thank you!!
Nope, today the server maxed out again while deploying/building. Coolify and the projects hosted on the server were no longer reachable from the internet and instead showed a 502 Bad Gateway. What logs can I provide you with?
I'll edit this message if a problem occurs, but so far, running Coolify on a dedicated instance seems to fix this issue.
I'm deploying my apps on a separate server and both instances seem to be running fine.
(This is obviously not an ideal workaround in many cases.)
I had the same problem and had to hard-reset the server when the build process started. Configuring the swap file that programad suggested helped fix the problem for me. My server has 4 GB of RAM. Should I order one with more RAM instead?
Bump for this. I gave the server 70 GB of RAM and Coolify will use as much as is available; it does not matter. These spikes typically ONLY follow deployments. Coolify will NOT spike in memory by itself, only after deploying/redeploying/restarting.
After the latest update (340), my server can finally breathe. It's no longer at 100% usage on sshd or ssh. Thank you @peaklabs-dev @Vahor @andrasbacsai for resolving this issue.
After one night, the problem came back :(
I tried to track it with ps -ef | grep -w ssh | grep -v grep | grep -v pts and here's the result:
9999 8887 6543 0 13:42 ? 00:00:00 ssh -fNM -o ControlMaster=auto -o ControlPath=/var/www/html/storage/app/ssh/mux/j8o4ss0 -o ControlPersist=1h -i /var/www/html/storage/app/ssh/keys/id.root@j8o4ss0 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o PasswordAuthentication=no -o ConnectTimeout=10 -o ServerAliveInterval=20 -o RequestTTY=no -o LogLevel=ERROR -p 22 root@host.docker.internal
9999 8975 6543 0 13:42 ? 00:00:00 ssh: /var/www/html/storage/app/ssh/mux/j8o4ss0 [mux]
This is the last 7 days; FWIW, I don't have this issue of high CPU usage.
But I'm also not deploying, so just to confirm: do your issues only occur after / during deploying?
Same issue happening out of nowhere today for me as well; I was forced to upgrade the VPS.
Description
I have been running Coolify for 8 days with various services, encountering no prior issues. However, on the night of April 30th EST, I experienced a significant CPU usage spike starting around 11 PM, shortly after an unsuccessful attempt to deploy a React application. It is unclear whether this issue was directly related to the deployment failure, a potential attack, or another problem.
Expected Behavior
CPU usage should remain stable, without significant spikes, particularly when no active deployments or heavy tasks are underway.
Actual Behavior
CPU usage unexpectedly spiked to over 300% and remained high throughout the night, which was unusual and concerning, given the context.
Environment
Additional Context
The sudden surge in CPU usage occurred after the deployment failure, but it is uncertain whether the spike was a direct result of that event, a security issue, or another underlying problem. This incident warrants further investigation to prevent future occurrences.
I have already tried restarting my VPS twice, but the problem persists.
I would appreciate any insights or troubleshooting steps you could recommend to help identify and resolve the root cause of this spike. Thank you for your assistance.
Version
4.0.0-beta.271