coollabsio / coolify

An open-source & self-hostable Heroku / Netlify / Vercel alternative.
https://coolify.io
Apache License 2.0
31.47k stars, 1.61k forks

[Bug]: High CPU Spike every hour #3226

Open Vahor opened 2 weeks ago

Vahor commented 2 weeks ago

Description

I observed lag spikes every hour, and while investigating I found this:

The spike takes 30% of all cores, and even 100% on a small single-core machine. None of these servers is the main server, so it shouldn't be related to the auto update/check. And even if it were, the cron for these jobs is not configured to run every hour.

From what I can see, it starts after a pull of coolify-helper. image image image

(screenshots taken at 15:00)

So I've checked in /horizon/jobs/completed and I can see multiple jobs (that ran at ~15:00):

  1. PullHelperImageJob
  2. ServerCheckJob
  3. DockerCleanupJob (runs every 10 min, but I only have a spike every hour, so it should not be the issue)
  4. InstallLogDrain (log drain is not installed on these servers, but the job appeared about 8 times on the completed jobs page for 15:00, so it seems worth noting; it seems that installLogDrain is called every 2s on the queue)
  5. ServerCheckJob
  6. PullTemplatesFromCDN (probably for the main instance; is it possible to disable this? My apps are already installed, so I don't need to refresh the templates. Could a manual refresh replace this?)
  7. CleanupInstanceStuffsJob
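
The elimination reasoning used in the list (a job that runs every 10 minutes is unlikely to explain an hourly spike) can be sketched as a small script. This is only an illustration: `median_interval` and `jobs_matching_period` are hypothetical helpers, and the timestamps are made-up sample data, not values exported from Horizon.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median


def median_interval(timestamps):
    """Median gap between consecutive runs, or None with fewer than two runs."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return median(gaps) if gaps else None


def jobs_matching_period(runs, period, tolerance=timedelta(minutes=5)):
    """Names of jobs whose median run interval is within `tolerance` of `period`."""
    by_job = defaultdict(list)
    for name, finished_at in runs:
        by_job[name].append(finished_at)
    matches = []
    for name, stamps in by_job.items():
        gap = median_interval(stamps)
        if gap is not None and abs(gap - period) <= tolerance:
            matches.append(name)
    return sorted(matches)


# Hypothetical sample data: one job completing hourly, one every 10 minutes.
start = datetime(2024, 8, 26, 13, 0)
runs = [("PullHelperImageJob", start + timedelta(hours=h)) for h in range(3)]
runs += [("DockerCleanupJob", start + timedelta(minutes=10 * m)) for m in range(13)]

hourly = jobs_matching_period(runs, timedelta(hours=1))
print(hourly)
```

With real completion times copied from /horizon/jobs/completed, only the jobs whose cadence matches the spike period would remain as candidates.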

I don't really know where to look next; in Horizon I don't see any way to filter for a single server. If you know of one, tell me and I can share more information.

Minimal Reproduction (if possible, example repository)

It appears on all my servers at the same time, every hour. So set up a Coolify instance on a server (maybe add a second server) and wait for the CPU spike 🏕️

Exception or Error

No response

Version

v4.0.0-beta.323

Cloud?

imiborbas commented 2 weeks ago

My server crashed today, and after some investigation I found that the crash most likely happened while the worker was running a PullHelperImageJob.

Screenshot 2024-08-26 at 18 41 41

The screenshot above shows the list of failed jobs, where it can be seen that PullHelperImageJob runs every hour and fails every time. The job at the bottom of the list is the one that I believe crashed my server; it seems to have taken 1414 seconds to run. During all this time the server was unresponsive, and I had to restart it from the cloud console to get it back up again.
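
One way to check whether the pull itself hangs, independently of the queue worker, would be to time it by hand with a hard timeout. The following is a minimal sketch, assuming the job ultimately shells out to a `docker pull` (not confirmed in this thread); `run_with_timeout` is a hypothetical helper, not part of Coolify.

```python
import subprocess
import time


def run_with_timeout(command, timeout_seconds):
    """Run `command` and return (status, elapsed_seconds).

    status is "ok" (exit code 0), "failed" (non-zero exit code),
    or "timeout" (the process was killed after timeout_seconds).
    """
    started = time.monotonic()
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout_seconds)
        status = "ok" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        status = "timeout"
    return status, time.monotonic() - started
```

For example, `run_with_timeout(["docker", "pull", HELPER_IMAGE], 300)`, where `HELPER_IMAGE` is a placeholder for whatever image reference the instance actually uses, would report a "timeout" after five minutes instead of leaving the pull running for 1414 seconds.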

Tried to dig deeper, but I can't seem to be able to get any logs or stack traces.

ShowtimeProp commented 2 weeks ago

Yes, I am experiencing a lag issue at least once a day, and the server becomes totally "NOT RESPONSIVE". I'm very happy with Coolify, so it's a pity, because this is very annoying.

imiborbas commented 2 weeks ago

Here is a bit more info that I hope will help:

First, I added a few logging lines to the code to figure out what is happening:

Screenshot 2024-08-27 at 08 57 18

Then, I tried to dispatch the job synchronously and see if there's anything in the logs:

Screenshot 2024-08-27 at 09 00 30

This did not result in anything printed to the logs.

Next thing was to try calling PullHelperImageJob#handle() directly:

Screenshot 2024-08-27 at 09 00 50

This took a little longer to run, and left the following messages in the log, indicating a successful run:

Screenshot 2024-08-27 at 08 59 48

This seems to suggest that there is nothing wrong with the code itself. It might be a configuration issue, but I don't know.

I hope this info helps a bit, and sorry in case this is not related to the problem posted by OP.

brandnewx commented 1 week ago

Coolify is running an endless loop of cron/maintenance jobs, causing constant 100% CPU usage.