bitnami / vms

Bitnami VMs
https://bitnami.com

[Wordpress] CPU spikes every day or two so high that I can't even SSH to my instance #1445

Closed ncosentino closed 5 months ago

ncosentino commented 6 months ago

Platform

AWS

bndiagnostic ID

4a46ad1d-efb5-609a-fdc8-54f745baa97e

bndiagnostic output

I didn't get any diagnostic information printed, just the following:

The diagnostic bundle was uploaded successfully to the Bitnami servers. Please copy the following code: 4a46ad1d-efb5-609a-fdc8-54f745baa97e And paste it in your Bitnami Support ticket.

bndiagnostic was not useful. Could you please tell us why?

I never got to see any output, unfortunately

Describe your issue as much as you can

My Bitnami Lightsail instance works great except that I need to be on standby to reboot it every 24-48 hours. I'm at my wit's end, so I'm reaching out for some support on this. It seems to happen mostly late in the evening (9-11 PM) or early in the morning (5-6 AM).

I've included a recent snapshot. Unfortunately it trims out some of the spikes crossing my 20% CPU alarm boundary, I assume because I caught them quickly enough that there weren't sufficient data points: [screenshot: Lightsail CPU utilization graph]

90% of the time when this happens, I have no choice but to reboot from the Lightsail console. I can't SSH from my own client or from the Lightsail webpage. The instance is completely inaccessible as it eats all of the CPU it can. I've had to wait over 30 minutes for the instance to come back, but usually I'm back online in 5-10 minutes (that's still 5-10 minutes too long for downtime).

The other 10% of the time I suspect the Bitnami stack services get nuked, because the CPU drops to zero - as in nothing but the OS is running. When this happens I can SSH in again, but checking the Bitnami services with the ctl script shows none are running. If I restart the services with the script, everything is back to normal -- until the next 24-48 hours.
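For anyone following along, the check/restart I end up doing over SSH is roughly this (assuming the default /opt/bitnami/ctlscript.sh path that ships with these images):

```bash
# See which Bitnami services (apache, mariadb, php-fpm) are still running
sudo /opt/bitnami/ctlscript.sh status

# Bring the whole stack back up without rebooting the instance
sudo /opt/bitnami/ctlscript.sh restart
```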

I have not been able to correlate this with anything in particular. I have some speculation that it's plugin-related, but I have no idea where to start since I can't get any time periods to match up.

Given that it seems to happen mostly at the hours I mentioned, I wanted to assume it's something on a cron job. But I'm definitely looking for some help. I spend very little time on Linux, but I have a software engineering background, so I'm happy to poke around wherever I need to if someone can point me in the right direction.
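In case it helps narrow things down, this is roughly how I've been checking for scheduled jobs that could fire at those hours (standard Linux commands, nothing Bitnami-specific as far as I know):

```bash
# Per-user crontabs
crontab -l
sudo crontab -l -u root

# System-wide cron locations
ls /etc/cron.d /etc/cron.hourly /etc/cron.daily /etc/cron.weekly
sudo cat /etc/crontab
```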

I suspect it's in the diagnostic bundle, but in case it helps, the website is here: https://www.devleader.ca

Thanks very much for any assistance.

jotamartos commented 6 months ago

I suggest you take a look at these guides to check whether there is a performance issue or a bot/attacker accessing your site with malicious intent:

https://docs.bitnami.com/general/faq/troubleshooting/troubleshoot-server-performance/
https://docs.bitnami.com/general/apps/wordpress/troubleshooting/deny-connections-bots-apache/
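As a quick starting point, something like this against the Apache access log usually shows whether a few IPs or user agents are hammering the site (the path below is the default for Bitnami installations; adjust it if it differs on your instance):

```bash
# Top 10 client IPs by request count
sudo awk '{print $1}' /opt/bitnami/apache2/logs/access_log | sort | uniq -c | sort -rn | head -10

# Top 10 user agents, useful for spotting aggressive bots
sudo awk -F'"' '{print $6}' /opt/bitnami/apache2/logs/access_log | sort | uniq -c | sort -rn | head -10
```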

Apart from that, if WordFence doesn't find any malicious code, I suggest you temporarily disable the plugins in WordPress and confirm whether one of them is the root cause of the issue.

ncosentino commented 6 months ago

Thanks! I'll check out these links. Currently, I've disabled some WordFence features to experiment and see if that eliminates these death spikes. Will report back if I find it's stabilized, and if so, then I can reach out to the WordFence team for more information. Might be something compatibility-wise that I can share back here for the community.

Since disabling those features, no big failures... so it's a starting point.

ncosentino commented 6 months ago

Following up here again... I had a pretty good few days after turning off WordFence scanning and thought I was in the clear. But in the past few days, I've had CPU spikes that shoot way up and end up tanking all of the services. In some cases I've been able to SSH in right from Lightsail; in other cases I have to reboot the VM from Lightsail just to be able to connect via SSH.

Cloudflare shows no odd traffic patterns, so I don't think this is a DDoS/bot issue, although I could be wrong. The diagnostic tool shows that I have low available RAM after a reboot, which is very odd since the VM has 2GB of RAM - that should be plenty. Lightsail doesn't have a dashboard for RAM, so that's unfortunately not much help :)
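Since Lightsail won't graph memory, here's how I've been checking it by hand over SSH (keeping in mind Linux counts page cache as "used", so the "available" column is the one that matters):

```bash
# Memory snapshot: "available" is what's actually free for new processes
free -m

# Top memory consumers, to see whether PHP/MariaDB are eating the 2 GB
ps aux --sort=-%mem | head -15
```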

Here's the latest diagnostic tool output ID (I updated it): e41729db-de69-9b77-d1b9-47d779101407

ncosentino commented 6 months ago

Trying to demonstrate some evidence that this is not due to overwhelming traffic or something. This just happened over the last hour, so this data is fresh at least:

[screenshot]

[screenshot]

These particular issues are not leaving the machine in a weird state where I cannot even SSH into it, so the good news is I no longer need to reboot from the Lightsail portal. The bad news is that I still need to SSH in and restart the Bitnami stack.

If you look closely at the Lightsail graph, you can see that after the big spike it plummets to nearly 0% CPU because there are no web services running.

Please let me know if there's any other debug information I can provide. I'm trying to decide whether I should just bite the bullet and migrate my WordPress site to managed hosting, because I can't babysit this.

EDIT: I may have misspoken, because there weren't other events that caused a server outage, but there DOES seem to be a correlation between incoming traffic and CPU bursts.

[screenshot]

Now looking at Cloudflare for traffic: [screenshot]

So the earlier spikes in CPU (before 11:00) are all because Cloudflare was not serving cached results. That REALLY doesn't seem like enough traffic to a normal web page to cause the kind of CPU issues we're seeing... and the 11:40 spike around the time of the outage wasn't even one where the origin was compensating for Cloudflare, and it looks like one of the weaker traffic spikes.

So this might be due to traffic, but I feel like these specs should be able to serve my web pages without issue:

- 2 GB RAM
- 2 vCPUs
- 60 GB SSD

What might be the next thing to investigate here?

ncosentino commented 6 months ago

Following up with more data - this wasn't enough to tank my server, but it was an interesting spike in load, mostly from wp-cron.php and the feed endpoints...

Here's Lightsail (again, this isn't bringing down the server by any means): [screenshot]

Here's Cloudflare: [screenshot]

Just sharing this as a data point. I don't know enough about the inner workings of WordPress, but a quick search suggested that the default wp-cron can overload your server and that switching to a system-level cron is a better solution.

Not sure if that's a good course of action here, just trying to problem-solve in public.

Some thoughts:

- Is there guidance around moving from wp-cron? (I know nothing about this)
- Cloudflare isn't set to ignore my feed for caching... I think if that was cleared only when I publish site updates that would make the most sense, no? My origin shouldn't need to serve it hundreds of times in a small window...
- My admin login is no longer at /wp-admin... so it seems suspicious to have that many things trying to hit an ajax.php file there... Is that safe to auto-block?

jotamartos commented 6 months ago

> So this might be due to traffic, but I feel like these specs should be able to serve my web pages without issue:
>
> - 2 GB RAM
> - 2 vCPUs
> - 60 GB SSD
>
> What might be the next thing to investigate here?

Have you checked the Apache and PHP log files? You will probably find more information there.
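On Bitnami installations the logs live under /opt/bitnami; something like this should surface errors around the time of a spike (exact paths can vary slightly between image versions):

```bash
# Apache errors and recent requests
sudo tail -n 200 /opt/bitnami/apache2/logs/error_log
sudo tail -n 200 /opt/bitnami/apache2/logs/access_log

# PHP-FPM log (this path may differ depending on the image version)
sudo tail -n 200 /opt/bitnami/php/logs/php-fpm.log
```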

> Is there guidance around moving from wp-cron? (I know nothing about this)

We have a guide to disable wp-cron, but you should take a look at the official documentation as well:

https://docs.bitnami.com/general/apps/wordpress/configuration/disable-wordpress-cron/
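The guide boils down to roughly the following (a sketch only, assuming the default /opt/bitnami/wordpress install path and your site URL; follow the guide for the exact steps for your image):

```bash
# Step 1 (manual edit): tell WordPress not to trigger wp-cron on every page load
#   by adding this line to /opt/bitnami/wordpress/wp-config.php:
#     define('DISABLE_WP_CRON', true);

# Step 2: run the scheduled tasks from the system crontab instead,
#   e.g. every 15 minutes (run `crontab -e` and add):
#     */15 * * * * curl -s "https://www.devleader.ca/wp-cron.php?doing_wp_cron" > /dev/null
```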

> Cloudflare isn't set to ignore my feed for caching... I think if that was cleared only when I publish site updates that would make the most sense, no? My origin shouldn't need to serve it hundreds of times in a small window...

That's something you will need to work on with the Cloudflare team.

> My admin login is no longer at /wp-admin... so it seems suspicious to have that many things trying to hit an ajax.php file there... Is that safe to auto-block?

According to this thread, you can't block those requests because there may be a plugin using them. I don't know if that's what is causing the problem:

https://community.cloudflare.com/t/how-to-protect-wordpress-admin-ajax-php-the-best-way-without-breaking-things/606966
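Before deciding anything, it may be worth checking who is actually hitting that endpoint; something like this against the access log (same default Bitnami path as the other logs) shows the top callers, so you can tell legitimate plugin traffic apart from abuse:

```bash
# Top IPs requesting admin-ajax.php
sudo grep 'admin-ajax.php' /opt/bitnami/apache2/logs/access_log \
  | awk '{print $1}' | sort | uniq -c | sort -rn | head -10
```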

github-actions[bot] commented 5 months ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] commented 5 months ago

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.