
Improve loadavg checking for shared-kernel hosts #10131

Open bkil opened 3 years ago

bkil commented 3 years ago

Bug Description

Admins on shared hosting platforms are reporting that "Friendica can be unstable at times". A simple workaround is to increase the loadavg thresholds (maxloadavg and maxloadavg_frontend) to an extremely high value (like 1000).
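For reference, the workaround looks roughly like this in config/local.config.php (just a sketch; I'm assuming both keys live in the system category):

```php
<?php
// config/local.config.php (excerpt) - the workaround described above.
// Raising both thresholds to an extremely high value effectively
// disables the load check on shared-kernel hosts.
return [
    'system' => [
        'maxloadavg'          => 1000,
        'maxloadavg_frontend' => 1000,
    ],
];
```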

At least this should be documented, but it would be best to rethink this whole logic.

Background

The aim of being neighborly is noble, but flawed on multiple levels in most use cases today. I think the original reasoning was that if you are installing on bare metal and your physical box is under memory pressure and about to swap, any and all optional processes must be stopped; otherwise, as the condition worsens, even admins can't log in via ssh to fix the issue. These days this is usually achieved via containerization and cgroups instead.

Interestingly, in our experience, Friendica is relatively light on memory, though it can get quite database-heavy. It would therefore be more useful to monitor database utilization instead, especially if the database is running on a separate host, for example via proxies such as the number of currently open database connections or the number of running transactions.
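A rough sketch of what such a check could look like (assuming a MySQL/MariaDB backend reachable over PDO; the function name and threshold are made up):

```php
<?php
// Sketch only: refuse new background work when the database reports too
// many actively running threads.
function databaseIsBusy(PDO $db, int $maxRunningThreads = 20): bool
{
    // Threads_running counts connections currently executing a statement;
    // Threads_connected would count all open connections instead.
    $row = $db->query("SHOW GLOBAL STATUS LIKE 'Threads_running'")
              ->fetch(PDO::FETCH_ASSOC);

    return isset($row['Value']) && (int) $row['Value'] > $maxRunningThreads;
}
```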

If we wanted to protect against a flood of users (DDoS or just the "Slashdot effect"), I think it would be best to check metrics that are visible to the end user. For example, new connections should probably not be accepted if the average class init time/response time grew to 10x its normal value (it's unfortunate that response time is not very tightly controlled in Friendica, but I hope you get the idea; see the sketch below). The existing min_memory setting is also useful as a safeguard, although I think many would prefer to express it as a proportion.
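As a sketch of what I mean (assuming APCu is available to share a rolling average between requests; names, baseline and thresholds are illustrative only):

```php
<?php
// Sketch only: track an exponential moving average of response times and
// refuse new requests once it is roughly 10x the configured baseline.
function recordResponseTime(float $seconds): void
{
    $avg = apcu_fetch('avg_response_time');
    $avg = ($avg === false) ? $seconds : 0.95 * $avg + 0.05 * $seconds;
    apcu_store('avg_response_time', $avg);
}

function shouldRefuseNewRequests(float $baselineSeconds): bool
{
    $avg = apcu_fetch('avg_response_time');

    return $avg !== false && $avg > 10 * $baselineSeconds;
}
```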

Issues with loadavg

Even if we accepted loadavg as a proxy for system utilization on dedicated physical machines, the limit should be multiplied by the number of cores either way. I think multi-core home computers were only starting to spread around the time Friendica was started, hence this oversight.
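For illustration, a per-core check could look like this (sketch only, Linux-specific, not Friendica's actual code):

```php
<?php
// Sketch only: normalize the 1-minute load average by the core count
// instead of comparing it against an absolute constant.
function cpuCoreCount(): int
{
    $cpuinfo = @file_get_contents('/proc/cpuinfo');
    $cores   = $cpuinfo ? preg_match_all('/^processor\s*:/m', $cpuinfo) : 0;

    return max(1, (int) $cores);
}

function isOverloaded(float $maxLoadPerCore): bool
{
    $load = sys_getloadavg(); // [1 min, 5 min, 15 min]

    return $load !== false && $load[0] > $maxLoadPerCore * cpuCoreCount();
}
```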

However, note that loadavg is shared between all tenants in many shared-kernel environments, such as LXC and other container-based hosting.

(Notable exceptions where the kernel is separate: Xen, KVM or vSphere based VPS.)

Note that if nothing other than Friendica (and perhaps its database) is running in your container/VM, you shouldn't really care much about neighborliness - fair sharing of the CPU is already taken care of by the PaaS/IaaS provider.

For example, on the LXC VPS we tested, loadavg is usually very low (less than half the number of physical hyperthreads of the blade), but a fork bomb or some BOINC-based COVID vaccine search probably got loose on a neighboring tenant, and we are now seeing 8x the previous loadavg, so our previously configured constant became insufficient.

At the same time, our VM is still responding normally, with nginx serving static pages in under 40ms (maybe with a bit more variability than usual). So it's only the loadavg number that grew out of bounds - isolation is taking care of the rest.

But if we work around this issue by setting Friendica's loadavg threshold to an extremely high value, we're basically disabling its protection against DDoS, which is unfortunate in the long run.

Steps to Reproduce

  1. Run Friendica in an LXC container
  2. Start 100 infinite loops on the host or in a separate LXC container

Actual Result:

Friendica intermittently returns HTTP 503 ("The node is currently overloaded") and the background worker stops running, even though the Friendica container itself is nearly idle.

Expected Result:

Friendica keeps serving requests and running its worker, since the load originates from another tenant and does not affect its own container.

MrPetovan commented 3 years ago

Thank you for the detailed request and write-up. I don't think the load average check was meant as DDoS protection. After all, since we're doing this check in PHP, it happens far enough down the request stack that the request can still do damage. It's more a convenience setting to balance background worker tasks and frontend display on dedicated machines. If this setting isn't relevant to you, there's no risk in not using it, and any desired DDoS protection should be done at a lower level in the hosting infrastructure anyway.

bkil commented 3 years ago

Am I seeing it incorrectly, or is there really no way to disable this check, with defaults of 20/50? It took quite some time to figure out why our nodes were not "stable".

It either needs a much higher default or it needs a FAQ/documentation entry to help diagnose and solve HTTP 503 problems.

MrPetovan commented 3 years ago

Currently there is no way to disable it other than to input a ridiculously high value, and the default values are debatable as well. I was just commenting on the DDoS protection aspect of this setting, or rather the lack thereof.

tobiasd commented 3 years ago

The load check is extremely useful in environments like hosting on a Raspberry Pi, to keep the background processes in check; otherwise they will knock out the RasPi. On the RPi 2B I usually had load averages of 2 and 8 set for the background and frontend processes. On the 4B one can go to much higher values.

annando commented 3 years ago

We can add a behaviour where, for example, a value of 0 or -1 completely disables the load checks. But like @tobiasd said: the load check is an essential part of the worker for calculating the available number of worker tasks, so it would be kept enabled by default.
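Roughly like this (just a sketch of the proposed opt-out, not the actual worker code):

```php
<?php
// Sketch of the proposed opt-out: a value of 0 or -1 (or an unset value)
// would skip the load check entirely; otherwise behave as today.
function loadLimitReached(?float $maxLoadAvg): bool
{
    if ($maxLoadAvg === null || $maxLoadAvg <= 0) {
        return false; // check disabled by configuration
    }

    $load = sys_getloadavg();

    return $load !== false && $load[0] > $maxLoadAvg;
}
```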

bkil commented 3 years ago

I haven't done any measurements on that one, but I think (and have read in guides) that the SD card of a Raspberry Pi 2 (and 4) makes it especially I/O-limited, so everything ends up blocking on the database. So maybe monitoring some of the database metrics proposed above could still ensure its responsiveness.

Having an option to disable the check may be a step in the right direction, but it only helps once someone knows that this is the root cause. Without extensive knowledge, one may not realize why the worker is never running (or not running for half a day), or why the web server's access log is full of 503 errors.

To help diagnose this, could we perhaps mention the maxloadavg* settings on the HTTP 503 error page in the browser? The page only appears intermittently, so catching it takes some luck, and without looking at the log the user still couldn't tell why the worker is not running when loadavg is between 20 and 50, but at least this could give a hint in many cases.

src/App.php: `throw new HTTPException\ServiceUnavailableException('The node is currently overloaded. Please try again later.');`
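For example, the hint could be appended to the message thrown there (the wording below is only a suggestion, not an actual patch):

```php
// Hypothetical variant of the existing throw in src/App.php; the extra
// sentence pointing at the config keys is the only change.
throw new HTTPException\ServiceUnavailableException(
    'The node is currently overloaded. Please try again later.'
    . ' (Administrators: this limit is controlled by the system.maxloadavg'
    . ' and system.maxloadavg_frontend settings.)'
);
```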

And perhaps there could be an entry in the FAQ or troubleshooting guide about what to do when seeing HTTP 503 Service Unavailable or when a worker is not running.

bkil commented 3 years ago

By the way, I didn't want to imply that such a check was originally meant to catch a DDoS of thousands of parallel connections per se. What I meant is that I think it is also implicitly used for limiting the number of concurrent user actions, as in the case when a dozen users start clicking around on the site.

To ensure that legitimate users see bounded response times, a site should indeed limit the number of parallel requests that it serves (through its frontend as well as its API and federation backend). My gut feeling, as mentioned above, is that Friendica is bottlenecked by database I/O (along with PHP boot & class init). So if we terminate processing early on (as this loadavg check does), we can probably greatly reduce the resources consumed by requests that exceed the parallel connection limit.

Its aim was also to reduce the database I/O of the workers during peak hours so that the frontend feels snappier to users. Maybe there could be alternative solutions to this.

It would be ideal to benchmark how many concurrent connections performing typical operations a given site could sustain and enforce an explicit connection limit based on that, but that sounds like a lot of work.

> I don't think the load average check was meant as DDoS protection. After all, since we're doing this check in PHP, it happens far enough down the request stack that the request can still do damage. [...] Any desired DDoS protection should be done at a lower level in the hosting infrastructure anyway.

annando commented 3 years ago

BTW: the backend already has a check for available database connections. But the system load is - at least under Linux - the best indicator of a system's health.
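(Roughly, such a connection check compares the open connections to the server's max_connections limit; the snippet below is only a sketch for illustration, not the actual backend code.)

```php
<?php
// Sketch only, not the actual backend check: leave some headroom between
// the currently open connections and the server's max_connections limit.
function connectionsAvailable(PDO $db, float $headroom = 0.2): bool
{
    $max  = (int) $db->query("SHOW VARIABLES LIKE 'max_connections'")
                     ->fetch(PDO::FETCH_ASSOC)['Value'];
    $open = (int) $db->query("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
                     ->fetch(PDO::FETCH_ASSOC)['Value'];

    return $max > 0 && $open < (1.0 - $headroom) * $max;
}
```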