Regression in 1.8.5: Rate Limited. You sent too many messages to this pad so it disconnected you.

doobry-systemli commented 3 years ago

Describe the bug In some occasions (certainly not when typing too fast), a rate limit is detected and etherpad disconnects the client connection. This didn't happen with 1.8.4 and only started showing up after we upgraded to 1.8.6. Several independend users reported the same issue.

To Reproduce No real reproducer found yet. It's not related to the number of people connected to the pad, and not related to the typing velocity. Maybe related to the number of connections per client IP address?

Expected behavior Rate limiting should not hit when pads are used in "a normal way"™

Desktop (please complete the following information)

OS: Debian Linux (Buster)
Browser: Mozilla Firefox
Version: 80.0.1

Additional context The code that is triggered can be found here.

EDIT: The regression was introduced in 1.8.4, not in 1.8.5. We upgraded from 1.8.4 to 1.8.6.

doobry-systemli commented 3 years ago

This is the MR that introduced rate limiting: https://github.com/ether/etherpad-lite/pull/4036

So most likely, the default values are too strict for normal usecase.

JohnMcLear commented 3 years ago

Did you disable all plugins before checking? It's possible the values are too strict but it's also possible you have a plugin spamming socket msgs

doobry-systemli commented 3 years ago

Do you believe that the issue could be related to plugins? Problem is, our Etherpad instance is used quite a lot and disabling plugins for longer will result in unhappy users :wink:

In case we find an easy reproducer, we'll certainly retry to reproduce with plugins temporarily disabled. But honestly I don't see a possible relation to plugins, do you?

webzwo0i commented 3 years ago

Based on your plugins the only plugin that can be the cause is ep_comments_page I would try to output the padId and message type (it's probably message.padId and message.data.type) during the disconnect

JohnMcLear commented 3 years ago

I remember some cursortrace / caret tracking plugins sent lots of messages making it cause rate limit. Not sure if it's still the case.

webzwo0i commented 3 years ago

@JohnMcLear Any chance it's not client.handshake.address but client.handshake.headers['x-forwarded-for']? I have the feeling the first is always the proxy in case of reverse proxy setup?

webzwo0i commented 3 years ago

In the case of trustProxy=true I think it should be client.handshake.address and if trustProxy=false client.handshake.headers['x-forwarded-for']

doobry-systemli commented 3 years ago

To my knowledge, we don't use a cursortrace plugin. Here's the list of installed plugins:

ep_page_view
ep_comments_page
ep_delete_empty_pads
ep_headings2
ep_markdown
ep_scrolltoanchor
ep_spellcheck
ep_stats_extended
ep_sync_status
ep_table_of_contents

JohnMcLear commented 3 years ago

@doobry-systemli we test rate limiting etc. when load testing so we think the value is okay but you can try to increase the value to see if it remedies your problem. I'd like to know if something awkward is going on IE improper config of reverse proxy where the correct IP address is not being shown to the limiter. Alternatively if all of your users come from behind a proxy / gateway where they are natt'd it could rate limit them too.... IE schools...

doobry-systemli commented 3 years ago

@JohnMcLear thanks! We indeed run Etherpad behind a Apache2 Reverse Proxy.

For now we didn't get further reports regarding this issue and we also failed to reproduce the issue ouselves after hitting it once. It's indeed possible that the situations where people hit the rate limit where due to people using a pad from the same source IP.

Is the IP-based rate-limiting per pad or instance-wide? In other words, can I hit the limit when others from same source IP are connected to other pads on the same instance?

JohnMcLear commented 3 years ago

instance wide.

RuthCatlow commented 3 years ago

Confirm that this is a persistent problem for us - a pad used by 2 people - intermittently in different locations. Absolutely normal use. This is a new problem for us.

JohnMcLear commented 3 years ago

@RuthCatlow can you please increase the max value in settings.json under section importExportRateLimiting and let ujs know if this fixes it for you.

The number of people on the pad doesn't matter so much. By normal use, it's really hard for us to define "normal use" as normal use between one of our users is totally different from others. But I assume by normal you mean Etherpad without plugins?

RuthCatlow commented 3 years ago

Sorry John, I should have been clearer I am using a free service at https://pad.riseup.net/

JohnMcLear commented 3 years ago

@RuthCatlow then please talk to the riseup team :+1: )

webzwo0i commented 3 years ago

I had no time to get CI for this in, yet. Our current tests with loadtest and import/export basically show that if you increase the value everything is fine. @JohnMcLear importExportRateLimiting is probably fine because it's directly used as express middleware and thus express uses the right IPs. I did not verify that either, though. RE this issue: The reason I could not test with etherpad-cli-client was, that I was using the disconnect-event and not listening for messages like {disconnect: 'blabla'}.

I think the patch in https://github.com/ether/etherpad-lite/compare/fix-ratelimit is correct and fixes the issue of using the reverse proxy's IP. However, it won't disconnect every client that shares the IP, but only the client that sent the message that exceeded the limit. We need to iterate over all the clients or introduce some byIPAddress-lookup to get this right.

While I think the rate limit for import/export is fine, because those are definitely operations that consume resources, I don't think that rate limiting changesets is the right thing to do. Etherpad is very fast at processing them and we have no way to assess the impact of a changeset (like the no. of attributes) in terms of resource consumption.

imo 2 options:

drop the rate limit for changesets for the moment, advise people to use rate limit on a HTTP requests/IP-level and investigate if there is a way to assess resource consumption to detect crafted changesets that consume resources in contentcollector
write test coverage to detect if the above mentioned really fixes the issue and implement logic to kick every other client that shares the IP.

Will report back in an hour if I had luck to implement the test coverage

JohnMcLear commented 3 years ago

@webzwo0i setting loadTest to true disables rate limiter.

Lemme know RE fix-ratelimit PR et al.

doobry-systemli commented 3 years ago

While I think the rate limit for import/export is fine, because those are definitely operations that consume resources, I don't think that rate limiting changesets is the right thing to do. Etherpad is very fast at processing them and we have no way to assess the impact of a changeset (like the no. of attributes) in terms of resource consumption.

Limiting rate limits to import/export sounds like a sensible thing to do :+1:

JohnMcLear commented 3 years ago

Rate limiting changesets is a method to stop a bad actor committing too much damage. It's required imho. I can explain in private over email Webzzzzz if you want?

rhansen commented 3 years ago

@webzwo0i setting loadTest to true disables rate limiter.

WARNING: Setting loadTest to true makes it possible for anyone to bypass authentication and authorization checks.

Rate limiting changesets is a method to stop a bad actor committing too much damage. It's required imho.

Agreed.

However, it won't disconnect every client that shares the IP, but only the client that sent the message that exceeded the limit.

I think that's OK. The goal should be to keep the system from buckling under load, not to stop all malicious traffic. Disconnecting a random user each time the limit is exceeded meets that goal.

I don't think that rate limiting changesets is the right thing to do. Etherpad is very fast at processing them

Cheap != free. Early load shedding is critical to providing a service (in general, not just Etherpad) that is reliable when scaled to huge numbers (e.g., millions of concurrent users). Without early load shedding you can experience sudden global outages: One server crashes because the container runs out of memory due to heavy load, then the load that was previously handled by that server shifts to other servers, causing them to OOM. These cascading failures continue until all processes are in an OOM crash loop. Recovering from this is not easy. The longer the process takes to restart the worse the outage is.

Shedding load can (and should) be done before the packets even reach Etherpad (e.g., at the load balancer), but having some load shedding functionality in Etherpad itself has some benefits:

Defense in depth. (For services that are small enough to not use load balancers this might be the only defense available.)
Graceful degradation under heavy load. (Example: higher update latency due to changeset batching at the client vs. 502 bad gateway.)
Faster reaction to load spikes, which helps avoid load oscillation.
Safer to run each server close to its capacity, which reduces overprovisioning overhead.

and we have no way to assess the impact of a changeset (like the no. of attributes) in terms of resource consumption.

Are you talking about cost-based rate limiting? (In other words, rate limit according to the amount of CPU, memory, bandwidth, disk I/O, etc. a particular changeset will incur, instead of X changesets per second.) If so, I don't think that's necessary. Thanks to the law of large numbers, if a single server can handle lots of changesets per second relative to the cost variance, then there won't be much variance in resource consumption when limiting by changesets per second instead of cost.

rhansen commented 3 years ago

(For simplicity, I'm going to refer to HTTP requests per second plus socket.io messages per second as QPS.)

I think there should be two or three layers of rate limiting:

Server-wide total QPS limit. (defense against overload collapse)
QPS limit per client IP address. (defense against malicious users)
QPS limit per socket.io connection. (optional, defense against bugs in client code)

The server-wide limit can be replaced with a resource-based limit (e.g., the server is considered to be overloaded if it is using more than x% CPU or more than x MiB RAM), but that's probably more difficult to implement in a way that works across all systems (Linux, *BSD, illumos, Windows, etc.).

micah commented 3 years ago

I think the patch in https://github.com/ether/etherpad-lite/compare/fix-ratelimit is correct and fixes the issue of using the reverse proxy's IP.

This branch doesn't have any commits in it, did it get removed?

disturbio commented 3 years ago

@doobry-systemli: did any suggestion work for you? we had to rollback to 1.8.4 as it made our instance impossible to use.

webzwo0i commented 3 years ago

@micah It's up again: ~~https://github.com/ether/etherpad-lite/compare/fix-ratelimit#diff-5d2aa4336c4964b828a123e20e09b73c~~

See https://github.com/ether/etherpad-lite/pull/4373

rhansen commented 3 years ago

Does increasing commitRateLimiting.points in settings.json not work?

webzwo0i commented 3 years ago

@rhansen thanks for your explanation! I must think about this a little bit more when I'm rested. :-) Is my understanding right, that you vote for not kicking everyone that shares the same IP, when a rate limit is reached? In this case I think the patch is ready.

@JohnMcLear In case you have some more input to add to @rhansen's post I'd welcome it.

https://travis-ci.org/github/ether/etherpad-lite/builds/731127336 has a docker setup with a reverse proxy and checks if etherpad takes the real IP or the proxy's address. https://travis-ci.org/github/ether/etherpad-lite/builds/731127483 has the fix applied.

@micah @disturbio @doobry-systemli https://github.com/ether/etherpad-lite/pull/4373 has the PR. Can you test please?

doobry-systemli commented 3 years ago

@doobry-systemli: did any suggestion work for you? we had to rollback to 1.8.4 as it made our instance impossible to use.

We just raised commitRateLimiting.points from 10 to 100. According to the metrics, not too many users ran into the rate limit before anyway, but we have to investigate a bit further. For the moment I hope that the raised limit works as a workaround :wink:

rhansen commented 3 years ago

Is my understanding right, that you vote for not kicking everyone that shares the same IP, when a rate limit is reached?

Yes, your understanding is right.

disturbio commented 3 years ago

@rhansen no, it didn't work with our current users (increased adding several zeros at the end).

@webzwo0i will try it later today or tomorrow. thanks for your help with this!

disturbio commented 3 years ago

@webzwo0i seems to be working \o/

JohnMcLear commented 3 years ago

Can we close this now? @rhansen @webzwo0i

dcht00 commented 3 years ago

Removing ep_cursortrace solved this problem for me (Etherpad 1.8.6 behind an Apache reverse proxy).

ether / etherpad-lite

Regression in 1.8.5: Rate Limited. You sent too many messages to this pad so it disconnected you. #4340