Closed doobry-systemli closed 3 years ago
This is the MR that introduced rate limiting: https://github.com/ether/etherpad-lite/pull/4036
So most likely, the default values are too strict for normal usecase.
Did you disable all plugins before checking? It's possible the values are too strict but it's also possible you have a plugin spamming socket msgs
Do you believe that the issue could be related to plugins? Problem is, our Etherpad instance is used quite a lot and disabling plugins for longer will result in unhappy users :wink:
In case we find an easy reproducer, we'll certainly retry to reproduce with plugins temporarily disabled. But honestly I don't see a possible relation to plugins, do you?
Based on your plugins the only plugin that can be the cause is ep_comments_page I would try to output the padId and message type (it's probably message.padId and message.data.type) during the disconnect
I remember some cursortrace / caret tracking plugins sent lots of messages making it cause rate limit. Not sure if it's still the case.
@JohnMcLear
Any chance it's not client.handshake.address
but client.handshake.headers['x-forwarded-for']
? I have the feeling the first is always the proxy in case of reverse proxy setup?
In the case of trustProxy=true I think it should be client.handshake.address
and if trustProxy=false client.handshake.headers['x-forwarded-for']
To my knowledge, we don't use a cursortrace plugin. Here's the list of installed plugins:
@doobry-systemli we test rate limiting etc. when load testing so we think the value is okay but you can try to increase the value to see if it remedies your problem. I'd like to know if something awkward is going on IE improper config of reverse proxy where the correct IP address is not being shown to the limiter. Alternatively if all of your users come from behind a proxy / gateway where they are natt'd it could rate limit them too.... IE schools...
@JohnMcLear thanks! We indeed run Etherpad behind a Apache2 Reverse Proxy.
For now we didn't get further reports regarding this issue and we also failed to reproduce the issue ouselves after hitting it once. It's indeed possible that the situations where people hit the rate limit where due to people using a pad from the same source IP.
Is the IP-based rate-limiting per pad or instance-wide? In other words, can I hit the limit when others from same source IP are connected to other pads on the same instance?
instance wide.
Confirm that this is a persistent problem for us - a pad used by 2 people - intermittently in different locations. Absolutely normal use. This is a new problem for us.
@RuthCatlow can you please increase the max
value in settings.json under section importExportRateLimiting
and let ujs know if this fixes it for you.
The number of people on the pad doesn't matter so much. By normal use, it's really hard for us to define "normal use" as normal use between one of our users is totally different from others. But I assume by normal you mean Etherpad without plugins?
Sorry John, I should have been clearer I am using a free service at https://pad.riseup.net/
@RuthCatlow then please talk to the riseup team :+1: )
I had no time to get CI for this in, yet. Our current tests with loadtest and import/export basically show that if you increase the value everything is fine.
@JohnMcLear
importExportRateLimiting
is probably fine because it's directly used as express middleware and thus express uses the right IPs. I did not verify that either, though.
RE this issue: The reason I could not test with etherpad-cli-client was, that I was using the disconnect
-event and not listening for messages like {disconnect: 'blabla'}
.
I think the patch in https://github.com/ether/etherpad-lite/compare/fix-ratelimit is correct and fixes the issue of using the reverse proxy's IP. However, it won't disconnect every client that shares the IP, but only the client that sent the message that exceeded the limit. We need to iterate over all the clients or introduce some byIPAddress
-lookup to get this right.
While I think the rate limit for import/export is fine, because those are definitely operations that consume resources, I don't think that rate limiting changesets is the right thing to do. Etherpad is very fast at processing them and we have no way to assess the impact of a changeset (like the no. of attributes) in terms of resource consumption.
imo 2 options:
Will report back in an hour if I had luck to implement the test coverage
@webzwo0i setting loadTest
to true
disables rate limiter.
Lemme know RE fix-ratelimit PR et al.
While I think the rate limit for import/export is fine, because those are definitely operations that consume resources, I don't think that rate limiting changesets is the right thing to do. Etherpad is very fast at processing them and we have no way to assess the impact of a changeset (like the no. of attributes) in terms of resource consumption.
Limiting rate limits to import/export sounds like a sensible thing to do :+1:
Rate limiting changesets is a method to stop a bad actor committing too much damage. It's required imho. I can explain in private over email Webzzzzz if you want?
@webzwo0i setting
loadTest
totrue
disables rate limiter.
WARNING: Setting loadTest
to true
makes it possible for anyone to bypass authentication and authorization checks.
Rate limiting changesets is a method to stop a bad actor committing too much damage. It's required imho.
Agreed.
However, it won't disconnect every client that shares the IP, but only the client that sent the message that exceeded the limit.
I think that's OK. The goal should be to keep the system from buckling under load, not to stop all malicious traffic. Disconnecting a random user each time the limit is exceeded meets that goal.
I don't think that rate limiting changesets is the right thing to do. Etherpad is very fast at processing them
Cheap != free. Early load shedding is critical to providing a service (in general, not just Etherpad) that is reliable when scaled to huge numbers (e.g., millions of concurrent users). Without early load shedding you can experience sudden global outages: One server crashes because the container runs out of memory due to heavy load, then the load that was previously handled by that server shifts to other servers, causing them to OOM. These cascading failures continue until all processes are in an OOM crash loop. Recovering from this is not easy. The longer the process takes to restart the worse the outage is.
Shedding load can (and should) be done before the packets even reach Etherpad (e.g., at the load balancer), but having some load shedding functionality in Etherpad itself has some benefits:
and we have no way to assess the impact of a changeset (like the no. of attributes) in terms of resource consumption.
Are you talking about cost-based rate limiting? (In other words, rate limit according to the amount of CPU, memory, bandwidth, disk I/O, etc. a particular changeset will incur, instead of X changesets per second.) If so, I don't think that's necessary. Thanks to the law of large numbers, if a single server can handle lots of changesets per second relative to the cost variance, then there won't be much variance in resource consumption when limiting by changesets per second instead of cost.
(For simplicity, I'm going to refer to HTTP requests per second plus socket.io messages per second as QPS.)
I think there should be two or three layers of rate limiting:
The server-wide limit can be replaced with a resource-based limit (e.g., the server is considered to be overloaded if it is using more than x% CPU or more than x MiB RAM), but that's probably more difficult to implement in a way that works across all systems (Linux, *BSD, illumos, Windows, etc.).
I think the patch in https://github.com/ether/etherpad-lite/compare/fix-ratelimit is correct and fixes the issue of using the reverse proxy's IP.
This branch doesn't have any commits in it, did it get removed?
@doobry-systemli: did any suggestion work for you? we had to rollback to 1.8.4 as it made our instance impossible to use.
Does increasing commitRateLimiting.points
in settings.json
not work?
@rhansen thanks for your explanation! I must think about this a little bit more when I'm rested. :-) Is my understanding right, that you vote for not kicking everyone that shares the same IP, when a rate limit is reached? In this case I think the patch is ready.
@JohnMcLear In case you have some more input to add to @rhansen's post I'd welcome it.
https://travis-ci.org/github/ether/etherpad-lite/builds/731127336 has a docker setup with a reverse proxy and checks if etherpad takes the real IP or the proxy's address. https://travis-ci.org/github/ether/etherpad-lite/builds/731127483 has the fix applied.
@micah @disturbio @doobry-systemli https://github.com/ether/etherpad-lite/pull/4373 has the PR. Can you test please?
@doobry-systemli: did any suggestion work for you? we had to rollback to 1.8.4 as it made our instance impossible to use.
We just raised commitRateLimiting.points
from 10
to 100
. According to the metrics, not too many users ran into the rate limit before anyway, but we have to investigate a bit further. For the moment I hope that the raised limit works as a workaround :wink:
Is my understanding right, that you vote for not kicking everyone that shares the same IP, when a rate limit is reached?
Yes, your understanding is right.
@rhansen no, it didn't work with our current users (increased adding several zeros at the end).
@webzwo0i will try it later today or tomorrow. thanks for your help with this!
@webzwo0i seems to be working \o/
Can we close this now? @rhansen @webzwo0i
Removing ep_cursortrace solved this problem for me (Etherpad 1.8.6 behind an Apache reverse proxy).
Describe the bug In some occasions (certainly not when typing too fast), a rate limit is detected and etherpad disconnects the client connection. This didn't happen with 1.8.4 and only started showing up after we upgraded to 1.8.6. Several independend users reported the same issue.
To Reproduce No real reproducer found yet. It's not related to the number of people connected to the pad, and not related to the typing velocity. Maybe related to the number of connections per client IP address?
Expected behavior Rate limiting should not hit when pads are used in "a normal way"™
Desktop (please complete the following information)
Additional context The code that is triggered can be found here.
EDIT: The regression was introduced in 1.8.4, not in 1.8.5. We upgraded from 1.8.4 to 1.8.6.