For Harvard Dataverse we have been talking about investigating rate limiting solutions offered by AWS and I just pushed b1b703a to mention the new "Rate-Based Rules" offering that's part of AWS WAF (Web Application Firewall). This blog post provides a good overview: https://aws.amazon.com/blogs/aws/protect-web-sites-services-using-rate-based-rules-for-aws-waf/
At standup this morning I asked whether there is any specific technical plan or approach, and the feedback was that we are fine with an AWS-specific solution for now. So I went ahead and made pull request IQSS/dataverse#4693 based on the commit I mentioned above and moved this issue to code review at https://waffle.io/IQSS/dataverse
Thanks @pdurbin. I re-titled the issue to reflect that this is AWS-specific. We'll want a general solution at some point, but I think it's good to get this small chunk tested and out in a release.
Talked after standup this morning. The approach is good, but we need some boxes from LTS to test. @landreev will look into this with LTS. @djbrooke will get involved if a credit card or something is needed :)
I sent a note to LTS:
We've been thinking about creating a test AWS setup that would mimic the production environment, for testing new releases before they go into production.
Something like a low-power instance or two. And we specifically want it to sit behind an ELB - in order to be able to test load-balancing and rate-limiting mechanisms. (This is one thing we have no way of testing as of now).
Is this something you could help us set up, or would you recommend that we just set it up on our own?
(by the time I hit send I kinda felt like I was maybe pushing it with them... well, if that's the case they'll tell us to do it ourselves and we will. but I figured I'd ask)
Meeting with LTS on Wednesday, will discuss.
@matthew-a-dunlap The test cluster is made up of 2 app nodes: dvn-cloud-dev-1.lts.harvard.edu and dvn-cloud-dev-2.lts.harvard.edu. I created a shell account for you, with the username mdunlap and sudo powers. I'll Slack the password to you. The ELB for the cluster is https://dvn-dev.lts.harvard.edu/
Both nodes are using the database on dvn-cloud-dev-1.
This story is mostly blocked until we hear back from LTS about access to the web console; so far they have only provided us access to the boxes themselves.
I can do some deeper research into the Web Application Firewall in the meantime.
I just emailed Tom (Scorpa) directly.
Got this from Tom:
I'm assuming access to enable/disable nodes in the ELB? Yes, that is doable, I will ask Sharon for the sign off when she's back next week.
(I suggested to get you two directly in touch too)
I'm starting to investigate more deeply the AWS approach we could take for rate limiting. My understanding is that our current production setup is an Elastic Load Balancer (ELB) and a few nodes.
The Web Application Firewall (WAF) technology I was looking into seems not to be available for that setup: WAF can only be attached to a CloudFront distribution or an Application Load Balancer (ALB), not a Classic ELB. To use it, we would need to either:

1. Switch our load balancer from the Classic ELB to an ALB, or
2. Put a CloudFront distribution in front of the site.
Of the two options, option 1 seems preferable, as it is a more basic change and more in line with our needs. That said, both would require coordination with LTS, and I am unsure whether either is viable given broader Harvard infrastructure norms. It is also possible that there is already a WAF in place in front of Harvard Dataverse that we could ask to add rules to.
To be honest I don't have that great of an understanding of our production setup and haven't been in direct communication with LTS, so I could use more input. Maybe we should avoid an AWS solution altogether and instead do something on each EC2 instance (e.g. on each box)?
cc: @landreev @kcondon @scolapasta @djbrooke
If we don't get to it sooner, this would be a good topic for our tech hour.
During tech hours @scolapasta mentioned a doc that @matthew-a-dunlap had created during a period of instability; it has some good notes and is called "Dataverse Production Needs": https://docs.google.com/document/d/1Ml7SQNrWU-28p7ceLpngrYlVUaIovkj2nW_CEdTit4Y/edit?usp=sharing
Need to revisit an application level solution that's not AWS only. Moving to backlog.
From the GitHub Search API link I posted at the top of this ticket:
Rate limit: The Search API has a custom rate limit. For requests using Basic Authentication, OAuth, or client ID and secret, you can make up to 30 requests per minute. For unauthenticated requests, the rate limit allows you to make up to 10 requests per minute.
See the rate limit documentation for details on determining your current rate limit status.
Timeouts and incomplete results: To keep the Search API fast for everyone, we limit how long any individual query can run. For queries that exceed the time limit, the API returns the matches that were already found prior to the timeout, and the response has the incomplete_results property set to true.
Reaching a timeout does not necessarily mean that search results are incomplete. More results might have been found, but also might not.
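For reference, GitHub reports a client's current standing in response headers and via a dedicated endpoint, which is the kind of signaling we might want to emulate. A quick way to see it in action (not Dataverse-specific):

```
# The search endpoints return X-RateLimit-* headers on every response
curl -s -i "https://api.github.com/search/repositories?q=dataverse" | grep -i '^x-ratelimit'

# There is also a dedicated endpoint for the current rate limit status
curl -s https://api.github.com/rate_limit
```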
I think stopping it at the Apache level would be great, but I'm not sure whether we would also need a Dataverse-side rate-limit-by-session solution and how that would be passed to Apache. Maybe there are out-of-the-box Apache options, or bolt-on front-end solutions such as content-aware load balancers. I know, I know, that's what AWS was.
@kcondon posted some research resources in Slack and they're probably worth sharing here too.
@donsizemore, @qqmyers, @poikilotherm, @pameyer -- any thoughts? (Also Kevin's idea to reach out to community developers. :dataverseman:)
Also, https://github.com/jzdziarski/mod_evasive and https://coderwall.com/p/eouy3g/using-mod_evasive-to-rate-limit-apache as a more general DoS approach.
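For reference, here is roughly what a mod_evasive configuration could look like; the directive names are mod_evasive's own, the module file name varies by distro/build, and the thresholds below are purely illustrative and would need tuning for our traffic:

```
# Module file name differs between packages (mod_evasive20.so vs mod_evasive24.so)
LoadModule evasive20_module modules/mod_evasive24.so

<IfModule mod_evasive24.c>
    # max requests for the same URI per DOSPageInterval (seconds)
    DOSPageCount        5
    DOSPageInterval     1
    # max total requests from one client per DOSSiteInterval (seconds)
    DOSSiteCount        100
    DOSSiteInterval     1
    # how long (seconds) a blocked client keeps getting 403 responses
    DOSBlockingPeriod   60
    DOSHashTableSize    3097
</IfModule>
```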
Started with this: https://stackoverflow.com/questions/131681/how-can-i-implement-rate-limiting-with-apache-requests-per-second
Hi all, just wanted to share with the devs that we're interested in this feature.
We're especially interested in rate limiting on the /api/access/datafiles/{id} endpoint. With a large number of GET requests from multiple IP addresses (each sending multiple GET requests) within a short time frame, the site can slow down significantly. Let me know if you would like clarification on this case study.
Also referencing the related Google Groups post here: https://groups.google.com/g/dataverse-community/c/qCUw2uZ1feE/m/vdCEQgigAQAJ.
The tech team will discuss and bring a well scoped issue to a future planning meeting.
@scolapasta - when you pick this up for discussion, one thing that @landreev mentioned is that it may be a good idea to check the number of locks a person has - for example a person can start a bunch of publishing requests, and the individual datasets are locked but what's to stop them from firing several thousand requests in parallel?
Since last week we have been experiencing a lot of problems with large numbers of requests to the '/api/access/datafiles/{id}' endpoint. These download requests are probably not malicious, but of course we don't know for sure.
Besides thinking about using something like mod_evasive, we also looked into our Payara configuration, which might be tuned to give better performance. This blog post https://blog.payara.fish/fine-tuning-payara-server-5-in-production contains very useful information, but I was wondering whether there are Dataverse-specific tips available in the guides, or maybe they should be added?
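As a concrete example of the kind of tuning that blog post covers, a sketch with asadmin; this assumes the JDBC pool created by the Dataverse installer is named dvnDbPool, and the numbers are illustrative, not recommendations:

```
# Cap the HTTP worker thread pool so bursts of download requests can't exhaust the server
./asadmin set server-config.thread-pools.thread-pool.http-thread-pool.max-thread-pool-size=100

# Keep the JDBC pool sized to what the database can actually handle
./asadmin set resources.jdbc-connection-pool.dvnDbPool.max-pool-size=64
```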
Prio meeting with Stefano.
Top priority for upcoming sprint
Sizing:
This came up yet again recently. The reason it hasn't gone anywhere in 8 years is that it's way too fat, an elephant-sized issue that's too broadly defined. We've gone through this cycle quite a few times - talking about it during tech hours, giving it to somebody to research and investigate, etc. But it's hard to even talk about when we define it like this, as wanting to "throttle everything", the full spectrum of our incoming traffic - it's not even clear where to start. Our traffic is not uniform. It would be easy if we were only serving cat pictures (of roughly the same size) all day long. But our users' requests vary immensely in their impact on the system, plus we have different classes of users, etc. In general, you need to know a lot about the specifics of our application, and this makes adopting existing third-party solutions difficult, to say the least.
What I'm proposing is that instead of trying to revisit this issue as a whole, we should just start chipping away at the problem by addressing specific cases of limiting excessive load that we can define and know how to address. I've proposed some, like detecting and blocking aggressive crawlers (basically what I do by hand occasionally; blocking crawlers may also be one area where some off-the-shelf solution may/should work), or limiting specific expensive activity on the user level (like a limit on how many files / how much data an unprivileged user can upload per hour). Features like this are in fact long overdue. And I'm convinced by now that it would be more productive to just work on them one clearly defined case at a time.
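To make the per-user idea a bit more concrete, here is a rough Java sketch of a fixed-window counter; the class is hypothetical, not existing Dataverse code, and a real implementation would need to be persistent and cluster-aware:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative per-user limiter: allows at most maxActions per user within a
 * fixed time window (e.g. 100 file uploads per hour). Hypothetical class,
 * not part of the Dataverse code base.
 */
public class PerUserActionLimiter {

    private static class Window {
        Instant start = Instant.now();
        int count = 0;
    }

    private final int maxActions;
    private final Duration window;
    private final Map<String, Window> counters = new ConcurrentHashMap<>();

    public PerUserActionLimiter(int maxActions, Duration window) {
        this.maxActions = maxActions;
        this.window = window;
    }

    /** Returns true if the action is allowed for this user, false if they are over the limit. */
    public synchronized boolean tryAcquire(String userIdentifier) {
        Window w = counters.computeIfAbsent(userIdentifier, k -> new Window());
        Instant now = Instant.now();
        if (Duration.between(w.start, now).compareTo(window) > 0) {
            // the window has expired; start a new one
            w.start = now;
            w.count = 0;
        }
        if (w.count >= maxActions) {
            return false;
        }
        w.count++;
        return true;
    }
}
```

Something like new PerUserActionLimiter(100, Duration.ofHours(1)) would then be consulted before executing an upload.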
Sprint board review
(I can't wait until some of this is automated)
Sprint board review
There are a few specific areas that have been identified where we can start working immediately. The list below is the first set of such issues catalogued as part of this spike, some old and some brand new.
This new issue has been opened as a followup to the discussion with @siacus and @scolapasta, as a sensible area to focus on:
During recent discussions it was suggested that metering and limiting file uploads should also be handled under this umbrella, since uploads are a very serious part of the overall practical system load, and there seems to be an agreement that this needs to be addressed urgently. Another practical consideration is that file uploads are not handled through the command engine, and therefore will not be subject to limiting by the technology described in 1. above. A few issues have been opened for storage quotas and limits over the years. There is some overlap between them.
Add Apache-level solution for detecting bot/scripted or otherwise automated crawling, before it gets to the application:
Per feedback from @qqmyers, I'll run some quick practical analysis on the ActionLogRecord data in production, to see if any obvious results can be derived from it immediately, smoking guns/worst offenders, etc.
Actually, I'll add any useful stats from the prod. ActionLogRecord to the "command engine" issue (#9356).
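For the record, this is the kind of "worst offenders" query I have in mind; the column names are assumed from the ActionLogRecord entity and may need adjusting:

```sql
SELECT useridentifier, actionsubtype, count(*) AS n
  FROM actionlogrecord
 WHERE starttime > now() - interval '24 hours'
 GROUP BY useridentifier, actionsubtype
 ORDER BY n DESC
 LIMIT 20;
```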
Reviewed the new issues added - I think they look good and represent what we can first get done, in order to help with rate limiting. There may well be more to do after those, but let's get them working (I've gone ahead and added them to the Dataverse Dev column in the backlog board) and we can revisit after, as needed.
Grooming:
grooming:
Closing this, now that we have https://github.com/IQSS/dataverse/pull/10211 in progress.
@scolapasta Are you sure you wanted to close this one? Note that this spike was for being able to limit everything across the application, with the idea, I think, that more than one solution may be needed in parallel for different parts of the application. IQSS/dataverse#10211 and the corresponding issue are specifically for the Command Engine only.
I can see how an argument can be made that if there is anything potentially expensive that we want to ration that is done bypassing the command system, then it could potentially be addressed by creating dedicated commands for all such things... But I still think that would need to be discussed, to make sure we're not missing anything.
@landreev If there are other areas that we do need to ration, outside of the command system, then I'd vote for creating more specific, actionable issues for them. This one was in the dm-project, and I do think we've made plenty of headway on different aspects; I think that accomplished the goal of "creat[ing] a list of actionable issues to start the effort". But if you feel otherwise and think there's something more we can do for this one specifically, that's fine too.
This ticket is a placeholder for general API rate and access limiting logic to better control the load placed on the service and provide options in case of system instability.
Rate limiting was mentioned during Search API testing, and the GitHub Search API uses this concept too: https://developer.github.com/v3/search/
Limiting access might involve varying degrees of options: a general API access on/off switch, per-API switches, and/or a whitelist/blacklist of IP addresses or users. The last might be integrated with groups and permissions.
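As a very rough illustration of the per-API switch plus IP blocklist idea, a servlet-filter-style check could look something like the sketch below; the class, endpoint paths, and addresses are hypothetical, not existing Dataverse code:

```java
import java.io.IOException;
import java.util.Set;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/**
 * Illustrative filter combining a global API on/off switch, per-endpoint
 * switches, and an IP blocklist. Hypothetical code, not part of Dataverse;
 * in practice these values would come from settings, not hard-coded sets.
 */
public class ApiAccessFilter implements Filter {

    private volatile boolean apiEnabled = true;
    private final Set<String> disabledEndpoints = Set.of("/api/access/datafiles");
    private final Set<String> blockedIps = Set.of("198.51.100.7");

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;
        String path = request.getRequestURI();

        boolean blocked = !apiEnabled
                || blockedIps.contains(request.getRemoteAddr())
                || disabledEndpoints.stream().anyMatch(path::startsWith);

        if (blocked) {
            // 429 Too Many Requests (javax.servlet has no named constant for it)
            response.sendError(429, "API access is currently limited.");
            return;
        }
        chain.doFilter(req, res);
    }
}
```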
Update: additional terms for this: