Spike: Revisit the issue of adding rate limiting logic to the application, create a list of actionable issues to start the effort.

kcondon commented 9 years ago

This ticket is a placeholder for general API rate and access limiting logic to better control the load placed on the service and provide options in case of system instability.

Rate limiting was mentioned during search api testing and github search api uses this concept too: https://developer.github.com/v3/search/

Limiting access might involve varying degrees of options: general api access on/off switch, per api, and/or whitelist/blacklist of ip addresses/ users. The last might be integrated with groups and permissions.

Update: additional terms for this:

API throttling

pdurbin commented 6 years ago

For Harvard Dataverse we have been talking about investigating rate limiting solutions offered by AWS and I just pushed b1b703a to mention the new "Rate-Based Rules" offering that's part of AWS WAF (Web Application Firewall). This blog post provides a good overview: https://aws.amazon.com/blogs/aws/protect-web-sites-services-using-rate-based-rules-for-aws-waf/

pdurbin commented 6 years ago

At standup this morning I inquired if there is any specific technical plan or approach and got feedback that we are fine with an AWS-specific solution for now so I went ahead and made pull request IQSS/dataverse#4693 based on the commit I mentioned above and moved this issue to code review at https://waffle.io/IQSS/dataverse

djbrooke commented 6 years ago

Thanks @pdurbin. I re-titled the issue to reflect that this is AWS-specific. We'll want a general solution at some point, but I think it's good to get this small chunk tested and out in a release.

djbrooke commented 6 years ago

Talked after standup this morning. The approach is good, but we need some boxes from LTS to test. @landreev will look into this with LTS. @djbrooke will get involved if a credit card or something is needed :)

landreev commented 6 years ago

I sent a note to LTS:

We've been thinking about creating a test AWS setup that would mock the production; for testing new releases before they go into production.

Something like a low-power instance or two. And we specifically want it to sit behind an ELB - in order to be able to test load-balancing and rate-limiting mechanisms. (This is one thing we have no way of testing as of now).

Is this something you could help us setting up, or would you recommend that we just set it up on our own?

(by the time I hit send I kinda felt like I was maybe pushing it with them... well, if that's the case they'll tell us to do it ourselves and we will. but I figured I'd ask)

djbrooke commented 6 years ago

Meeting with LTS on Wednesday, will discuss.

landreev commented 6 years ago

@matthew-a-dunlap The test cluster is made up of 2 app nodes: dvn-cloud-dev-1.lts.harvard.edu dvn-cloud-dev-2.lts.harvard.edu I created a shell account for you, with the username mdunlap an sudo powers. I'll slack the password to you. The elb for the cluster is https://dvn-dev.lts.harvard.edu/

Both nodes are using the database on dvn-cloud-dev-1.

matthew-a-dunlap commented 6 years ago

This story is mostly blocked until we hear back from LTS about access to the web console, they only provided us access to the boxes themselves.

I'll can do some deeper research into web application firewall in the meantime.

landreev commented 6 years ago

I just emailed Tom (Scorpa) directly.

landreev commented 6 years ago

Got this from Tom:

I'm assuming access to enable/disable nodes in the ELB? Yes, that is doable, I will ask Sharon for the sign off when she's back next week.

(I suggested to get you two directly in touch too)

matthew-a-dunlap commented 6 years ago

I'm starting to investigate more deeply the AWS approach we could take for rate limiting. My understanding is that our current production setup is an Elastic Load Balancer (ELB) and a few nodes.

The Web Application Firewall (WAF) technology I was looking into using seems to not be available for that setup. To use it, we would need to either:

Switch from the older basic ELB technology to the newer Application Load Balancer (ALB) https://www.sumologic.com/blog/amazon-web-services/aws-elb-alb/
Put our current ELB behind CloudFront and put the WAF on CloudFront (https://www.cdnplanet.com/cdns/cloudfront/). CloudFront would be overkill for our architecture but we may be able to use it simply.

Of the two options, # 1 seems preferable as it is a more basic change and more in line with our needs. That being said, both would require coordination with LTS and I am unsure whether both are viable from the greater Harvard infrastructure norms. It is also possible that there is already a WAF in place in front of Harvard Dataverse that we could ask to add rules too.

To be honest I don't have that great of an understanding of our production setup and haven't been in direct communication with LTS, so I could use more input. Maybe we should avoid an AWS solution altogether and instead do something on each EC2 instance (e.g. on each box)?

cc: @landreev @kcondon @scolapasta @djbrooke

matthew-a-dunlap commented 6 years ago

If we don't get to it sooner, this would be a good topic for our tech hour.

pdurbin commented 6 years ago

During tech hours @scolapasta mentioned a doc that @matthew-a-dunlap had created during a period of instability that has some good notes and it's called "Dataverse Production Needs": https://docs.google.com/document/d/1Ml7SQNrWU-28p7ceLpngrYlVUaIovkj2nW_CEdTit4Y/edit?usp=sharing

djbrooke commented 6 years ago

Need to revisit an application level solution that's not AWS only. Moving to backlog.

kcondon commented 6 years ago

From the git hub search api link I posted at the top of this ticket:

Rate limit The Search API has a custom rate limit. For requests using Basic Authentication, OAuth, or client ID and secret, you can make up to 30 requests per minute. For unauthenticated requests, the rate limit allows you to make up to 10 requests per minute.

See the rate limit documentation for details on determining your current rate limit status.

Timeouts and incomplete results To keep the Search API fast for everyone, we limit how long any individual query can run. For queries that exceed the time limit, the API returns the matches that were already found prior to the timeout, and the response has the incomplete_results property set to true.

Reaching a timeout does not necessarily mean that search results are incomplete. More results might have been found, but also might not.

kcondon commented 6 years ago

I think stopping it at the apache level would be great but not sure if we need to implement a Dataverse-side rate-limit-by-session solution and how that would be passed to Apache. Maybe there are out of the box Apache or bolt on front end solutions in content aware load balancers for instance. I know, I know, that's what AWS was.

djbrooke commented 5 years ago

Retitled to reflect a general solution for rate limiting, not one focused just on APIs and not one that's AWS-only
Investigate how to implement limiting when there are many duplicate calls to the same resource (page or same API call) and/or from specific IPs. We could add a filter and add some structure in the DB for tracking this.
@landreev's usual investigations involve the server log and not the Apache logs
Include some messaging about contacting the support address for the application
We could investigate cutting off things at the Apache level (or load balancer) so they don't start to drag down Glassfish

mheppler commented 5 years ago

@kcondon posted some research resources in Slack and they're probably worth sharing here too.

@donsizemore, @qqmyers, @poikilotherm, @pameyer -- any thoughts? (Also Kevin's idea to reach out to community developers. :dataverseman:)

kcondon commented 5 years ago

Also, https://github.com/jzdziarski/mod_evasive and https://coderwall.com/p/eouy3g/using-mod_evasive-to-rate-limit-apache as more general DoS approach.

Started with this: https://stackoverflow.com/questions/131681/how-can-i-implement-rate-limiting-with-apache-requests-per-second

eunices commented 3 years ago

Hi all, just wanted to share with the devs that we're interested in this feature.

We're especially interested in rate-limiting on the /api/access/datafiles/{id} endpoint. With a large number of GET requests from multiple IP addresses (each with multiple GET requests) within a short time frame, it could really slow down the site significantly. Let me know if you would like clarification on this case study.

Also referencing the related Google Groups post here: https://groups.google.com/g/dataverse-community/c/qCUw2uZ1feE/m/vdCEQgigAQAJ.

djbrooke commented 3 years ago

The tech team will discuss and bring a well scoped issue to a future planning meeting.

djbrooke commented 3 years ago

@scolapasta - when you pick this up for discussion, one thing that @landreev mentioned is that it may be a good idea to check the number of locks a person has - for example a person can start a bunch of publishing requests, and the individual datasets are locked but what's to stop them from firing several thousand requests in parallel?

PaulBoon commented 2 years ago

Since last week we are experiencing lots of problems with lots of request to the '/api/access/datafiles/{id}' endpoint. These download request are probably not malicious, but we don't know for sure of course.

Besides thinking about using something like mod_evasive, we also looked into our payara configuration, which might be tuned to give better performance. This blog https://blog.payara.fish/fine-tuning-payara-server-5-in-production contains most useful information, but I was wondering if there are some Dataverse specific tips available in the guides, or maybe it should be added?

mreekie commented 1 year ago

Prio meeting with Stefano.

Moved from Dataverse Team Backlog to ordered backlog

mreekie commented 1 year ago

Top priority for upcoming sprint

mreekie commented 1 year ago

Sizing:

The first step here is a spike. time limited to a sprint
We don't have a nailed down approach to this though there has been some research and discussion.
As part of the spike, could include a tech hour session.
This could be big enough that it becomes a "deliverable"

landreev commented 1 year ago

This came up yet again, recently. The reason it hasn't gone anywhere in 8 years is that it's way too fat, an elephant-sized issue that's too broadly defined. We've gone through this cycle quite a few times - of talking about it during tech hours, giving it to somebody to research and investigate, etc. But it's hard to even talk about, when we are defining it like this, that we want to "throttle everything", the full spectrum of our incoming traffic - it's not clear where to start even. Our traffic is not uniform. It would be easy if we were only serving cat pictures (of roughly the same size) all day long. But our users' requests vary immensely in their impact on the system, plus we have different classes of users etc. In general, you need to know a lot about the specifics of our application and this makes adopting existing third party solutions difficult at least.

What I'm proposing is that instead of trying to re-visit this issue as a whole, we should just start chipping away at the problem by addressing certain specific cases of limiting excessive load that we can define and know how to address. I've proposed some, like detecting and blocking aggressive crawlers (basically what I do by hand occasionally; also blocking crawlers may be one area where some off the shelf solution may/should work); or limiting specific expensive activity on the user level (like a limit on how many files/data an unprivileged user can upload per hour). Features like this are in fact long overdue. And I'm convinced by now that it would be more productive to just work on them one clearly defined case at a time.

mreekie commented 1 year ago

Sprint board review

I'm pulling this one back off onto the sprint backlog, pending an OK from Stefano.
The reasons are:
- This may need additional discussion
- I added an extra 69 points to this sprint over the team estimate of 400.

(I can't wait until some of this is automated)

mreekie commented 1 year ago

Sprint board review

Resized as a time bounded spike.
Size 33.
Added back on the sprint.

landreev commented 1 year ago

There are few specific areas that have been identified where we can start working immediately. The list below is the first set of such issues catalogued as part of this spike, some old and some brand-new.

This new issue has been opened as a followup to the discussion with @siacus and @scolapasta, as a sensible area to focus on:
- IQSS/dataverse#9356
During recent discussions it was suggested that metering and limiting file uploads should also be handled under this umbrella, since uploads are a very serious part of the overall practical system load, and there seems to be an agreement that this needs to be addressed urgently. Another practical consideration is that file uploads are not handled through the command engine, and therefore will not be subject to limiting by the technology described in 1. above. A few issues have been opened for storage quotas and limits over the years. There is some overlap between them.
- IQSS/dataverse#8549
- IQSS/dataverse#7829
- IQSS/dataverse#4339
- IQSS/dataverse#3939 I added a new issue for adding a generic quota check mechanism to the file creation pipeline that is narrowly defined and could be implemented first, but will then allow us to address the specific quota cases requested in the issues above:
- IQSS/dataverse#9361
Add Apache-level solution for detecting bot/scripted or otherwise automated crawling, before it gets to the application:
- IQSS/dataverse#9359

landreev commented 1 year ago

Per feedback from @qqmyers, I'll run some quick practical analysis on the ActionLogRecord data in production, to see if any obvious results can be derived from it immediately, smoking guns/worst offenders, etc.

landreev commented 1 year ago

Actually, I'll add any useful stats from the prod. ActionLogRecord to the "command engine" issue (#9356).

scolapasta commented 1 year ago

Reviewed the new issues added - I think they look good and represent what we can first get done, in order to help with rate limiting. There may well be more to do after those, but let's get them working (I've gone ahead and added them to the Dataverse Dev column in the backlog board) and we can revisit after, as needed.

mreekie commented 1 year ago

Grooming:

This is closed but it is the start of a dev effort to revisit rate limiting issues.
Added a deliverable label:

mreekie commented 1 year ago

grooming:

This needs to be looked at again.
For now I put it in the backlog and added the deliverable label.

scolapasta commented 10 months ago

Closing this, now that we have https://github.com/IQSS/dataverse/pull/10211 in progress.

landreev commented 10 months ago

@scolapasta Are you sure you wanted to close this one? Note that this spike was for being able to limit everything across the application; with the idea, I think, that more than one solution may be need in parallel, for different parts of the application. IQSS/dataverse#10211, and the corresponding issue are specifically for the Command Engine only.

I can see how an argument can be made that if there is anything potentially expensive that we want to ration, that's done bypassing the command system, then it could potentially be addressed by creating dedicated commands for all such things... But I still think that would need to be discussed to make sure we're not missing anything.

scolapasta commented 10 months ago

@landreev If there are other areas that we do need to ration, outside of the command system, then I'd vote for creating more specific actionable issues for it. This one here was in the dm-project and I do think we've made plenty of headway on different aspects and I think that accomplished the goal of "creat[ing] a list of actionable issues to start the effort". But if you feel otherwise and think there's something more we can do for this one specifically, that's fine too.

IQSS / dataverse-pm

Spike: Revisit the issue of adding rate limiting logic to the application, create a list of actionable issues to start the effort. #23