freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Implement throttling on critical views #746

Closed mlissner closed 7 years ago

mlissner commented 7 years ago

Some ass is crawling the hell out of us right now and we need to get them redirected to the API. It looks like they're only crawling the dockets, so I'm going to start by fixing that by using the library here:

https://django-ratelimit.readthedocs.io/en/v1.0.0/index.html

I guess somebody was going to do it. We've got some honeypots in the data that we can use someday to figure out who's the culprit, but in the meantime, this is annoying. We do have an API.

mlissner commented 7 years ago

Reopening. They're using about 5000 IPs that I've logged so far, so blocking by IP address isn't going to work. I didn't expect them to be that crazy.
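A rough sketch of why per-IP throttling falls over here. This is not django-ratelimit's implementation, just a minimal fixed-window limiter in plain Python; the class name, the limit of 10 requests per 60 seconds, and the example addresses (documentation ranges) are all assumptions for illustration.

```python
from collections import defaultdict


class PerIPRateLimiter:
    """Fixed-window limiter: at most `limit` requests per IP per `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        # Maps (ip, window index) -> number of requests seen in that window.
        self.counts = defaultdict(int)

    def allow(self, ip, now):
        bucket = (ip, int(now // self.window))
        self.counts[bucket] += 1
        return self.counts[bucket] <= self.limit


def count_blocked(limiter, requests):
    """Return how many (ip, timestamp) requests the limiter rejects."""
    return sum(1 for ip, ts in requests if not limiter.allow(ip, ts))


# 600 requests within one minute from a single address...
single = [("203.0.113.1", i * 0.1) for i in range(600)]
# ...versus the same 600 requests spread over 600 distinct addresses.
spread = [(f"198.51.{i // 256}.{i % 256}", i * 0.1) for i in range(600)]

blocked_single = count_blocked(PerIPRateLimiter(limit=10, window=60), single)
blocked_spread = count_blocked(PerIPRateLimiter(limit=10, window=60), spread)
```

With one source address, all but the first 10 requests are rejected. With one request per address, nothing is rejected, which matches what the logs show when thousands of IPs each make only a handful of requests.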

anseljh commented 7 years ago

That's... a lot of IPs. Any apparent pattern to them?

mlissner commented 7 years ago

I looked up a few, and got a variety of hosts. I just blocked them all using iptables. That slowed things down, but the bots seem to have found new IPs or there are just more than I imagined at first. I'm going to do another round of blocking and see if it helps. If not, I'll have to automate it somehow. fail2ban?

mlissner commented 7 years ago

Here's a file with the first ~5,000 or so. bad_ips.uniq.txt

anseljh commented 7 years ago

All over the place. Botnet?

[Attached image: 2017-10-31-30ips]

mlissner commented 7 years ago

No, the requests are enumerating the URLs. Here's a sample of the apache log:

178.205.101.112 - - [31/Oct/2017:12:16:45 -0700] "GET /docket/5133274/dowdy-v-pappas/ HTTP/1.1" 200 9212 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
189.5.76.216 - - [31/Oct/2017:12:16:45 -0700] "GET /docket/5133864/dowdy-v-pappas/ HTTP/1.1" 200 8641 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
187.67.57.169 - - [31/Oct/2017:12:16:45 -0700] "GET /docket/5133998/dowdy-v-pappas/ HTTP/1.1" 200 8524 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
179.219.86.248 - - [31/Oct/2017:12:16:45 -0700] "GET /docket/5133695/dowdy-v-pappas/ HTTP/1.1" 200 8975 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
187.23.244.5 - - [31/Oct/2017:12:16:45 -0700] "GET /docket/5133572/dowdy-v-pappas/ HTTP/1.1" 200 9081 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
177.6.127.106 - - [31/Oct/2017:12:16:45 -0700] "GET /docket/5133508/dowdy-v-pappas/ HTTP/1.1" 200 9608 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
84.202.32.240 - - [31/Oct/2017:12:16:45 -0700] "GET /docket/5133413/dowdy-v-pappas/ HTTP/1.1" 200 9490 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
92.103.199.139 - - [31/Oct/2017:12:16:45 -0700] "GET /docket/5133898/dowdy-v-pappas/ HTTP/1.1" 200 8654 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
61.6.139.95 - - [31/Oct/2017:12:16:45 -0700] "GET /docket/5133182/dowdy-v-pappas/ HTTP/1.1" 200 8708 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
189.62.46.16 - - [31/Oct/2017:12:16:46 -0700] "GET /docket/5133980/dowdy-v-pappas/ HTTP/1.1" 200 8641 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
78.38.67.210 - - [31/Oct/2017:12:16:46 -0700] "GET /docket/5133958/dowdy-v-pappas/ HTTP/1.1" 200 8603 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
191.17.255.94 - - [31/Oct/2017:12:16:47 -0700] "GET /docket/5133942/dowdy-v-pappas/ HTTP/1.1" 200 8628 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
177.192.71.184 - - [31/Oct/2017:12:16:47 -0700] "GET /docket/5133892/dowdy-v-pappas/ HTTP/1.1" 200 8871 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
14.162.103.168 - - [31/Oct/2017:12:16:47 -0700] "GET /docket/5133794/dowdy-v-pappas/ HTTP/1.1" 200 8842 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
191.188.8.144 - - [31/Oct/2017:12:16:47 -0700] "GET /docket/5133938/dowdy-v-pappas/ HTTP/1.1" 200 9086 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
187.20.189.199 - - [31/Oct/2017:12:16:47 -0700] "GET /docket/5133979/dowdy-v-pappas/ HTTP/1.1" 200 8796 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
179.155.162.205 - - [31/Oct/2017:12:16:47 -0700] "GET /docket/5130696/dowdy-v-pappas/ HTTP/1.1" 404 20023 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
187.38.142.94 - - [31/Oct/2017:12:16:47 -0700] "GET /docket/5133947/dowdy-v-pappas/ HTTP/1.1" 200 13828 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

They're just enumerating the ID in the docket URLs.
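For anyone wanting to confirm the pattern from their own logs, here is a small sketch that parses combined-format lines like the ones above and summarizes them. The regexes and the `summarize` helper are my own, not anything from the CourtListener codebase, and the three sample lines are copied from the log excerpt.

```python
import re

# Matches the Apache "combined" log format shown in the sample above.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)
DOCKET_PATH = re.compile(r"^/docket/(?P<id>\d+)/(?P<slug>[^/]+)/$")


def summarize(lines):
    """Collect distinct IPs, user agents, slugs, and docket IDs from log lines."""
    ips, agents, slugs, docket_ids = set(), set(), set(), []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m is None:
            continue  # Not a combined-format line.
        d = DOCKET_PATH.match(m.group("path"))
        if d is None:
            continue  # Not a docket URL.
        ips.add(m.group("ip"))
        agents.add(m.group("agent"))
        slugs.add(d.group("slug"))
        docket_ids.append(int(d.group("id")))
    return {"ips": ips, "agents": agents, "slugs": slugs, "docket_ids": docket_ids}


SAMPLE = [
    '178.205.101.112 - - [31/Oct/2017:12:16:45 -0700] "GET /docket/5133274/dowdy-v-pappas/ HTTP/1.1" 200 9212 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"',
    '189.5.76.216 - - [31/Oct/2017:12:16:45 -0700] "GET /docket/5133864/dowdy-v-pappas/ HTTP/1.1" 200 8641 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"',
    '179.155.162.205 - - [31/Oct/2017:12:16:47 -0700] "GET /docket/5130696/dowdy-v-pappas/ HTTP/1.1" 404 20023 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"',
]

summary = summarize(SAMPLE)
```

On the sample, every request comes from a different IP, while the user agent and the slug never change and only the numeric ID varies. That is the signature of ID enumeration through a proxy pool.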

mlissner commented 7 years ago

Another round of 1,000 or so IPs, and no real progress. I guess it's time to try fail2ban or something else. The good news is I almost had jury duty today.

anseljh commented 7 years ago

Right. But you could rent a botnet or "legit" proxy service to do that. I ran your file through MaxMind, and it's 88% residential IPs: batch-request-c0e63314-be73-11e7-906c-3aa0cc0f70a7.zip

mlissner commented 7 years ago

Oh, right. Forgot about rent-a-bot. I had hoped to never have to deal with this. Nice work with MaxMind.

voutilad commented 7 years ago

Oh wow, are all the requests using the same very old user agent string?

If I remember correctly, you've got Apache as the web server in front of Django, right? Maybe you can use the second approach listed here https://serverfault.com/questions/690870/iptables-block-user-agent#690877 to drop based on those (probably garbage) user agents.

mlissner commented 7 years ago

Good call. That'll buy time anyway.

voutilad commented 7 years ago

Yeah that won't stop them from saturating your network link, but at least django and postgresql might stop sweating.

I used some website to look up a possible interpretation of that user agent and it's like 32-bit WinXP Firefox or something ludicrous. Seems it's got a track record for being associated with malcontents: http://blog.thewebsitepeople.org/2014/06/http-ddos-mozilla4-0-compatible-msie-6-0-windows-nt-5-1-sv1/

mlissner commented 7 years ago

OK, this is implemented and seems to be sort of working:

    # Block bad bots
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^Sosospider [OR]
    RewriteCond %{HTTP_USER_AGENT} "^Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1; SV1\)$"
    RewriteRule ^/ - [F]

That eases up postgres, but the traffic is still flying in. I think the site performance is back, so I'm going to stop here and hope that the botnet moves on. If getting endless 403's doesn't bother them, I suppose I'll reopen this and we can come up with something else!

Thanks everybody.

mlissner commented 7 years ago

Just for record-keeping purposes, this hit us about 220k times in the span of four hours, or about 15 hits/s.

anseljh commented 7 years ago

Just going to leave this here... eBay, Inc. v. Bidder's Edge, Inc., 100 F. Supp. 2d 1058 (N.D. Cal. 2000)

mlissner commented 7 years ago

Grand total, 1.2M hits from this yesterday, but I'm happy to say it's moved on as of today.

mlissner commented 7 years ago

And...it's back with a new User-Agent string:

Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)

I've blocked it as well.
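To illustrate why the exact-match rule missed the new string, here is the RewriteCond pattern from above checked as a Python regex, next to a looser variant. The looser pattern is purely hypothetical (it is not what was deployed), and it only helps for as long as the client keeps the rest of the string unchanged.

```python
import re

# The exact-match pattern from the RewriteCond above, as a Python regex.
EXACT = re.compile(r"^Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1; SV1\)$")
# Hypothetical looser pattern: accepts any single-digit Mozilla major version.
LOOSE = re.compile(r"^Mozilla/\d\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1; SV1\)$")

OLD_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
NEW_UA = "Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"


def is_blocked(pattern, user_agent):
    """Return True if the user agent matches the blocking pattern."""
    return pattern.search(user_agent) is not None


exact_results = (is_blocked(EXACT, OLD_UA), is_blocked(EXACT, NEW_UA))
loose_results = (is_blocked(LOOSE, OLD_UA), is_blocked(LOOSE, NEW_UA))
```

The exact pattern catches the old string but not the new one, and the looser pattern catches both. Either way, one changed character on the client side is enough to defeat a User-Agent rule.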

johnhawkinson commented 7 years ago

You're going to run out of tricks in this arms race real soon now, so you better come up with something better... Does redirecting them to something that does not return quickly help to slow it down?

I guess one of the questions is whether this is malicious or somehow a misunderstanding about what is reasonable behavior for automation. If the latter, redirecting to a "PLEASE EMAIL MIKE, YOU ARE DESTROYING OUR SITE" web page might help.

mlissner commented 7 years ago

I think it actually has nothing to do with "us". I think it just looks for URLs that are enumerable and goes nuts.

Right now I'm redirecting it to http://127.0.0.1/ in hopes that would slow them down, but it doesn't seem to do much.

The only other solutions I know of are:

johnhawkinson commented 7 years ago

Well, I don't know about the distributed botnet variety, but many of these things run serially, such that they won't try the next URL until the first URL returns. So if it takes 100 seconds to return rather than 2 seconds to return, you've cut your load by a factor of 50.

Perhaps a 50-fold reduction isn't enough to make a dent in it when there are 1,000 attackers though.
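The arithmetic behind that, as a back-of-the-envelope sketch. It assumes every client is strictly serial (one request in flight, next one sent only after the response arrives), and the figure of 1,000 clients is just the round number from the comment above. The function name is made up for illustration.

```python
def aggregate_request_rate(clients, seconds_per_response):
    """Requests per second produced by `clients` strictly serial clients."""
    return clients / seconds_per_response


fast = aggregate_request_rate(clients=1000, seconds_per_response=2)
slow = aggregate_request_rate(clients=1000, seconds_per_response=100)
reduction = fast / slow
```

Under these assumptions, 1,000 serial clients drop from 500 requests/s to 10 requests/s, the same factor of 50. The model ignores the cost of holding each slow connection open, which matters if the web server has a fixed pool of workers.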

I am skeptical that redirecting to localhost is effective or even "appropriate" (mumble mumble when automated scraping blocks do that to me, it's just annoying. Ban the scrape, ban it with a message, rate limit it, but don't make it collect bad data), but it probably does no harm.

johnhawkinson commented 7 years ago

Also, I assume they're not limited to "dowdy-v-pappas" because of course that's another way to block...

When there's a breather, I guess it might be worth thinking about some of the architectural choices. What if URLs weren't enumerable? What if the string was validated against the docket number? Maybe the former would really help...

mlissner commented 7 years ago

Yeah, I was thinking about using the string + docket number for URLs. It'd break some things, but it might help if we started returning 404's for all of these. Right now, I think it's apache that's suffering, so I'm working to make it happier.

johnhawkinson commented 7 years ago

If apache is suffering, that's a function of connection rate, not of the data served in response (right?). So returning 404s would happen faster, which would mean that the bots would be able to make more connections per unit time which would increase your load. Right?

(There are a lot of assumptions here, and they may well be wrong or backwards, but the point is to observe that returning 404s might not help and might actually hurt). Unless of course you think hitting a 404 will actually make an instance of the bot stop in its tracks.

I would instead explore returning very little data really slowly. And also trying to figure out a more plausible reason for this than that it's just hunting through enumerable URLs.

mlissner commented 7 years ago

Yeah, 404's would be worse because it'd have to check the DB before blocking the botnet, BUT it'd make enumeration no longer possible and maybe the bot would leave.

mlissner commented 7 years ago

Jackass botnet left while I was reconfiguring apache. Now I'll never know if it helped. Still, for posterity, I changed MaxRequestWorkers from 150 to 450 in /etc/apache2/mods-enabled/mpm_worker.conf according to: https://httpd.apache.org/docs/2.4/mod/worker.html

mlissner commented 7 years ago

OK, now they're back with a modern UA string, so that strategy is dead. I'm surprised they didn't do this at the outset. This also indicates that they're actually scraping CourtListener on purpose. Anyway, I'm going to make the URLs un-enumerable. That should do it.
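A minimal sketch of what "un-enumerable" could mean in practice: the ID alone no longer resolves, and the slug in the URL has to match the one stored for that docket. This is not the actual CourtListener change. The `DOCKETS` dict is a hypothetical in-memory stand-in for the database, and the IDs and slugs in it are invented.

```python
# Hypothetical stand-in for the docket table: docket ID -> current slug.
DOCKETS = {
    101: "alpha-v-beta",
    102: "gamma-v-delta",
}


def resolve_docket(docket_id, slug):
    """Return an HTTP-style status for a /docket/<id>/<slug>/ request.

    A request succeeds only when the ID exists AND the slug matches the
    stored one, so reusing one slug across sequential IDs returns 404.
    """
    expected = DOCKETS.get(docket_id)
    if expected is None or expected != slug:
        return 404
    return 200


# A crawler that reuses one slug while incrementing the ID gets one hit at most.
crawl_results = [resolve_docket(docket_id, "alpha-v-beta") for docket_id in (101, 102, 103)]
```

Strict matching has a cost: any URL whose slug no longer matches the stored one also returns 404, and the lookup still hits the data store before the request can be rejected.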

johnhawkinson commented 7 years ago

This non-enumerability breaks historical URLs when parties change. In D.Mass's 1:17-cv-11577-PBS, the lead defendant was terminated/substituted/replaced (because it's a §2241 case and the REAL ID Act...) from Antone Moniz to Steven Souza in D.E. 39 (which is, incidentally, not in RECAP — because of ECF free looks).

The historical URL, which was easily bookmarked, was https://www.courtlistener.com/docket/6159413/rombot-v-moniz/ and now it doesn't work. It's not OK to break bookmarked URLs.

The new URL is https://www.courtlistener.com/docket/6159413/rombot-v-souza/ after the new substitute defendant.

mlissner commented 7 years ago

I created a new ticket for that issue, John, #753. That shouldn't be happening.