JustinBeckwith / linkinator

🐿 Scurry around your site and find all those broken links.
MIT License

feature request - address the 429 "too many requests" #179

Closed · revelt closed this issue 3 years ago

revelt commented 4 years ago

Sites like Wikipedia throttle incoming requests, yielding a 429 error:

(screenshot: a 429 "Too Many Requests" response, 2020-09-12)

It may or may not be a broken link.

Idea: what if we improved the algorithm to slow down for domains that throttle with 429, and tackled all the 429'd links in a separate, second round, per domain but at a slower rate?

Imagine our outgoing requests go out as normal, respecting the --concurrency value, but when the crawl completes, we extract all the 429 errors, group them by throttling domain, wait a bit, then slowly retry each link at, say, 1 request per second (or slower), running the throttling domains concurrently but pacing the requests within each one.
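Roughly, in TypeScript, that second round could look something like the sketch below; retryThrottled, FailedLink, and the check callback are made-up names for illustration (not linkinator internals), and the 1-second pacing is just a placeholder default.

```ts
// Sketch of the proposed second round: collect the 429'd URLs, group them
// by hostname, then retry each host's links one at a time with a delay,
// while different hosts are retried in parallel.

interface FailedLink {
  url: string;
  status: number;
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function retryThrottled(
  failures: FailedLink[],
  check: (url: string) => Promise<number>, // returns the HTTP status
  delayMs = 1000 // ~1 request per second per throttling domain
): Promise<Map<string, number>> {
  // Group only the 429 responses by hostname.
  const byHost = new Map<string, string[]>();
  for (const f of failures) {
    if (f.status !== 429) continue;
    const host = new URL(f.url).hostname;
    byHost.set(host, [...(byHost.get(host) ?? []), f.url]);
  }
  // Each throttling domain gets its own slow, sequential lane;
  // the lanes themselves run concurrently.
  const results = new Map<string, number>();
  await Promise.all(
    [...byHost.values()].map(async urls => {
      for (const url of urls) {
        results.set(url, await check(url));
        await sleep(delayMs);
      }
    })
  );
  return results;
}
```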

Currently...

For example, I've got 1042 links and 11 of them link to Wikipedia. If I set --concurrency to satisfy Wikipedia, let's say 2 seconds per request, it will take 1042 × 2 / 60 ≈ 34 minutes, which is unbearable considering it's all for 1% of the links!

If we implemented the feature, it would be 1031 × 0.01 + 11 × 2 ≈ 32 seconds. Reasonable, considering the current 100 req/sec throttle takes 1042 × 0.01 ≈ 10 seconds.

I can exclude Wikipedia via --skip, but we can automate this, can't we?

What do you think?

JustinBeckwith commented 4 years ago

It's a really good point - right now the crawler is a tad aggressive :) Another potential idea on how to handle this one - I suspect most services that return an HTTP 429 may also return a Retry-After header: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After

When that header is detected, we could add the request to a queue that is specific to that subdomain, and then drain it in accordance with the retry guidance coming back in the responses. It sounds like a lot of fun to build :)
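For what it's worth, here's a minimal sketch of turning Retry-After into per-host retry timing; retryAfterMs, scheduleRetry, and the hostReadyAt map are hypothetical helpers, not existing linkinator code. The header can carry either a delay in seconds or an HTTP date (per the MDN page above).

```ts
// Sketch: convert a Retry-After header into a delay in milliseconds and
// track, per host, the earliest time the next request should be sent.
function retryAfterMs(header: string | null, fallbackMs = 1000): number {
  if (!header) return fallbackMs;
  const seconds = Number(header);
  if (!Number.isNaN(seconds)) return Math.max(0, seconds * 1000);
  const date = Date.parse(header); // HTTP-date form of the header
  return Number.isNaN(date) ? fallbackMs : Math.max(0, date - Date.now());
}

// Hypothetical bookkeeping: per-host "don't retry before" timestamps.
const hostReadyAt = new Map<string, number>();

function scheduleRetry(url: string, retryAfterHeader: string | null): number {
  const host = new URL(url).hostname;
  const notBefore = Date.now() + retryAfterMs(retryAfterHeader);
  hostReadyAt.set(host, Math.max(hostReadyAt.get(host) ?? 0, notBefore));
  return hostReadyAt.get(host)!;
}
```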

jvalkeal commented 3 years ago

One simple idea might be to use https://github.com/sindresorhus/p-throttle from the same author as p-queue, then allow configuring the throttling level per domain. This would probably be an elegant solution for the case where your site has a small number of links to a rate-limiting domain.
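For reference, p-throttle wraps an async function with a {limit, interval} budget. A sketch of wiring that up per domain might look like the following; the limits table, the checkUrl helper (which assumes Node 18+ global fetch), and the 1 req/sec setting are illustrative assumptions rather than anything linkinator ships:

```ts
// Sketch: per-domain throttling with p-throttle. Hosts listed in `limits`
// get their own rate budget; everything else runs unthrottled.
import pThrottle from 'p-throttle';

// Hypothetical per-domain configuration (requests per interval in ms).
const limits: Record<string, {limit: number; interval: number}> = {
  'en.wikipedia.org': {limit: 1, interval: 1000},
};

// Placeholder link check: HEAD the URL and report the status code.
async function checkUrl(url: string): Promise<number> {
  const res = await fetch(url, {method: 'HEAD'});
  return res.status;
}

const throttledByHost = new Map<string, (url: string) => Promise<number>>();

function throttledCheck(url: string): Promise<number> {
  const host = new URL(url).hostname;
  const config = limits[host];
  if (!config) return checkUrl(url); // no configured limit: full speed
  if (!throttledByHost.has(host)) {
    throttledByHost.set(host, pThrottle(config)(checkUrl));
  }
  return throttledByHost.get(host)!(url);
}
```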

case commented 3 years ago

+1 for this -- 429s are currently failing our GH Actions job, when in reality 429 is an acceptable response here.

JustinBeckwith commented 3 years ago

Huh, so that's an interesting thought I hadn't considered. I was originally thinking of stuffing 429s in a retry queue of sorts, trying to respect Retry-After, etc....

I did some testing, and the news is bad. After you start to drown GitHub in requests, it will return 429s both for resources that exist and for ones that don't 🐸 I was hoping that a 429 would mean the resource actually exists, but it turns out they throw that upstream.

case commented 3 years ago

Yeah at this point, Sean Bean will probably tell me:

(meme image)

Our use-case here started out simple enough: "Let's make sure our website doesn't have any broken links!" …but obviously it gets nuanced once we start down the rabbit hole.

Thanks again for your help with this!

revelt commented 3 years ago

PS: GitHub / npm / Wikipedia link checks do work on the latest version; I'm using linkinator ./dist --recurse --concurrency 1. If anybody is still having 429 problems, limit the concurrency to one. Thank you Justin! 👍