Closed: revelt closed this issue 3 years ago.
It's a really good point - right now the crawler is a tad aggressive :) Another potential idea on how to handle this one - I suspect most services that return an HTTP 429 may also return a `Retry-After` header:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After
When that header is detected we could add the request to a queue that is specific to the subdomain, and then drain it in accordance with the retry guidance coming back from results. It sounds like a lot of fun to build :)
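Roughly the shape I have in mind, as a sketch only (this is not actual linkinator code; it assumes Node 18+ global `fetch`, uses HEAD requests for illustration, and `checkUrl` / `drainRetries` are made-up names):

```ts
// Sketch: on a 429, park the URL in a per-host queue along with the delay the
// server asked for, then drain each queue once that delay has elapsed.

type PendingRetry = { url: string; notBefore: number };

const retryQueues = new Map<string, PendingRetry[]>();

// Retry-After may be either a number of seconds or an HTTP date.
function parseRetryAfter(value: string | null): number {
  if (!value) return 1000; // assumed fallback of 1 second
  const seconds = Number(value);
  if (!Number.isNaN(seconds)) return seconds * 1000;
  const date = Date.parse(value);
  return Number.isNaN(date) ? 1000 : Math.max(0, date - Date.now());
}

async function checkUrl(url: string): Promise<number> {
  const res = await fetch(url, { method: 'HEAD' });
  if (res.status === 429) {
    const delayMs = parseRetryAfter(res.headers.get('retry-after'));
    const host = new URL(url).host;
    const queue = retryQueues.get(host) ?? [];
    queue.push({ url, notBefore: Date.now() + delayMs });
    retryQueues.set(host, queue);
  }
  return res.status;
}

// Drain each host's queue no faster than its Retry-After guidance allows.
async function drainRetries(): Promise<void> {
  for (const [host, queue] of retryQueues) {
    for (const { url, notBefore } of queue) {
      const wait = notBefore - Date.now();
      if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
      const status = (await fetch(url, { method: 'HEAD' })).status;
      console.log(`[retry:${host}] ${url} -> ${status}`);
    }
  }
  retryQueues.clear();
}
```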
One simple idea might be to use https://github.com/sindresorhus/p-throttle, from the same author as p-queue, and then allow configuring the throttling level per domain. That would probably be an elegant solution for the case where your site has only a small number of links to a rate-limiting domain.
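For illustration, something along these lines (a sketch, not a worked-out patch; it assumes p-throttle's `pThrottle({limit, interval})` API, and the per-domain limits below are invented):

```ts
import pThrottle from 'p-throttle';

// Hypothetical per-domain limits; anything not listed runs unthrottled.
const domainLimits: Record<string, { limit: number; interval: number }> = {
  'en.wikipedia.org': { limit: 1, interval: 1000 }, // at most 1 request/second
  'github.com': { limit: 2, interval: 1000 },       // at most 2 requests/second
};

// Cache one throttled fetcher per host so each host shares a single rate limit.
const throttledFetchers = new Map<string, (url: string) => Promise<Response>>();

function fetchWithDomainThrottle(url: string): Promise<Response> {
  const host = new URL(url).host;
  const limits = domainLimits[host];
  if (!limits) return fetch(url, { method: 'HEAD' });

  let fetcher = throttledFetchers.get(host);
  if (!fetcher) {
    const throttle = pThrottle(limits);
    fetcher = throttle((u: string) => fetch(u, { method: 'HEAD' }));
    throttledFetchers.set(host, fetcher);
  }
  return fetcher(url);
}
```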
+1 for this -- 429s are currently failing our GH Actions job, when in reality 429 is an acceptable response here.
Huh, so that's an interesting thought I hadn't considered. I was originally thinking of stuffing 429s in a retry queue of sorts, trying to respect `Retry-After`, etc.
I did some testing, and bad news. After you start to drown GitHub in requests, it will return 429s both for resources that exist, and ones that don't 🐸 I was hoping that a 429 would mean the resource actually exists, but it turns out they throw that upstream.
Yeah at this point, Sean Bean will probably tell me:
Our use-case here started out simple enough: "Let's make sure our website doesn't have any broken links!" …but obviously it gets nuanced once we start down the rabbit hole.
Thanks again for your help with this!
PS. GitHub / npm / Wikipedia link checks do work on the latest version; I'm running `linkinator ./dist --recurse --concurrency 1`. If anybody is still having `429` problems, limit the concurrency to one. Thank you Justin! 👍
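If you're using the Node API rather than the CLI, the same workaround would look roughly like this (a sketch; I'm assuming the `check()` options mirror the CLI flags, so please double-check the option names against the README for your version):

```ts
import { LinkChecker } from 'linkinator';

async function main() {
  const checker = new LinkChecker();
  const results = await checker.check({
    path: './dist',
    recurse: true,
    concurrency: 1, // one request at a time, same idea as --concurrency 1
  });

  const broken = results.links.filter((link) => link.state === 'BROKEN');
  console.log(`Checked ${results.links.length} links, ${broken.length} broken.`);
  process.exit(results.passed ? 0 : 1);
}

main();
```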
Sites like Wikipedia throttle the incoming requests, yielding a `429` error. It may or may not be a broken link.

Idea: what if we improved the algorithm to slow down for `429`-throttling domains and tackled all `429` links in a separate, second round, per-domain but slower? (See the sketch at the end of this comment.)

Imagine: our outgoing requests go out as normal, respecting the `--concurrency` value, but when that pass completes, it extracts all `429` errors, groups them per throttling domain, waits a bit, then slowly tackles each link, let's say at 1 query per second (or slower), concurrently, per throttling domain.

Currently...
For example, I've got 1042 links and 11 of them link to Wikipedia. If I set `--concurrency` to satisfy Wikipedia, let's say 2 seconds per request, it will take 1042 × 2 / 60 ≈ 34 minutes, which is unbearable considering it's for 1% of the links!

If we implemented the feature, it would be 1031 × 0.01 + 11 × 2 ≈ 32 seconds. Reasonable, considering the current throttle of 100 req/sec takes 1042 × 0.01 ≈ 10 seconds.
I can exclude Wikipedia via `--skip`, but we can automate this, can't we?

What do you think?
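To make the idea concrete, here is a rough sketch of the two-pass approach (not an existing linkinator feature; it uses Node 18+ global `fetch` with HEAD requests and, for brevity, skips the normal `--concurrency` limiting in the first pass):

```ts
const SECOND_PASS_DELAY_MS = 1000; // roughly 1 query per second per throttling host

async function checkAll(urls: string[]): Promise<Map<string, number>> {
  const results = new Map<string, number>();

  // Pass 1: check everything at full speed and remember which links got a 429.
  const rateLimited: string[] = [];
  await Promise.all(
    urls.map(async (url) => {
      const status = (await fetch(url, { method: 'HEAD' })).status;
      if (status === 429) rateLimited.push(url);
      else results.set(url, status);
    })
  );

  // Pass 2: group the 429s by host and retry each group slowly,
  // with the different hosts handled concurrently.
  const byHost = new Map<string, string[]>();
  for (const url of rateLimited) {
    const host = new URL(url).host;
    byHost.set(host, [...(byHost.get(host) ?? []), url]);
  }
  await Promise.all(
    [...byHost.values()].map(async (hostUrls) => {
      for (const url of hostUrls) {
        await new Promise((resolve) => setTimeout(resolve, SECOND_PASS_DELAY_MS));
        results.set(url, (await fetch(url, { method: 'HEAD' })).status);
      }
    })
  );

  return results;
}
```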