Open buren opened 4 years ago
can you review my last 3 commits? https://github.com/chlorophyll-zz/wayback_archiver
Lowered concurrency to 1 and put a sleep(5) in url_collector.
Don't know whether url_collector is the right place, but the other method (--url) passes only a single URL, where no rate limiting is required.
Maybe this also works with a concurrency higher than 1.
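The approach described above can be sketched as a simple throttled loop; this is only an illustration, and `post_url` is a hypothetical stand-in for the actual Wayback Machine submit call:

```ruby
# Sketch: submit URLs one at a time with a fixed pause between them.
DELAY_SECONDS = 5

def submit_throttled(urls, delay: DELAY_SECONDS)
  urls.each_with_index do |url, i|
    post_url(url) # hypothetical stand-in for the real submit call
    # no need to sleep after the final URL
    sleep(delay) unless i == urls.length - 1
  end
end
```

With concurrency fixed at 1, this guarantees at most one submission per `delay` seconds.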
I tried your version of wayback_archiver, @chlorophyll-zz, but it still operates with a default concurrency of 5 and does not sleep. So it still gives a 429 after about 20 submits.
@snobjorn I have now increased the sleep time to 5 seconds to fix your specific problem. Yes, before that I raised the concurrency to 5 and lowered the wait time to 2, because concurrency 5 with a 2-second wait was running without 429s.
You said the requests did not wait in between; are you sure that you are using my fork?
here are the instructions to build and run my fork
git clone https://github.com/chlorophyll-zz/wayback_archiver
cd wayback_archiver
gem build wayback_archiver.gemspec
gem install wayback_archiver-1.3.0.gem
then run with
~/.gem/ruby/2.7.0/bin/wayback_archiver
or
~/.gem/ruby/2.6.0/bin/wayback_archiver
or simply
wayback_archiver
if you have the Ruby user bin directory (~/.gem/ruby/2.6.0/bin/) in your PATH and gem installed the gem there.
I started over and tried exactly what you wrote, @chlorophyll-zz , but it still pushed 5 links at a time, and does not wait in between.
I changed the default concurrency back to 1 and increased the sleep time to 5 seconds. Can you also give me a log of the build, when you can? Thanks in advance.
Seems like they've introduced rate limiting in two steps.
5 requests a minute is, to say the least, not great (see this wiki).
Will try to look at some mitigation options (updating the default concurrency, perhaps adding a sleep call, etc.).
Mitigation options
UPDATE: Difference between 200 and 429:
HTTP 200, OK
headers:
{
"server": "nginx/1.15.8",
"date": "Tue, 28 Jan 2020 15:00:46 GMT",
"content-type": "text/html;charset=utf-8",
"transfer-encoding": "chunked",
"connection": "close",
"content-location": "/web/20200128150045/https://www.example.com/notsosecret/",
"set-cookie": "JSESSIONID=3AFB1D7EE70F9ED7BB7E02BEC3AA325C; Path=/; HttpOnly",
"x-archive-orig-link": "<https://www.example.com/wp-json/>; rel=\"https://api.w.org/\", <https://www.example.com/?p=1579>; rel=shortlink",
"x-archive-orig-strict-transport-security": "max-age=31536000; includeSubdomains;",
"x-archive-orig-vary": "User-Agent,Accept-Encoding",
"x-archive-guessed-charset": "UTF-8",
"x-archive-orig-server": "Apache",
"x-archive-orig-connection": "close",
"x-archive-orig-content-type": "text/html; charset=UTF-8",
"x-archive-orig-x-powered-by": "PleskLin",
"x-archive-orig-cache-control": "max-age=0, no-store",
"x-archive-orig-date": "Tue, 28 Jan 2020 15:00:46 GMT",
"x-app-server": "wwwb-app0",
"x-ts": "200",
"x-cache-key": "httpsweb.archive.org/save/https://www.example.com/global-medicinteknik/SE",
"x-page-cache": "MISS",
"x-location": "save-get"
}
HTTP 429, Too Many Requests
headers:
{
"server": "nginx/1.15.8",
"date": "Tue, 28 Jan 2020 15:00:48 GMT",
"content-type": "text/html",
"content-length": "487",
"connection": "close",
"etag": "\"5db9ab48-1e7\""
}
I have had good experience with a request every 5 seconds without a 429; that was less than a week ago. I measured, and I was also able to make 5 concurrent requests every 5 seconds. For users with a server infrastructure, it is no big deal to set up a daemon to scrape a list of pages. And for private users, Save Page Now has a similar feature called archive outlinks.
I had some luck using the block executed for each URL to sleep between requests:
require 'wayback_archiver'

WaybackArchiver.concurrency = 1
WaybackArchiver.archive('example.com', strategy: :auto) do |result|
  if result.success?
    puts "Successfully archived: #{result.archived_url}"
  else
    puts "Error (HTTP #{result.code}) when archiving: #{result.archived_url}"
  end
  sleep(5) # sleep 5 seconds after each request
end
🔗 Here's how another similar-ish tool handles HTTP 429 – Too Many Requests.
Wouldn't be that tricky to implement something similar.
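Something similar could look like the minimal sketch below. It assumes the submit code raises a (hypothetical) TooManyRequests error that may carry the server's Retry-After value; none of these names exist in wayback_archiver today:

```ruby
# Hypothetical error type carrying an optional Retry-After value (in seconds).
class TooManyRequests < StandardError
  attr_reader :retry_after

  def initialize(retry_after = nil)
    super("HTTP 429 Too Many Requests")
    @retry_after = retry_after
  end
end

# Retry a throttled block a few times, backing off between attempts.
def with_backoff(max_retries: 4, base: 2)
  attempts = 0
  begin
    yield
  rescue TooManyRequests => e
    attempts += 1
    raise if attempts > max_retries
    # Honour Retry-After when the server sends it, else back off exponentially.
    sleep(e.retry_after || base**attempts)
    retry
  end
end
```

A caller would then wrap each submission, e.g. `with_backoff { submit(url) }`, so a throttled request is retried a few times before giving up.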
5 requests a minute is, to say the least, not great (see this wiki). Will try to look at some mitigation options (updating default concurrency, perhaps add a sleep call etc).
5 requests a minute is probably acceptable for many sites: that's 300 URLs an hour. If someone has fewer than a few thousand URLs which do not change on a daily basis, why is this a major problem? It can run in a cron job overnight.
I have experimented with sleep(13), so as to be sure it stays well under 5 requests per minute. This revealed a separate issue, which I will report, but wayback_archiver did get considerably further.
I put the sleep() in archive.rb's self.post, inside the pool.post do loop. I suspect other people inserting sleep(), as discussed in this GitHub issue, may have been adding it in a less useful place.
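An illustrative sketch of why placement matters: the sleep has to run inside the loop that performs the requests, so every submission is delayed, not just the code that queues URLs. This is not the actual archive.rb code; `post_url` is a hypothetical stand-in for the real submit call:

```ruby
# Drain a queue of URLs with a fixed number of workers, pausing after each
# request. Sleeping here, inside the worker loop, throttles every submission.
def post_all(urls, concurrency: 1, delay: 13)
  queue = Queue.new
  urls.each { |u| queue << u }
  threads = Array.new(concurrency) do
    Thread.new do
      loop do
        url = begin
          queue.pop(true) # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break
        end
        post_url(url) # hypothetical stand-in for the real submit call
        sleep(delay)  # throttle here, inside the loop that does the requests
      end
    end
  end
  threads.each(&:join)
end
```

Sleeping outside this loop (e.g. after queuing all the URLs) would delay nothing, which may explain why some of the earlier attempts in this thread still hit 429s.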
Dan
It appears the rate limiting is 15 requests/minute, with a 5 minute block for the IP address exceeding this: https://archive.org/details/toomanyrequests_20191110
Feature request: is there a way to add a CLI parameter so the user can set the rate (number of pages submitted per minute)?
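Such a flag could be parsed with Ruby's standard OptionParser; this is a hypothetical sketch only (wayback_archiver does not currently ship a --rate flag), showing how a per-minute rate could be turned into a sleep interval:

```ruby
require "optparse"

# Default of 15/minute matches the limit reported above; the flag overrides it.
options = { rate: 15 }
OptionParser.new do |opts|
  opts.on("--rate N", Integer, "maximum submissions per minute") do |n|
    options[:rate] = n
  end
end.parse!(ARGV)

interval = 60.0 / options[:rate] # seconds to sleep between submissions
```

For example, `--rate 5` would yield a 12-second pause between submissions.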
The Internet Archive has started to rate limit requests more aggressively; we get blocked after just a dozen or so requests (with the default concurrency setting of 5). After some testing we even get rate limited with concurrency set to 1. To fix this we have to implement a way to throttle requests in order to successfully submit all URLs.
🔗 Related to #22.