buren / wayback_archiver

Ruby gem to send URLs to the Wayback Machine
https://rubygems.org/gems/wayback_archiver
MIT License

Rate limiting – HTTP 429, Too Many Requests #32

Open buren opened 4 years ago

buren commented 4 years ago

The Internet Archive has started to rate limit requests more aggressively; we now get HTTP 429 after just a dozen or so requests (with the default concurrency setting of 5).

After some testing, we get rate limited even with concurrency set to 1.

To fix this we have to implement a way to throttle requests so that all URLs can be submitted successfully.

🔗 Related to #22.
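
For illustration, throttling could be as simple as capping the submission rate client-side. A minimal standalone sketch, not the gem's internals; the rate value is an assumption:

require 'net/http'
require 'uri'

REQUESTS_PER_MINUTE = 4 # assumption: a rate safely below the observed limit
DELAY = 60.0 / REQUESTS_PER_MINUTE

urls = ['https://www.example.com/', 'https://www.example.com/about']

urls.each do |url|
  # Submit the URL to the Wayback Machine's save endpoint
  response = Net::HTTP.get_response(URI("https://web.archive.org/save/#{url}"))
  puts "#{response.code} #{url}"
  sleep(DELAY) # throttle so we never exceed REQUESTS_PER_MINUTE
end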

bartman081523 commented 4 years ago

Can you review my last 3 commits? https://github.com/chlorophyll-zz/wayback_archiver

I lowered concurrency to 1 and put a sleep(5) in url_collector.

I don't know whether url_collector is the right place, but the other method, --url, passes only a single URL, so no rate limiting is required there.

Maybe this also works with a concurrency higher than 1.
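
The change described might look roughly like this inside the URL-collection loop (a sketch; the fork's actual diff may differ):

require 'net/http'
require 'uri'

collected_urls = ['https://www.example.com/'] # URLs gathered by url_collector

collected_urls.each do |url|
  # Submit one URL to the Wayback Machine's save endpoint
  Net::HTTP.get_response(URI("https://web.archive.org/save/#{url}"))
  sleep(5) # pause 5 seconds between submissions to avoid HTTP 429
end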

snobjorn commented 4 years ago

I tried your version of wayback_archiver, @chlorophyll-zz, but it still operates with a default concurrency of 5 and does not "sleep". So it still gives a 429 after about 20 submits.

bartman081523 commented 4 years ago

@snobjorn I have now increased the sleep time to 5 seconds to fix your specific problem. Yes, before that I had raised the concurrency to 5 and lowered the wait time to 2, because concurrency 5 with a 2-second wait was running without 429s.

You said the requests did not wait in between; are you sure you are using my fork?

Here are the instructions to build and run my fork:

git clone https://github.com/chlorophyll-zz/wayback_archiver
cd wayback_archiver
gem build wayback_archiver.gemspec
gem install wayback_archiver-1.3.0.gem

Then run it with ~/.gem/ruby/2.7.0/bin/wayback_archiver or ~/.gem/ruby/2.6.0/bin/wayback_archiver (depending on your Ruby version), or just wayback_archiver if you have the Ruby user bin directory (~/.gem/ruby/2.6.0/bin/) in your PATH and gem installed the gem there.

snobjorn commented 4 years ago

I started over and tried exactly what you wrote, @chlorophyll-zz, but it still pushed 5 links at a time and did not wait in between.

bartman081523 commented 4 years ago

I changed the default concurrency back to 1 and increased the sleep time to 5 seconds. Can you give me a log when you can, and one of the build too? Thanks in advance.

buren commented 4 years ago

Seems like they've introduced rate limiting in two steps.

5 requests a minute is, to say the least, not great (see this wiki).

Will try to look at some mitigation options (updating the default concurrency, perhaps adding a sleep call, etc.).

UPDATE:

Difference between 200 and 429:

HTTP 200, OK headers:

{
  "server": "nginx/1.15.8",
  "date": "Tue, 28 Jan 2020 15:00:46 GMT",
  "content-type": "text/html;charset=utf-8",
  "transfer-encoding": "chunked",
  "connection": "close",
  "content-location": "/web/20200128150045/https://www.example.com/notsosecret/",
  "set-cookie": "JSESSIONID=3AFB1D7EE70F9ED7BB7E02BEC3AA325C; Path=/; HttpOnly",
  "x-archive-orig-link": "<https://www.example.com/wp-json/>; rel=\"https://api.w.org/\", <https://www.example.com/?p=1579>; rel=shortlink",
  "x-archive-orig-strict-transport-security": "max-age=31536000; includeSubdomains;",
  "x-archive-orig-vary": "User-Agent,Accept-Encoding",
  "x-archive-guessed-charset": "UTF-8",
  "x-archive-orig-server": "Apache",
  "x-archive-orig-connection": "close",
  "x-archive-orig-content-type": "text/html; charset=UTF-8",
  "x-archive-orig-x-powered-by": "PleskLin",
  "x-archive-orig-cache-control": "max-age=0, no-store",
  "x-archive-orig-date": "Tue, 28 Jan 2020 15:00:46 GMT",
  "x-app-server": "wwwb-app0",
  "x-ts": "200",
  "x-cache-key": "httpsweb.archive.org/save/https://www.example.com/global-medicinteknik/SE",
  "x-page-cache": "MISS",
  "x-location": "save-get"
}

HTTP 429, Too Many Requests headers:

{
  "server": "nginx/1.15.8",
  "date": "Tue, 28 Jan 2020 15:00:48 GMT",
  "content-type": "text/html",
  "content-length": "487",
  "connection": "close",
  "etag": "\"5db9ab48-1e7\""
}

bartman081523 commented 4 years ago

I have had good experience with one request every 5 seconds without a 429; that was less than a week ago. I measured, and I was also able to make 5 concurrent requests every 5 seconds. For users with server infrastructure, it is no big deal to set up a daemon to scrape a list of pages. And for private users, Save Page Now has a similar feature called "archive outlinks".

delucis commented 3 years ago

I had some luck using the block executed for each URL to sleep between requests:

require 'wayback_archiver'

WaybackArchiver.concurrency = 1
WaybackArchiver.archive('example.com', strategy: :auto) do |result|
  if result.success?
    puts "Successfully archived: #{result.archived_url}"
  else
    puts "Error (HTTP #{result.code}) when archiving: #{result.archived_url}"
  end
  sleep(5) # sleep 5 seconds after each request
end

buren commented 3 years ago

🔗 Here's how another similar-ish tool handles HTTP 429 – Too Many Requests.

Wouldn't be that tricky to implement something similar.
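
For example, retrying with exponential back-off when a 429 comes back might look like this (a sketch; the delays and retry count are assumptions, not the gem's behavior):

require 'net/http'
require 'uri'

def save_with_backoff(url, max_attempts: 5)
  delay = 5 # seconds; doubled after every 429
  max_attempts.times do
    response = Net::HTTP.get_response(URI("https://web.archive.org/save/#{url}"))
    return response unless response.code == '429'

    # The 429 response above carries no Retry-After header,
    # so the back-off interval has to be a client-side guess.
    sleep(delay)
    delay *= 2
  end
  nil # give up after max_attempts
end

save_with_backoff('https://www.example.com/')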

danshearer commented 2 years ago

> 5 requests a minute is, to say the least, not great (see this wiki).
>
> Will try to look at some mitigation options (updating the default concurrency, perhaps adding a sleep call, etc.).

5 requests a minute is probably acceptable for many sites: that's 300 URLs an hour. If someone has fewer than a few thousand URLs which do not change on a daily basis, then why is this a major problem? It can run in a cron job overnight.

I have experimented with sleep(13), so as to be sure of staying safely below 5 requests per minute. This revealed a separate issue I will report, but wayback_archiver did get considerably further.

I put the sleep() in archive.rb's self.post, inside the pool.post do loop. I suspect other people inserting sleep(), as discussed in this GitHub issue, may have been adding it in a less useful place.
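
That placement would look roughly like this, assuming archive.rb submits each request from a concurrent-ruby thread pool (a sketch with assumed names; submit_url is a hypothetical stand-in for the gem's internal request call):

require 'concurrent'
require 'net/http'
require 'uri'

# Hypothetical stand-in for the gem's internal request call
def submit_url(request_url)
  Net::HTTP.get_response(URI(request_url))
end

urls = ['https://www.example.com/']
pool = Concurrent::FixedThreadPool.new(1) # concurrency 1

urls.each do |url|
  pool.post do
    submit_url("https://web.archive.org/save/#{url}")
    sleep(13) # sleeping inside the worker keeps us under 5 requests/minute
  end
end

pool.shutdown
pool.wait_for_termination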

Dan

dbader13 commented 1 year ago

It appears the rate limiting is 15 requests/minute, with a 5-minute block for any IP address exceeding this: https://archive.org/details/toomanyrequests_20191110

Feature request: is there a way to add a CLI parameter for the user to set the rate (number of pages submitted per minute)?
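
A hypothetical sketch of what such a flag could look like (the name --requests-per-minute and its default are assumptions, not an existing option):

require 'optparse'

options = { requests_per_minute: 15 } # assumed default, matching the observed limit
OptionParser.new do |opts|
  opts.on('--requests-per-minute N', Integer,
          'Maximum number of pages submitted per minute') do |n|
    options[:requests_per_minute] = n
  end
end.parse!

delay = 60.0 / options[:requests_per_minute]
puts "Waiting #{delay.round(1)}s between submissions"
# each submission would then be followed by: sleep(delay)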