calgo-lab / green-db

The monorepo that powers the GreenDB.
https://calgo-lab.github.io/green-db/

Amazon blocking and retrying failed requests #110

Open BigDatalex opened 1 year ago

BigDatalex commented 1 year ago

There are two ways we notice that Amazon is blocking our requests:

  1. The response has a 200 status code, but the page content asks the client to solve a puzzle to prove it is human.
  2. The response has a 503 status code.
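The two cases above could be told apart with a small helper like the following. This is only a sketch: the function name `classify_response` and the strings in `CAPTCHA_MARKERS` are assumptions for illustration, not something taken from the GreenDB code, and the real puzzle page may need different markers.

```python
from typing import Iterable

# Hypothetical marker strings; the real puzzle page may use different wording.
CAPTCHA_MARKERS = ("validateCaptcha", "Enter the characters you see below")


def classify_response(
    status: int, body: str, markers: Iterable[str] = CAPTCHA_MARKERS
) -> str:
    """Return 'blocked_503', 'blocked_captcha', 'ok', or 'other'."""
    if status == 503:
        # Case 2: blocking is visible directly in the status code.
        return "blocked_503"
    if status == 200:
        # Case 1: the status looks fine, so the body has to be inspected.
        lowered = body.lower()
        if any(marker.lower() in lowered for marker in markers):
            return "blocked_captcha"
        return "ok"
    return "other"
```

The 503 case needs only the status code, while the 200 case requires reading the body, which is why the second check is more expensive to get right.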

The first case usually occurs before the 503 errors start. While working on the latest PR, #109, I noticed that the blocking is temporary, not permanent: after some 503 errors for some start URLs, we again receive 200 responses that contain the expected page content.

In PR #100 the Retry Middleware was disabled for Amazon because it interferes with the custom AmazonSchedulerMiddleware. It would be great to add a Retry Middleware back in that retries the failed requests at some later point. The easiest approach might be to reschedule only the requests that returned 503 errors; the other requests would need some additional processing to check for the unexpected page content.
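To illustrate the "retry at some later point" idea, here is a minimal, framework-independent sketch of a delayed retry queue for URLs that returned 503. The class `DelayedRetryQueue`, its parameters `base_delay` and `max_retries`, and the exponential backoff are all assumptions made for this example; it does not show how this would be wired into Scrapy or the AmazonSchedulerMiddleware.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DelayedRetryQueue:
    base_delay: float = 60.0  # assumed: seconds to wait before the first retry
    max_retries: int = 3      # assumed: give up after this many retries
    # Min-heap of (ready_time, url, attempt_number), ordered by ready_time.
    _heap: List[Tuple[float, str, int]] = field(init=False, default_factory=list)

    def add_failed(self, url: str, now: float, attempt: int = 0) -> bool:
        """Queue a URL that returned 503; return False if retries are exhausted."""
        if attempt >= self.max_retries:
            return False
        # Double the waiting time with every attempt already made.
        delay = self.base_delay * (2 ** attempt)
        heapq.heappush(self._heap, (now + delay, url, attempt + 1))
        return True

    def pop_due(self, now: float) -> List[Tuple[str, int]]:
        """Return (url, attempt) pairs whose waiting time has elapsed."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, url, attempt = heapq.heappop(self._heap)
            due.append((url, attempt))
        return due
```

A caller would pass the current time (for example from `time.monotonic()`) to `add_failed` when a 503 arrives and poll `pop_due` before scheduling new requests. Passing the time in explicitly keeps the queue easy to test.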