cocrawler / cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Apache License 2.0

myrequests.py gets stuck in a loop if the response status is always 429, 500, 502, 503, 504, 509 #25

Closed: Medstaar closed this issue 2 years ago

Medstaar commented 2 years ago

OS: Windows 10, Python 3.8.5, cdx_toolkit 0.9.34

I'm having an issue where cdx_toolkit gets stuck in a loop and constantly prints "cdx_toolkit.myrequests myrequests.py: 62 : retrying after 1s for 500". I've tracked this down to this line in myrequests.py. If I am reading this correctly, when the response status is always one of 429, 500, 502, 503, 504, or 509, the code never leaves this retry loop.

I suggest that after line 62 we break out of the loop if the number of retries is greater than 5.

wumpus commented 2 years ago

That's intentional. Common Crawl is having an outage. You'll note that after 5 cycles it prints a visible-by-default warning. I added that after a previous complaint from a user whose long-running job crashed during a short Common Crawl outage.

I have notified the Common Crawl engineer. He works German hours, so it might not be fixed for 12 hours.
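
For readers following along, here is a minimal sketch of the two policies discussed above. It is not the code in myrequests.py. The function name, the parameters, and the injected fetch callable are all hypothetical, and the exact point at which the warning appears is an assumption. With max_retries=None it retries forever and raises the log level after a few cycles, as described in this comment. With an integer it gives up, as suggested in the original report.

```python
import logging
import time

LOGGER = logging.getLogger(__name__)

# Status codes listed in the issue title.
RETRY_STATUSES = {429, 500, 502, 503, 504, 509}


def fetch_with_retries(fetch, warn_after=5, max_retries=None,
                       delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a status outside RETRY_STATUSES.

    fetch       -- hypothetical zero-argument callable returning a status code
    warn_after  -- log at WARNING once this many retries have happened (assumed)
    max_retries -- None retries forever; an integer raises after that many
    """
    retries = 0
    while True:
        status = fetch()
        if status not in RETRY_STATUSES:
            return status
        if max_retries is not None and retries >= max_retries:
            raise RuntimeError(
                f"giving up after {retries} retries, last status {status}")
        retries += 1
        # Stay quiet for short outages, become visible for long ones.
        level = logging.WARNING if retries >= warn_after else logging.DEBUG
        LOGGER.log(level, "retrying after %.0fs for %d", delay, status)
        sleep(delay)
```

The tradeoff is the one raised in this thread: a bounded loop fails fast during an outage, while an unbounded loop with a visible warning lets a long-running job survive a short outage.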

wumpus commented 2 years ago

I've added code that, once CC stops being overloaded, makes it clear that the 500 status results are actually 503 "slow down" responses from Amazon for the actual files of the index.

Medstaar commented 2 years ago

Thanks for the clarification @wumpus, I didn't realize there was an outage. I will close the ticket and try again to see if it's working okay.