Imageomics / cautious-robot

Simple downloader for images listed in a CSV that computes and records checksums on the downloaded image folder.
MIT License

Update retry algorithm to be more robust #6

Open egrace479 opened 3 months ago

egrace479 commented 3 months ago

> That being said, retry algorithms (at least robust ones) for internet protocols are normally written with an exponential escalation of wait time (such as 1, 2, 4, 8, 16, 32, ... seconds). In that case, a user may want to specify at which point to give up and log a failure, for example --max-retries and/or --max-wait.

_Originally posted by @hlapp in https://github.com/Imageomics/cautious-robot/pull/1#discussion_r1626575092_

Reply: It waits after a failed attempt if the response is any of the following: 429, 500, 502, 503, or 504. It doesn't wait between successful downloads. It has a maximum number of times to retry on the designated responses, but otherwise just logs the response in the error log (along with the index, filename, and URL).

Setting a maximum wait time on a request would probably be a good idea as well. urllib3.request seems to handle much of this when also passed a Retry object. @thompsonmj had also suggested HTTPAdapter as an option that also uses Retry.

Using HTTPAdapter seems reasonable, since cautious-robot already uses requests. We must also consider streaming interruption, as noted here.
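For illustration, wiring the two together might look like the sketch below. The function name `make_session` and the parameter names are placeholders mapped onto the proposed `--max-retries` option, not cautious-robot's actual code:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(max_retries=5, backoff_factor=1):
    # Hypothetical helper: build a requests.Session that retries with
    # exponential backoff on the status codes discussed above.
    retry = Retry(
        total=max_retries,               # give up after this many retries
        backoff_factor=backoff_factor,   # exponential waits between attempts
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

With this, `session.get(url)` retries transparently on the listed statuses, so the download loop itself needs no retry logic for those cases.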

johnbradley commented 1 month ago

The requests HTTPAdapter with the urllib3 Retry strategy looks good for some of the retry needs. The streaming interruption will still need to be handled separately, though.

johnbradley commented 1 month ago

Sometimes when downloading files we end up reaching a threshold where our IP address gets blocked for a while by a remote server. In that case you typically have to wait for a few hours. I wouldn't expect or want the command to wait in this scenario. For that scenario can we re-run the cautious-robot command and have it skip already downloaded images?

egrace479 commented 1 month ago

> Sometimes when downloading files we end up reaching a threshold where our IP address gets blocked for a while by a remote server. In that case you typically have to wait for a few hours. I wouldn't expect or want the command to wait in this scenario. For that scenario can we re-run the cautious-robot command and have it skip already downloaded images?

Right now I believe it relies on adjusting the start index to avoid re-downloading images. However, I could add a line here checking for the image:

    if os.path.exists(image_dir_path / image_name):
        continue  # skip images already downloaded on a previous run