
ESPRI-Mod / synda

ESGF Downloader (this is a deprecated repository, the tool has now moved to https://github.com/ESGF/esgf-download)

https://espri-mod.github.io/synda/


Automatic retries #74

Closed painter1 closed 2 years ago

painter1 commented 7 years ago

In a large download job, i.e. one with hundreds of files, it is almost inevitable that a few downloads will fail. These should be retried automatically. It shouldn't be necessary to type "synda retry". But a failed download should not be retried immediately. Wait until the rest of the job finishes, or until a few minutes have passed, whichever comes last.
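A minimal sketch of the timing rule described above, not synda's actual behaviour. The function name `earliest_retry_time` and the five-minute default are assumptions made for illustration; the request only says "a few minutes".

```python
from datetime import datetime, timedelta


def earliest_retry_time(failed_at: datetime,
                        job_finished_at: datetime,
                        min_delay: timedelta = timedelta(minutes=5)) -> datetime:
    """Return the earliest moment a failed download may be retried.

    The retry waits for BOTH conditions: the rest of the job has finished
    and a minimum delay has passed since the failure. "Whichever comes
    last" is therefore the later of the two timestamps.
    The 5-minute default is an assumed value, not taken from synda.
    """
    return max(job_finished_at, failed_at + min_delay)
```

With this rule, a failure shortly before the job ends is still held back for the minimum delay, while an early failure in a long job is held back until the job is done.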

It also would be helpful if "synda retry" took an optional argument which specified which job to retry, or even which file.

glevava commented 7 years ago

synda retry just changes the download status of ALL the pending files from "error" to "waiting". Indeed, synda retry is "global" and cannot retry a particular download (maybe a feature to keep in mind).

As a workaround at IPSL, we add synda retry in a cronjob to periodically retry failed downloads.
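A rough sketch of what such a periodic job could call, assuming the `synda` executable is on the PATH of the cron user. The wrapper function `retry_failed_downloads`, the script location, and the schedule in the comment are hypothetical, not the IPSL setup.

```python
import subprocess


def retry_failed_downloads(run=subprocess.run) -> int:
    """Invoke `synda retry` and return its exit code.

    `run` is injectable so the wrapper can be exercised without synda
    installed; by default it is subprocess.run.
    """
    # `synda retry` is the command named in this thread; it takes no
    # job or file argument here because the retry is global.
    result = run(["synda", "retry"], check=False)
    return result.returncode


# Hypothetical crontab entry (path and 30-minute interval are assumptions):
#   */30 * * * * /usr/bin/python3 /path/to/synda_retry.py
# where synda_retry.py calls retry_failed_downloads().
```

A plain crontab line that runs `synda retry` directly would do the same; the wrapper is only useful if you want to log or act on the exit code.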

Automated retrying is not always desirable, especially when ESGF datanodes are down for a long time. Download errors are often due to unreachable servers; retrying a download immediately won't solve the issue and could lead synda to focus on a download that fails again and again.

In addition, analysing failed downloads lets us contact the datanode managers about potential issues, which is a recurring situation.

painter1 commented 7 years ago

Five years ago, I was doing production-scale replication, often with thousands of files in "waiting" mode at any one time. I saw that there would be an occasional failure from almost every server (*). Automatic retries saved me a lot of time. They were done on a per-download-job basis. After everything had been tried once, there would be a retry cycle in which the failures were retried. Then another retry cycle, and another, until no data came. Then the script would give up. This ended quickly when the server was down, which in any case was a rare event. The other download jobs were unaffected.
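The retry cycles described above can be sketched roughly as follows. This is an illustration of the idea, not the original script: the function name `download_with_retry_cycles` and the `fetch` callback (which returns True on success) are assumptions.

```python
from typing import Callable, Iterable, List


def download_with_retry_cycles(files: Iterable[str],
                               fetch: Callable[[str], bool]) -> List[str]:
    """Try every file once, then retry the failures in cycles.

    The loop stops when everything has arrived, or when a whole cycle
    brings in no new file ("until no data came"). Returns the files
    that still failed when the loop gave up.
    """
    pending = list(files)
    while pending:
        # One cycle: attempt every file that is still outstanding.
        failed = [name for name in pending if not fetch(name)]
        if len(failed) == len(pending):
            # Nothing succeeded in this cycle, e.g. the server is down.
            return failed
        pending = failed
    return []
```

Because each job runs its own loop, a server that stays down ends that job after one fruitless cycle and leaves other download jobs unaffected.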

I agree about the value of being able to analyze failures.

(*) That remains true - yesterday there were four failures on a download of a few hundred files. All arrived in a single retry.

SebastienDenvil commented 7 years ago

Automatic retries could easily be done using a crontab as part of a housekeeping procedure. We have been doing production-scale replication with synda and this has never been a limitation. Inconsistency of search results between index nodes has been much more problematic, for example.