DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0

Feature request: make delocalization retry quantity user-settable or related to duration of job #162

Open carbocation opened 5 years ago

carbocation commented 5 years ago

I have a job (dsub version 0.3.2) that took ~6 hours and completed successfully (per the stdout log file), but failed to delocalize:

Job: stage1--jamesp--190616-035330-93
Launched job-id: stage1--jamesp--190616-035330-93
To check the status, run:
  dstat --provider google-v2 --project ukbb-analyses --jobs 'stage1--jamesp--190616-035330-93' --users 'jamesp' --status '*'
To cancel the job, run:
  ddel --provider google-v2 --project ukbb-analyses --jobs 'stage1--jamesp--190616-035330-93' --users 'jamesp'
Waiting for job to complete...
  stage1--jamesp--190616-035330-93 (attempt 1) failed. Retrying.
  Failure message: ServiceException: 401 Anonymous caller does not have storage.objects.list access to ukbb_v2.
ServiceException: 401 Anonymous caller does not have storage.objects.list access to ukbb_v2.
ServiceException: 401 Anonymous caller does not have storage.objects.list access to ukbb_v2.

This looks like one of those transient API or credentials errors: two retries were attempted, and then the job failed outright. The job is now re-running.

Given that the job took 6 hours, I'd have been willing to let it keep retrying for 30+ minutes rather than re-running the whole thing, particularly for this type of error, which happens so often.

So, either a user-settable retry count or a job-duration-aware retry duration might be a nice feature.
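To make the ask concrete, the duration-aware version could simply scale the number of delocalization attempts with the elapsed job time. A rough sketch of that idea (the defaults here are purely illustrative, not existing dsub options):

```python
def delocalization_attempts(job_elapsed_seconds,
                            base_attempts=3,
                            seconds_per_extra_attempt=3600,
                            max_attempts=20):
    """Scale the number of delocalization retries with job duration.

    A job that ran for hours earns more attempts than a five-minute job,
    so a transient credentials/API error at the end doesn't throw away
    the whole run. All defaults here are illustrative, not dsub settings.
    """
    extra = int(job_elapsed_seconds // seconds_per_extra_attempt)
    return min(base_attempts + extra, max_attempts)

# Under these defaults, a 6-hour job would get 3 + 6 = 9 attempts.
print(delocalization_attempts(6 * 3600))
```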

carbocation commented 3 years ago

Just got bitten again by what seems to be the collision of (a) transient Google Storage errors occurring at (b) the end of a dsub / Google Pipelines job. This time it hit a wide range of jobs that finished around the same time, representing approximately 16,000 CPU-hours of compute that will have to be re-done. Just wanted to mention it because this seems to be an occasional but recurring problem.

I guess I could delocalize manually and wait/retry within my script wrapper, but that feels a bit hacky since this is a general (albeit uncommon) issue.
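Something like the following is what I have in mind for that wrapper: retry gsutil rsync with exponential backoff before the task exits (the paths and retry parameters are placeholders):

```python
import subprocess
import time

def delocalize_with_retry(local_dir, gcs_dir, attempts=10, base_delay=30):
    """Copy task outputs to GCS, retrying on transient gsutil failures.

    Sleeps with exponential backoff (capped at 15 minutes) so a brief
    credentials or API hiccup doesn't fail the whole task.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(
            ["gsutil", "-m", "rsync", "-r", local_dir, gcs_dir])
        if result.returncode == 0:
            return
        delay = min(base_delay * 2 ** (attempt - 1), 900)
        print(f"delocalization attempt {attempt} failed; retrying in {delay}s")
        time.sleep(delay)
    raise RuntimeError(f"delocalization failed after {attempts} attempts")

# Example (paths are placeholders):
# delocalize_with_retry("/mnt/data/output", "gs://my-output-bucket/results")
```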

carbocation commented 3 years ago

Interestingly, this wasn't resolved by re-running. I confirmed that the disk is ~10x larger than the output files. However, only VMs running on the smallest files actually delocalize the complete dataset.

Is there some kind of time limit on how long delocalization is allowed to run before it gets forcibly terminated? It didn't appear so, but it's a bit odd that VMs running on chr22 delocalize completely while, e.g., those running on the larger chr10 get truncated outputs.
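To pin down exactly where the truncation happens, I could compare local output sizes against what actually lands in GCS, e.g. with a quick script like this (bucket and prefix names are placeholders):

```python
import os
from google.cloud import storage  # pip install google-cloud-storage

def report_truncated_outputs(local_dir, bucket_name, prefix):
    """Print local output files whose GCS copy is missing or smaller."""
    client = storage.Client()
    remote_sizes = {blob.name: blob.size
                    for blob in client.list_blobs(bucket_name, prefix=prefix)}
    for root, _, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, local_dir)
            blob_name = f"{prefix.rstrip('/')}/{rel}"
            local_size = os.path.getsize(path)
            if remote_sizes.get(blob_name) != local_size:
                print(f"{rel}: local={local_size}, gcs={remote_sizes.get(blob_name)}")

# Example (all names are placeholders):
# report_truncated_outputs("/mnt/data/output", "my-output-bucket", "stage1/outputs")
```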

mbookman commented 3 years ago

Hi @carbocation !

In reviewing this, we are a little unclear on where the failure is actually occurring. Do you have a log file from the VM, a dstat error message, or the underlying pipeline operation? There are many retries built into both dsub and gsutil, so it will be useful to understand specifically where the failure happened.

Is the problem you reported last week actually the same as the one reported in 2019? There have been many changes since then, including more/different retries.

Thanks!

carbocation commented 3 years ago

Thanks for taking a look. I'm using the google-cls-v2 provider, which (as far as I can tell) only lets me fetch statuses for the past ~100 jobs, so that information is lost. Since it's reproducible, I'll try running just one specific task where this happens so that I can give you a more useful set of logs (dsub & GCP).