carbocation opened this issue 5 years ago
Just got bit by what seems to be the collision of (a) transient Google Storage errors occurring at (b) the end of a dsub / Google Pipelines job again. This time it hit a wide range of jobs that finished around the same time. These jobs represented approximately 16,000 CPU-hours of compute, which will have to be re-done. Just wanted to mention it because this seems to be an occasional but recurring problem.
I guess I could manually delocalize and wait/retry within my script wrapper. It just feels a bit hacky since it's a general issue (albeit an uncommon one).
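For concreteness, here's the kind of wait/retry wrapper I have in mind (a rough sketch only; the bucket path, retry count, and backoff values are placeholders):

```python
# Rough sketch of a wait/retry delocalization wrapper (illustrative only).
# Assumes outputs are written to a local directory and pushed with `gsutil cp`;
# the paths, retry count, and backoff values below are placeholders.
import subprocess
import time

def delocalize_with_retry(local_path, gcs_path, max_attempts=5, base_delay=30):
    """Copy local outputs to GCS, retrying transient failures with backoff."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(["gsutil", "-m", "cp", "-r", local_path, gcs_path])
        if result.returncode == 0:
            return
        if attempt < max_attempts:
            delay = base_delay * (2 ** (attempt - 1))  # exponential backoff
            print(f"gsutil cp failed (attempt {attempt}/{max_attempts}); retrying in {delay}s")
            time.sleep(delay)
    raise RuntimeError(f"Failed to delocalize {local_path} after {max_attempts} attempts")

if __name__ == "__main__":
    delocalize_with_retry("/mnt/data/output/", "gs://my-bucket/results/")
```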
Interestingly, this wasn't resolved by re-running. I confirmed that the disk is ~10x larger than the output files. However, only VMs running on the smallest files actually delocalize the complete dataset.
Is there some kind of time limit on how long delocalization is allowed to run before it gets forcibly terminated? It doesn't appear so, but it's a bit odd that VMs processing chr22 delocalize completely while, e.g., those processing the larger chr10 end up with truncated output.
Hi @carbocation!
In reviewing this, we are a little unclear on where the failure is actually occurring. Do you have a log file from the VM, a `dstat` error message, or the underlying pipeline operation? There are many retries built into both `dsub` and `gsutil`, so it will be useful to understand specifically where the failure was.
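(As an aside: if you do end up copying outputs yourself in a wrapper, gsutil's retry count can be adjusted through its Boto configuration, for example the `num_retries` setting, which can also be overridden per invocation with gsutil's top-level `-o` flag. A rough sketch, with placeholder paths:)

```python
# Illustrative only: override the Boto "num_retries" setting for a single
# gsutil invocation. The source/destination paths are placeholders.
import subprocess

subprocess.run(
    [
        "gsutil",
        "-o", "Boto:num_retries=10",  # retries gsutil attempts on transient errors
        "-m", "cp", "-r",
        "/mnt/data/output/",
        "gs://my-bucket/results/",
    ],
    check=True,
)
```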
Is the problem you reported last week actually the same as the one reported in 2019? There have been many changes since then, including more and different retries.
Thanks!
Thanks for taking a look. I'm using the `google-cls-v2` provider, which only allows me to fetch statuses for the past ~100 jobs (as far as I can tell), so this information is lost. Since it's reproducible, I'll try to run just one specific task where this is happening, so that I can give you a more useful set of logs (dsub & GCP).
I have a job (dsub version 0.3.2) that took ~6 hours and completed successfully (per the stdout log file), but failed to delocalize:
This looks like one of those transient API or credentials errors: two retries were attempted, and then the job failed outright. The job is now re-running.
Given that the job took ~6 hours, I'd have been willing to let delocalization retry for 30+ minutes rather than re-run the whole job, particularly for this type of error, which happens so often.
So, either a user-settable retry count or a job-duration-aware retry duration might be a nice feature.
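To make the second idea a bit more concrete, here's a purely hypothetical sketch of a job-duration-aware retry budget (none of this is dsub's actual code; the budget fraction and the `delocalize` callable are made up for illustration):

```python
# Purely hypothetical sketch of a job-duration-aware delocalization retry budget.
# This is not dsub's implementation; the budget fraction, minimum budget,
# and the caller-supplied delocalize() callable are illustrative placeholders.
import time

def delocalize_with_budget(delocalize, job_runtime_seconds,
                           budget_fraction=0.1, min_budget_seconds=300,
                           delay_seconds=60):
    """Retry delocalization until a time budget proportional to the job's
    runtime is exhausted (e.g. 10% of a 6-hour job is ~36 minutes)."""
    budget = max(min_budget_seconds, budget_fraction * job_runtime_seconds)
    deadline = time.monotonic() + budget
    while True:
        try:
            return delocalize()  # caller-supplied callable; raises on failure
        except Exception as exc:
            if time.monotonic() + delay_seconds > deadline:
                raise RuntimeError("delocalization retry budget exhausted") from exc
            time.sleep(delay_seconds)
```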