carbocation opened 5 years ago
Thanks for reporting, @carbocation.
We could add error 99 to the list of TRANSIENT_SOCKET_ERROR_CODES that are retried, although I'm at a loss to come up with a scenario in which this error would be generated transiently.
By any chance, does the time of the error (2019-05-06 02:17:10) coincide with the job finishing or anything else you can point to?
Thanks!
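For illustration, the change proposed above (treating error 99 as retryable) could look roughly like the sketch below. This is not dsub's actual code: only the name TRANSIENT_SOCKET_ERROR_CODES comes from this thread. The other entries in the set and the helpers `is_transient_socket_error` and `call_with_retries` are hypothetical. On Linux, errno 99 is EADDRNOTAVAIL ("Cannot assign requested address").

```python
import time

# Hypothetical stand-in for dsub's list; the real contents may differ.
# Numeric values are Linux errno codes.
TRANSIENT_SOCKET_ERROR_CODES = {
    32,   # EPIPE (assumed entry, for illustration only)
    104,  # ECONNRESET (assumed entry, for illustration only)
    99,   # EADDRNOTAVAIL, the code proposed in this thread
}


def is_transient_socket_error(exc):
    """Return True if exc is an OSError whose errno is in the retry list."""
    return isinstance(exc, OSError) and exc.errno in TRANSIENT_SOCKET_ERROR_CODES


def call_with_retries(func, max_attempts=3, delay_seconds=1.0):
    """Call func(), retrying with exponential backoff on transient socket errors.

    Errors outside the retry list, and the error from the final attempt,
    are re-raised to the caller unchanged.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except OSError as exc:
            if not is_transient_socket_error(exc) or attempt == max_attempts:
                raise
            # Back off: delay, 2x delay, 4x delay, ...
            time.sleep(delay_seconds * 2 ** (attempt - 1))
```

With this shape, an API call that raises `OSError(99, ...)` once or twice would be retried rather than surfacing as a failure, while any error code outside the set still propagates immediately.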
The final jobs were launched at 2019-05-06 02:00:00 according to the pipelines tab, so 2:17 is pretty close to the right time to finish. These jobs each take < 60 seconds in the typical case. (I wasn't actively monitoring at that time.)
I actually now see that four of the jobs launched between 01:59:00 and 02:00:00 failed. (I was wrong in my initial review, when I thought that all jobs succeeded.)
Thanks to the tagging and logging that dsub does, I am able to look for the logs for these specific jobs - interestingly, none of those four jobs has any logfile whatsoever. I don't know if that absence of log information is of any use. I can also tell you that there were no VMs left up and running by the time of this crash.
I ran a set of 50,000 jobs yesterday and they all appear to have completed successfully. However, the dsub tool (which I ran from the cloud shell) yielded the following error at the end. Again, this doesn't seem to have impacted the run, but I haven't seen this error reported so I wanted to mention it.
dsub is 0.3.1: