Open nkemnitz opened 1 year ago
The downside of treating all errors as transient is that real errors will get swallowed, and the user will have to dig into the k8s logs to figure out why the tasks are not getting completed.
What is driving you toward this feature more -- the fact that the tasks get bloated with more transient conditions, or that it's annoying to update the transient condition list when encountering them?
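One possible middle ground for the trade-off raised above, as a rough sketch only: keep the transient error list on the worker side rather than in each task, cap the number of retries, and log every retry so nothing is silently swallowed. The names `TRANSIENT_ERRORS` and `run_with_retries` are hypothetical and not part of the existing codebase; the exception types listed are just examples.

```python
import logging
import ssl

logger = logging.getLogger("worker")

# Hypothetical worker-side registry of exception types treated as transient.
# Keeping it here (not in the task payload) is an assumption of this sketch.
TRANSIENT_ERRORS = (ssl.SSLError, TimeoutError, ConnectionError)


def run_with_retries(task, max_attempts=3, transient=TRANSIENT_ERRORS):
    """Call task(); retry listed transient errors up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except transient as exc:
            # Log each transient failure so it stays visible to the user.
            logger.warning(
                "transient error on attempt %d/%d: %r", attempt, max_attempts, exc
            )
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the error
        # Any exception not listed as transient propagates immediately.
```

With this shape, an unlisted exception fails the task on the first attempt, while a listed one is retried a bounded number of times before being re-raised.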
I haven't measured how much bigger the tasks are with the transient error list, or whether it impacts task retrieval. For now, the main issue for me is definitely the second part. It is also often unclear to me whether the problem is on my end or not:
Over the weekend I copied a cutout from a public dataset, and the job almost immediately crashed with a JPEG decompression error. I thought maybe I had messed up the bounding box, or that the source dataset was corrupt, but I couldn't find anything wrong. Eventually I re-ran the same code without any changes and it went through just fine. It must have been an incomplete file transfer that somehow made it through all the other checks in boto and cloudvolume.
A long-running CPU job kept failing with SSL errors (timeouts?) every 30 minutes or so. The transient error logic we have right now works, but because the transient conditions are part of the task payload itself, the tasks are getting more and more bloated. And we will probably keep running into more of these temporary errors over time: