dterror-zz closed this 6 years ago
I mean, we could still merge this, even if it doesn't solve all problems.
Hm, external signals would be node failures and such?
We should be able to reproduce IOExceptions, the bulk should be those, right? What happens if there's a timeout?
Yeah, node failures or using too much memory. I was able to reproduce IOExceptions and they're fine.
What I don't know about is timeouts. I think they're gentle, but I couldn't reproduce one (I haven't tried too hard either).
As it turns out this is also insufficient. It depends on the failure mode:
abortTask()
which will perform the cleanup and recover.
I don't really know which category our past failures fall into; I feel like they're mostly IOExceptions. But I'm sure we had unrecoverable ones too.
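To make the two failure modes concrete, here is a minimal, self-contained sketch of how I read the distinction. It does not use Hadoop's actual classes: `Committer`, `InMemoryCommitter`, `TaskBody`, `HardKill` and `runAttempt` are all hypothetical stand-ins, and the "hard kill" is only simulated with an `Error` that bypasses the handler (in a real cluster the process would simply be gone). A gentle failure such as an IOException reaches the handler, so `abortTask()` runs and removes the temporary output. A hard failure never reaches it, so the output is left behind.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class AbortSketch {
    /** Hypothetical stand-in for a committer; NOT Hadoop's real API. */
    interface Committer {
        void abortTask(String attemptId);
    }

    /** Tracks temporary outputs per attempt in memory, for illustration only. */
    static class InMemoryCommitter implements Committer {
        final Set<String> tempOutputs = new HashSet<>();

        void setupTask(String attemptId) {
            tempOutputs.add(attemptId);
        }

        @Override
        public void abortTask(String attemptId) {
            // Cleanup: drop whatever the failed attempt wrote.
            tempOutputs.remove(attemptId);
        }
    }

    /** The work done by a task attempt; may fail with an IOException. */
    interface TaskBody {
        void run() throws IOException;
    }

    /** Simulates an external kill (node failure, memory limit). Assumption: no handler runs. */
    static class HardKill extends Error {
        HardKill() {
            super("simulated external kill");
        }
    }

    /** Runs one attempt; only in-process IOExceptions trigger abortTask(). */
    static void runAttempt(InMemoryCommitter committer, String attemptId, TaskBody body) {
        committer.setupTask(attemptId);
        try {
            body.run();
        } catch (IOException e) {
            // Gentle failure: we are still alive, so we can clean up.
            committer.abortTask(attemptId);
        }
        // A HardKill is deliberately not caught here: nothing cleans up.
    }

    public static void main(String[] args) {
        InMemoryCommitter committer = new InMemoryCommitter();

        runAttempt(committer, "attempt_io", () -> {
            throw new IOException("simulated write failure");
        });
        System.out.println("leftover after IOException: "
                + committer.tempOutputs.contains("attempt_io"));

        try {
            runAttempt(committer, "attempt_kill", () -> {
                throw new HardKill();
            });
        } catch (HardKill k) {
            // Only the simulation gets here; a real killed process would not.
        }
        System.out.println("leftover after hard kill: "
                + committer.tempOutputs.contains("attempt_kill"));
    }
}
```

In this sketch the IOException attempt leaves nothing behind, while the simulated hard kill leaves its temporary output in place, which is the case that would need cleanup from outside the task.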
I have example YARN URLs for the failures I've been able to reproduce; I can send them to you.