Shopify / camus

Kafka->HDFS pipeline from LinkedIn. It is a MapReduce job that does distributed data loads out of Kafka.

Rollbacks on failed tasks #148

Closed dterror-zz closed 6 years ago

dterror-zz commented 6 years ago

As it turns out this is also insufficient. It depends on the failure mode:

I don't really know which category our past failures fall into; I feel like they're mostly IOExceptions. But I'm sure we had unrecoverable ones too.

I have example YARN URLs for the failures I've been able to reproduce; I can send them to you.

dterror-zz commented 6 years ago

I mean, we could still merge this, even if it doesn't solve all problems.

olessia commented 6 years ago

Hm, external signals would be node failures and such?

We should be able to reproduce IOExceptions; the bulk should be those, right? What happens if there's a timeout?

dterror-zz commented 6 years ago

Yeah, node failures or using too much memory. I was able to reproduce IOExceptions and they're fine.

What I don't know is how timeouts behave. I think they're gentle, but I couldn't reproduce one (I haven't tried too hard either).
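
To make the distinction between the failure modes above concrete, here is a minimal sketch. It is not Camus code: `RollbackSketch`, `runTask`, and `runWithRollback` are hypothetical names, and the "task" just writes a local temp file. It only illustrates why an in-process IOException can be rolled back while an external kill or node failure cannot, assuming the rollback lives in the task's own exception handling.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class RollbackSketch {

    /** Hypothetical task body: writes partial output, then optionally fails. */
    public static void runTask(Path output, boolean failWithIo) throws IOException {
        Files.writeString(output, "partial data");
        if (failWithIo) {
            throw new IOException("simulated write failure");
        }
    }

    /**
     * Runs the task and deletes its output if it throws an IOException.
     * Returns true on success, false if the task failed and was rolled back.
     */
    public static boolean runWithRollback(Path output, boolean failWithIo) throws IOException {
        try {
            runTask(output, failWithIo);
            return true;
        } catch (IOException e) {
            // In-process failure: the JVM is still alive, so cleanup can run here.
            Files.deleteIfExists(output);
            return false;
        }
        // If the process is hard-killed from outside (SIGKILL) or the node dies,
        // the catch block above never runs and the partial output is left behind.
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rollback-sketch");
        Path ok = dir.resolve("ok.out");
        Path bad = dir.resolve("bad.out");
        System.out.println(runWithRollback(ok, false) + " " + Files.exists(ok));
        System.out.println(runWithRollback(bad, true) + " " + Files.exists(bad));
    }
}
```

Covering the external cases would need cleanup that runs outside the failed task, for example when the task is retried or when the job finishes; that part is not sketched here.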