furiko-io / furiko

Kubernetes cron and batch job platform
https://furiko.io
Apache License 2.0

Improve panic recovery handling #5

Open irvinlim opened 2 years ago

irvinlim commented 2 years ago

The code currently lacks recovery routines in places where it could panic (e.g. nil pointer dereferences). Since we start many goroutines at different points, we need to find a robust way to ensure that we never forget to handle panic recovery in any of them.

Some areas which require panic recovery:

joaokorcz commented 7 months ago

Hey @irvinlim, how are you? I'd like to contribute to solving this issue. Is anyone else already working on it? I saw #60, and I guess the problem there is close to the one reported here.

irvinlim commented 6 months ago

Hey @joaokorcz! Apologies for the late reply as I was on vacation.

This is more of a "blanket" issue that tries to cover panic scenarios. I think there are a few problems we want to address:

  1. A panic in any goroutine may crash the whole controller (a deferred recover in the main goroutine does not catch panics raised in other goroutines), which degrades availability
  2. There isn't a good way for developers to detect such crashes, so we would have to rely on users to report them
  3. In certain scenarios, a panic might be the only way out of a bad situation (e.g. corrupt state that we can't recover from)
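
To make the three points above concrete, here is a minimal sketch of one possible shape for a goroutine wrapper. It is not how Furiko currently handles panics. The names `SafeGo`, `PanicHandler` and `FatalError` are hypothetical and exist only for illustration. The wrapper recovers inside each goroutine (point 1), passes the recovered value and stack trace to a reporting hook (point 2), and re-panics when the value is marked as fatal (point 3).

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// PanicHandler is a hypothetical reporting hook (e.g. logs or metrics).
// It is an assumption for this sketch, not part of Furiko's API.
type PanicHandler func(recovered interface{}, stack []byte)

// FatalError is a hypothetical marker for panics that must not be
// swallowed, such as corrupt state that cannot be recovered from.
type FatalError struct{ Reason string }

func (e FatalError) Error() string { return "fatal: " + e.Reason }

// SafeGo runs fn in a new goroutine with a deferred recover, because a
// recover in the caller's goroutine cannot catch a panic raised here.
// The returned channel is closed when the goroutine exits.
func SafeGo(onPanic PanicHandler, fn func()) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		defer func() {
			r := recover()
			if r == nil {
				return // fn returned normally
			}
			if _, fatal := r.(FatalError); fatal {
				panic(r) // crash on purpose: continuing would be unsafe
			}
			onPanic(r, debug.Stack())
		}()
		fn()
	}()
	return done
}

type job struct{ name string }

func main() {
	report := func(r interface{}, stack []byte) {
		fmt.Printf("recovered panic: %v (%d bytes of stack trace)\n", r, len(stack))
	}

	// This worker dereferences a nil pointer; the panic is recovered and
	// reported instead of taking down the whole process.
	<-SafeGo(report, func() {
		var j *job
		fmt.Println(j.name)
	})

	// A worker that does not panic is unaffected by the wrapper.
	<-SafeGo(report, func() { fmt.Println("worker finished normally") })
}
```

A wrapper like this only helps where it is actually used, so the "do not forget" part of the problem remains: every `go` statement would have to go through the helper, for example enforced by a lint rule or code review.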

I appreciate the enthusiasm to contribute! However, I don't think this issue is scoped well enough for me to point you to specific places where you could make immediate fixes. I'll remove the "good first issue" tag, which I believe was applied improperly.

I'll mark some other issues as "good first issue" in a bit, so if you are still interested in contributing, do check those out!