etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0
46.74k stars 9.63k forks source link

Gofailpoints should simulate proper process crash instead of panic #18142

Open serathius opened 2 weeks ago

serathius commented 2 weeks ago

What would you like to be added?

Using panic in failpoints results in process exiting, however the exit is handled through Golang runtime. It would be preferred to do a proper hard crash. This should be achievable by calling os.Exist(1).

cc @MadhavJivrajani @fuweid @henrybear327

Why is this needed?

More realistic failpoints.

henrybear327 commented 2 weeks ago

/assign @henrybear327

Hey @serathius, if I understand correctly, we would like to add a new failpoint type that leverages os.Exit(1) instead of panic(), is it? :)

serathius commented 2 weeks ago

Yes, would be also good to confirm how big is the difference between them. Not a Go expert but maybe there is explanation in Go documentation how Go runtime handles panic. It should help us confirm if we should have both options.

ALso, to not lose functionality from panic, before we call os.Exit(1) we would dump stacktrace to stderr.

henrybear327 commented 2 weeks ago

Yes, would be also good to confirm how big is the difference between them. Not a Go expert but maybe there is explanation in Go documentation how Go runtime handles panic. It should help us confirm if we should have both options.

ALso, to not lose functionality from panic, before we call os.Exit(1) we would dump stacktrace to stderr.

From my initial research, panic() will stop the ordinary flow of control, execute the functions defined with defer, and return to its caller. It's recoverable with recover() within the defer function if you want to handle it there.

On the other hand, os.Exit(1) will terminate the program, without executing the defer functions, with no chance to recover, it's a hard stop of the program.

References:

henrybear327 commented 2 weeks ago

For simulating a failing node, os.Exit(1) would be a more realistic approach as it directly terminates the program. A quick code search (if I am not mistaken) shows that we are not handling the panic() with recover() within the codebase at the top level, so the effect of terminating a program seems to be the same to me so far (maybe there might be other shutdown mechanisms that I am not aware of so my conclusion might be wrong). But panic() will contain more debugging traces for future analysis though.