[Bug]: Panic handler swallows strack trace even in development mode

ringerc commented 4 months ago

Is there an existing issue already for this bug?

[X] I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

[X] I have read the troubleshooting guide and I think this is a new bug.

What happened?

If a CNPG-I plugin panics in a gRPC call handler, the plugin helper's registered panic handler intercepts the panic and logs it - without the stack trace.

Because the handler doesn't propagate the logger context (see https://github.com/cloudnative-pg/cnpg-i-machinery/pull/31) a custom logger configuration cannot be used to work around this either.

Cluster resource

No response

Relevant log output

{"level":"info","ts":1719969643.970347,"caller":"pluginhelper/server.go:195","msg":"Panic occurred","error":"runtime error: invalid memory address or nil pointer dereference"}

Code of Conduct

[X] I agree to follow this project's Code of Conduct

armru commented 3 months ago

Hello @ringerc, is this still relevant? Any proposed solutions?

ringerc commented 3 months ago

Yes, it's still relevant @armru

I wasn't able to quickly figure out what is going wrong here. Even after I fixed the various logging issues in https://github.com/cloudnative-pg/cnpg-i-machinery/pull/31, https://github.com/cloudnative-pg/cnpg-i-machinery/pull/12 and https://github.com/cloudnative-pg/cnpg-i-hello-world/pull/7 I didn't manage to get the zap logger to emit a stack trace from the panic handler.

The logging configuration should emit stack traces when the development flag is passed to the plugin.

And the logging in the panic handler should preferably also respect the caller-info of the original panic site, so the logged event shows where the original panic happened, not the code location of the panic handler.

An error like this:

{
  "level": "info",
  "ts": 1719189460.064116,
  "caller": "pluginhelper/server.go:190",
  "msg": "Panic occurred",
  "error": "runtime error: invalid memory address or nil pointer dereference"
}

tells me nothing about where the failure occurred, so it would be nigh-impossible to debug if it occurred in production.

Does the panic helper serve any purpose - or can it simply be removed?

armru commented 3 months ago

this is being tricky to solve haven't come up with a good solution so far

cloudnative-pg / cnpg-i-machinery