cloudnative-pg / cnpg-i-machinery

CloudNativePG Interface Machinery
Apache License 2.0
2 stars 3 forks source link

[Bug]: Panic handler swallows strack trace even in development mode #33

Closed ringerc closed 3 months ago

ringerc commented 4 months ago

Is there an existing issue already for this bug?

I have read the troubleshooting guide

What happened?

If a CNPG-I plugin panics in a gRPC call handler, the plugin helper's registered panic handler intercepts the panic and logs it - without the stack trace.

Because the handler doesn't propagate the logger context (see https://github.com/cloudnative-pg/cnpg-i-machinery/pull/31) a custom logger configuration cannot be used to work around this either.

Cluster resource

No response

Relevant log output

{"level":"info","ts":1719969643.970347,"caller":"pluginhelper/server.go:195","msg":"Panic occurred","error":"runtime error: invalid memory address or nil pointer dereference"}

Code of Conduct

armru commented 3 months ago

Hello @ringerc, is this still relevant? Any proposed solutions?

ringerc commented 3 months ago

Yes, it's still relevant @armru

I wasn't able to quickly figure out what is going wrong here. Even after I fixed the various logging issues in https://github.com/cloudnative-pg/cnpg-i-machinery/pull/31, https://github.com/cloudnative-pg/cnpg-i-machinery/pull/12 and https://github.com/cloudnative-pg/cnpg-i-hello-world/pull/7 I didn't manage to get the zap logger to emit a stack trace from the panic handler.

The logging configuration should emit stack traces when the development flag is passed to the plugin.

And the logging in the panic handler should preferably also respect the caller-info of the original panic site, so the logged event shows where the original panic happened, not the code location of the panic handler.

An error like this:

{
  "level": "info",
  "ts": 1719189460.064116,
  "caller": "pluginhelper/server.go:190",
  "msg": "Panic occurred",
  "error": "runtime error: invalid memory address or nil pointer dereference"
}

tells me nothing about where the failure occurred, so it would be nigh-impossible to debug if it occurred in production.

Does the panic helper serve any purpose - or can it simply be removed?

armru commented 3 months ago

this is being tricky to solve haven't come up with a good solution so far