Altinity / clickhouse-operator

Altinity Kubernetes Operator for ClickHouse creates, configures and manages ClickHouse clusters running on Kubernetes
https://altinity.com
Apache License 2.0
1.86k stars 457 forks source link

Nil pointer dereference in metrics exporter #1187

Closed zcross closed 1 year ago

zcross commented 1 year ago

I recently observed this on 0.21.2 while bootstrapping a brand new installation. FWIW, it was "self healing" after a few minutes: ongoing reconciliation operations seemed to bring the installation(s) into a state that caused the status to no longer be nil.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x198 pc=0x1370d8e] goroutine 1 [running]:
github.com/altinity/clickhouse-operator/pkg/apis/metrics.(*Exporter).DiscoveryWatchedCHIs(0x16c6661?, {0x18f6338, 0xc00002e6e0}, 0xc0000317f0) /clickhouse-operator/pkg/apis/metrics/exporter.go:295 +0x3ee
github.com/altinity/clickhouse-operator/cmd/metrics_exporter/app.Run() /clickhouse-operator/cmd/metrics_exporter/app/metrics_exporter.go:108 +0x36b
main.main() /clickhouse-operator/cmd/metrics_exporter/main.go:23 +0x17

https://github.com/Altinity/clickhouse-operator/blob/32ef0fadc3d07884845a1ee3d91be6b6e112ac62/pkg/apis/metrics/exporter.go#L295

My guess is that we can fix this by changing the access of chi.Status.NormalizedCHICompleted to chi.EnsureStatus().GetNormalizedCHICompleted(), or something like that. I can't do it right this moment, but I'll circle back to this issue when I get some bandwidth.

To generalize: maybe we want to use Golang visibility to make raw/unsynchronized fields on Status package-internal to more strongly encourage usage of safe public getter/setter APIs. I considered this in #1119 but IIRC, this resulted in problems when trying to deserialize/serialize the state of Status (because it relied on field visibility)... I think?

zcross commented 1 year ago

Let's close this if the build with #1188 passes? I'm not sure how deterministic it is to repro, but one thing worth trying is just bringing up a fresh installation and seeing if that reproduces on 0.21.2 – then trying again with 0.21.3

Edit: #1189 fixes this w/ compiling code :)

alex-zaitsev commented 1 year ago

@zcross , this is fixed in 0.21.3

zcross commented 1 year ago

Thank you!

zcross commented 1 year ago

@zcross , this is fixed in 0.21.3

@alex-zaitsev Would it be possible to cut a release tag for 0.21.3 to distribute this fix?