druid-io / druid-operator

Druid Kubernetes Operator

Adding metrics for Kubernetes-related errors #244

Open cintoSunny opened 2 years ago

cintoSunny commented 2 years ago

In the current Druid operator, some Kubernetes-related failures are not tracked as errors in metrics. I am planning to add metrics for those errors as well. Below is the approach I am taking; I wanted to check with you before making the changes. Let me know what you think.

One of the errors I see is raised from interface.go:

func (e EmitEventFuncs) EmitEventOnPatch(obj, patchObj object, err error) {
    if err != nil {
        errMsg := fmt.Errorf("Error patching object [%s:%s] in namespace [%s] due to [%s]", patchObj.GetName(), patchObj.GetObjectKind().GroupVersionKind().Kind, patchObj.GetNamespace(), err.Error())
        e.Event(obj, v1.EventTypeWarning, string(druidNodePatchFail), errMsg.Error())
...

I am planning to add a separate package, metrics, for monitoring. Here is what I have so far:

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
    // DruidPlatformError counts errors raised by the platform (e.g. Kubernetes).
    // It is created with prometheus.NewCounter rather than promauto.NewCounter:
    // promauto would also register it with the default Prometheus registry,
    // in addition to the controller-runtime registry used in init below.
    DruidPlatformError = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "druid_platform_error",
        Help: "Total number of errors raised by the platform (e.g. Kubernetes)",
    })
)

func init() {
    // Register custom metrics with the controller-runtime metrics registry
    metrics.Registry.MustRegister(
        DruidPlatformError)
}

I then added a call to metrics.DruidPlatformError.Inc() in the error branch in interface.go. Let me know what you think.

Thanks again for all the help.

AdheipSingh commented 2 years ago

@cintoSunny this LGTM. This metric would be helpful, and it would be great if we could draft a proposal covering all the monitoring metrics and then start implementing them.

Thanks!

cc @himanshug @nishantmonu51

cintoSunny commented 2 years ago

Sure, will do. I am writing up the list of metrics offline and will share it when it is done. I am still going through the code, so it will take some time.