census-ecosystem / opencensus-go-exporter-stackdriver

OpenCensus Go exporter for Stackdriver Monitoring and Trace
Apache License 2.0

Stackdriver fails to export metrics after living long enough #263

Closed: ascherkus closed this issue 4 years ago

ascherkus commented 4 years ago

Please answer these questions before submitting a bug report.

What version of the Exporter are you using?

contrib.go.opencensus.io/exporter/stackdriver v0.13.1

What version of OpenCensus are you using?

go.opencensus.io/plugin/ocgrpc v0.22.3

What version of Go are you using?

go version go1.12.13 darwin/amd64

What did you do?

As far as I can tell, I'm using the stock ocgrpc plugin and Stackdriver export features across the board.

import (
  "log"

  "contrib.go.opencensus.io/exporter/stackdriver"
  "go.opencensus.io/plugin/ocgrpc"
  "go.opencensus.io/stats/view"
  "go.opencensus.io/trace"
  "google.golang.org/grpc"
)

// Attach the ocgrpc stats handler and register its default server views.
server := grpc.NewServer(grpc.StatsHandler(&ocgrpc.ServerHandler{}))
if err := view.Register(ocgrpc.DefaultServerViews...); err != nil {
  log.Fatalf("failed to register ocgrpc server views: %v", err)
}

// Export both views (metrics) and traces through the same exporter.
sd, err := stackdriver.NewExporter(stackdriver.Options{})
if err != nil {
  log.Printf("failed to create stackdriver exporter: %v", err)
} else {
  defer sd.Flush()
  trace.RegisterExporter(sd)
  view.RegisterExporter(sd)
}

// Start server...

Let server live for a long time and drive some amount of traffic to it.

What did you expect to see?

Exporting to consistently work without errors.

What did you see instead?

After receiving constant traffic, it seems like there are eventually enough inconsistencies in the data that all exporting fails: stackdriver.go:464: Failed to export to Stackdriver: rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: timeSeries[2,3,6,7]; Points must be written in order. One or more of the points specified had an older start time than the most recent point.: timeSeries[0,1,4,5]

Looking at Google Cloud Monitoring, it's clear that my gRPC-related metrics eventually all stop reporting data.

Additional context

Might be the same root cause as #52 and #70, but my error messages seem more specific than simply "an internal error occurred".

james-bebbington commented 4 years ago

Apologies for taking so long to respond to this.

I believe the first error means you are sending multiple metrics with the exact same label keys to Cloud Monitoring within a certain period of time (10s). Are you running multiple instances of your application under the same "resource" that could produce the same set of metrics?
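(For reference: the 10s window mentioned above corresponds to Cloud Monitoring's minimum sampling period for custom metrics. A minimal sketch of throttling writes via the exporter's ReportingInterval option follows; the 60-second value is an illustrative choice, not from this thread, and a longer interval alone does not help if two instances still write identical time series.)

import (
  "time"

  "contrib.go.opencensus.io/exporter/stackdriver"
)

// Sketch: report at most once per minute, comfortably above Cloud
// Monitoring's ~10s minimum sampling period. This keeps a single
// instance from writing points too frequently; it does NOT resolve
// two instances colliding on the same set of labels.
sd, err := stackdriver.NewExporter(stackdriver.Options{
  ReportingInterval: 60 * time.Second,
})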

ascherkus commented 4 years ago

No worries! I think you tipped me off in the right direction... I run my applications as a Go binary inside a distroless Docker image on Google Cloud Run, so I would have thought the labels would differ based on the Cloud Run instance that spins up my applications.

However, based on the documentation for DefaultMonitoringLabels, it defaults to a single label with key "opencensus_task" and value "go-<pid>@<hostname>".

Sure enough, in my metrics the resulting label is go-1@localhost, since (I believe?) distroless boots my applications as PID 1. I'll see if I can key this off some Cloud Run metadata (or similar) instead, as in the sketch below, and see if that helps.
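(A rough sketch of keying the task label off Cloud Run metadata, assuming the GCE-style metadata server that Cloud Run exposes; the instanceID helper name is made up for illustration, while stackdriver.Labels and its Set method are the exporter's real API.)

import (
  "io/ioutil"
  "log"
  "net/http"

  "contrib.go.opencensus.io/exporter/stackdriver"
)

// instanceID fetches a unique identifier for this container instance
// from the metadata server (hypothetical helper).
func instanceID() (string, error) {
  req, err := http.NewRequest("GET",
    "http://metadata.google.internal/computeMetadata/v1/instance/id", nil)
  if err != nil {
    return "", err
  }
  req.Header.Set("Metadata-Flavor", "Google")
  resp, err := http.DefaultClient.Do(req)
  if err != nil {
    return "", err
  }
  defer resp.Body.Close()
  id, err := ioutil.ReadAll(resp.Body)
  if err != nil {
    return "", err
  }
  return string(id), nil
}

// Replace the default "go-<pid>@<hostname>" task value with one that
// is unique per Cloud Run instance.
id, err := instanceID()
if err != nil {
  log.Fatalf("failed to read instance ID: %v", err)
}
labels := &stackdriver.Labels{}
labels.Set("opencensus_task", "go-"+id, "Per-instance task identifier")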

ascherkus commented 4 years ago

Yeah, setting MonitoredResource combined with DefaultMonitoringLabels fixes things, so the bug was on my end. Thanks for the pointer!
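(For anyone landing here later, a sketch of what the combined fix might look like; monitoredresource.Autodetect is the exporter's own helper for detecting the current environment, and id is the hypothetical per-instance value from the earlier sketch.)

import (
  "log"

  "contrib.go.opencensus.io/exporter/stackdriver"
  "contrib.go.opencensus.io/exporter/stackdriver/monitoredresource"
)

// A unique per-instance "opencensus_task" label plus an explicit
// monitored resource keeps concurrent instances from writing to the
// same time series.
labels := &stackdriver.Labels{}
labels.Set("opencensus_task", "go-"+id, "Per-instance task identifier")

sd, err := stackdriver.NewExporter(stackdriver.Options{
  MonitoredResource:       monitoredresource.Autodetect(),
  DefaultMonitoringLabels: labels,
})
if err != nil {
  log.Printf("failed to create stackdriver exporter: %v", err)
}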