census-instrumentation / opencensus-service

OpenCensus service allows OpenCensus libraries to export to an exporter service rather than having to link vendor-specific exporters.
Apache License 2.0

Flaky connection while directly exporting to collector from ocagent exporter #582

Open asutoshpalai opened 5 years ago

asutoshpalai commented 5 years ago

When using the Agent exporter with OpenCensus to export directly to the Collector, instead of exporting to the Agent and then to the Collector, the connection keeps resetting.

Is this the intended behavior? If so, is there a way/config to maintain a stable connection?

As per the blog and design doc, it looks like both the Agent and Collector are optional and we should be able to export directly to the collector.

The bug reproduction

I modified example/main.go to enable debug logs from gRPC as follows:

diff --git a/example/main.go b/example/main.go
index 5fa9f5f..e7932df 100644
--- a/example/main.go
+++ b/example/main.go
@@ -24,13 +24,17 @@ import (
    "time"

    "contrib.go.opencensus.io/exporter/ocagent"
+   "github.com/sirupsen/logrus"
    "go.opencensus.io/stats"
    "go.opencensus.io/stats/view"
    "go.opencensus.io/tag"
    "go.opencensus.io/trace"
+   "google.golang.org/grpc/grpclog"
 )

 func main() {
+   logrus.SetLevel(logrus.DebugLevel)
+   grpclog.SetLogger(logrus.New())
    oce, err := ocagent.NewExporter(
        ocagent.WithInsecure(),
        ocagent.WithServiceName(fmt.Sprintf("example-go-%d", os.Getpid())))
@@ -119,5 +123,6 @@ func main() {
        }
        stats.Record(ctx, mLatencyMs.M(latencyMs))
        fmt.Printf("Latency: %.3fms\n", latencyMs)
+       oce.Flush()
    }
 }

My Agent config:

receivers:
  opencensus:
    address: ":55678"

exporters:
  opencensus:
    endpoint: "localhost:55680"

zpages:
  port: 8884

My Collector config:

log-level: DEBUG
receivers:
  opencensus:
    port: 55680

queued-exporters:
  jaeger-all-in-one:
    num-workers: 4
    queue-size: 100
    retry-on-failure: true
    sender-type: jaeger-thrift-http
    jaeger-thrift-http:
      collector-endpoint: http://localhost:14268/api/traces
      timeout: 5s

zpages:
  port: 8889

When all three are run, we get

INFO[0000] pickfirstBalancer: HandleSubConnStateChange: 0xc000020290, CONNECTING 
INFO[0000] pickfirstBalancer: HandleSubConnStateChange: 0xc000020290, READY 

only once in the logs of example/main.go.

But if we don't run the Agent and export directly to the Collector (by changing the port in the Collector's config), we get the above logs multiple times.
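To put a number on "multiple times", the READY transitions in a captured log can be counted. The following is a minimal sketch using only the Go standard library. It assumes the logrus text format shown above, and the helper name countReady is hypothetical, not part of any OpenCensus or gRPC package.

```go
package main

import (
	"bufio"
	"strings"
)

// countReady reports how many log lines record a gRPC sub-connection
// reaching the READY state. It assumes lines shaped like:
//   INFO[0000] pickfirstBalancer: HandleSubConnStateChange: 0xc000020290, READY
// A count above 1 for a single run suggests the connection was
// re-established at least once.
func countReady(logText string) int {
	n := 0
	sc := bufio.NewScanner(strings.NewReader(logText))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.Contains(line, "HandleSubConnStateChange") &&
			strings.HasSuffix(line, "READY") {
			n++
		}
	}
	return n
}
```

Calling countReady on the captured output of example/main.go should give 1 for a stable connection, and a larger value when the connection keeps resetting.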

pjanotti commented 5 years ago

Thanks for reporting @asutoshpalai. This is due to the collector not implementing the metrics endpoint: the example periodically tries to send metric data, and that resets the connection. The agent, on the other hand, implements the metrics endpoint, so the reset doesn't happen. What happens if you remove the metrics from the example and go straight to the collector? Is that an option for you? That said, it is a bug anyway.

asutoshpalai commented 5 years ago

Thanks @pjanotti, that's correct! When I didn't register the exporter with view, everything worked fine. That's good enough for me, but I will leave this issue open in case you want to fix this in the future.

pjanotti commented 5 years ago

Thanks for confirming @asutoshpalai - yes, this is a bug that needs to be fixed. Leaving the issue open.