census-ecosystem / opencensus-go-exporter-stackdriver

OpenCensus Go exporter for Stackdriver Monitoring and Trace

Unexpectedly large custom metric ingest volume generated #270

Closed ascherkus closed 4 years ago

ascherkus commented 4 years ago

Not sure if this is a bug here, a bug in census-instrumentation/opencensus-go, or working as intended (apologies in advance!) -- but I could use some sanity checking. I couldn't find a mailing list either, hence filing the issue.

What version of the Exporter are you using?

v0.13.1

What version of OpenCensus are you using?

v0.22.3

What version of Go are you using?

go1.12.7

What did you do?

During local development of a gRPC server I noticed that I quickly accumulated Stackdriver Monitoring ingest volume (and cost).

What did you expect to see?

That local development of a single server wouldn't generate any meaningful Stackdriver Monitoring ingest volume.

What did you see instead?

Larger than expected (but potentially correct) ingest volume.

Additional context

I've got a very simple gRPC server that only has ocgrpc.DefaultServerViews registered, which can be found here: https://github.com/census-instrumentation/opencensus-go/blob/master/plugin/ocgrpc/server_metrics.go

I've registered the Stackdriver Exporter as follows:

sd, err := stackdriver.NewExporter(stackdriver.Options{
    MonitoredResource:       monitoredresource.Autodetect(),
    DefaultMonitoringLabels: labels,  // contains a single label
})

As far as I can tell, monitoredresource.Autodetect() results in three additional labels (project_id, instance_id, and zone).
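
For completeness, the surrounding setup is roughly the following sketch (the label key/value and the helper name are placeholders rather than my real ones, error handling is simplified, and the views/exporter are registered in the usual opencensus-go way):

package main

import (
    "log"

    "contrib.go.opencensus.io/exporter/stackdriver"
    "contrib.go.opencensus.io/exporter/stackdriver/monitoredresource"
    "go.opencensus.io/plugin/ocgrpc"
    "go.opencensus.io/stats/view"
)

func setupMonitoring() *stackdriver.Exporter {
    // Placeholder label; the real setup attaches a single label this way.
    labels := &stackdriver.Labels{}
    labels.Set("service", "my-grpc-server", "example label, not the real one")

    sd, err := stackdriver.NewExporter(stackdriver.Options{
        MonitoredResource:       monitoredresource.Autodetect(),
        DefaultMonitoringLabels: labels,
    })
    if err != nil {
        log.Fatalf("failed to create the Stackdriver exporter: %v", err)
    }

    // The four default ocgrpc server views, plus the exporter registration.
    if err := view.Register(ocgrpc.DefaultServerViews...); err != nil {
        log.Fatalf("failed to register ocgrpc server views: %v", err)
    }
    view.RegisterExporter(sd)
    return sd
}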

So I've got the 4 default ocgrpc views and 4 labels.

If my logs are correct, I had a few days where ~100 RPCs generated ~30MB of ingest volume daily, which works out to roughly 307kB per RPC (and I was billed accordingly). In a production environment this would quickly become prohibitively expensive.

The pricing docs say "the ingestion volume for a scalar data type is 8 bytes and for a distribution data type is 80 bytes" [1], so I'm a bit surprised that such a low volume of RPCs can inflate into such a large ingest volume.

Considering I'm using the "defaults" for both ocgrpc and Stackdriver Exporter -- is this really expected behaviour or is there something I'm doing wrong here and/or haven't considered? Any help or pointers would be appreciated!

[1] https://cloud.google.com/stackdriver/pricing#monitoring-costs

james-bebbington commented 4 years ago

Note it's not expected that total ingested bytes is proportional to the number of RPCs. The cost is instead proportional to the duration of monitoring, the cardinality of the labels, the number of metrics (views), and the configured reporting interval. This makes metrics a very cheap option when you are generating a lot of data, but not necessarily that cheap when you are generating a very small amount of data, as you will pay nearly the same cost regardless.

Under the following scenario:

  1. There are 20 combinations of label values for each metric (these are distinct timeseries, and each reports a distinct data point on each export)
  2. All label combinations were observed in the first 1 hour of the traffic
  3. There are 4 metrics (3 counters and 1 distribution) in the default gRPC server views, i.e. 3*8 + 80 = 104 bytes per timeseries per export
  4. Run for 1 day with the default reporting interval of 1 minute

I would expect the ingest amount to be: 20 label combinations * (3 counters * 8 bytes + 1 distribution * 80 bytes) * 60 minutes * 24 hours = ~3MB

That's still significantly less than what you have observed. Note that some of the OpenCensus examples set a ReportingInterval of 10 seconds, which would increase this value by a factor of 6 and is close to what you observed (although still slightly less).
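
As a rough sketch of that arithmetic with the reporting interval as a parameter (the figures are just the assumptions from the scenario above, not measurements):

package main

import (
    "fmt"
    "time"
)

// estimateDailyIngestBytes is a back-of-envelope estimate only: it assumes
// every timeseries (label combination) exports one point per metric per
// reporting interval, at 8 bytes per scalar point and 80 bytes per
// distribution point.
func estimateDailyIngestBytes(labelCombos, scalarViews, distViews int, interval time.Duration) int {
    bytesPerExport := labelCombos * (scalarViews*8 + distViews*80)
    exportsPerDay := int((24 * time.Hour) / interval)
    return bytesPerExport * exportsPerDay
}

func main() {
    // Scenario above: 20 label combinations, 3 scalar metrics + 1 distribution.
    fmt.Println(estimateDailyIngestBytes(20, 3, 1, time.Minute))    // 2995200 bytes, ~3MB/day
    fmt.Println(estimateDailyIngestBytes(20, 3, 1, 10*time.Second)) // 17971200 bytes, ~18MB/day
}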

If you want someone to look into your account, you can take a look at the options available at https://cloud.google.com/stackdriver/docs/support

ascherkus commented 4 years ago

Thank you for the detailed reply! I pored over all the docs, but this is a really helpful, concise summary of how ingested bytes are computed.

While I understood the bit about cardinality, I totally missed the part about how reporting interval impacts bytes ingested. My mental model was that if (say) zero RPCs occurred, then no data would be generated, and hence there would be nothing to report. If I understand you correctly you're saying that data is still reported regardless (I haven't dug through the source but let me know if that's about right).

Is there a downside to having a much higher reporting interval? (e.g., 5 minutes?)

FWIW, as an experiment I enabled only the ServerCompletedRPCsView count metric and the ingested bytes dropped significantly (to 100s of KB/day for our production traffic). I suspect the distribution metrics under low traffic were the likely culprit. Anyway, this is low enough for my own purposes for now, so I don't think there's a bug here.
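
Roughly, the experiment was just swapping the view registration for the single count view; a sketch (the helper name is only for illustration):

import (
    "log"

    "go.opencensus.io/plugin/ocgrpc"
    "go.opencensus.io/stats/view"
)

func registerCountViewOnly() {
    // Register only the completed-RPCs count view instead of all four
    // default server views; the other default server views are skipped.
    if err := view.Register(ocgrpc.ServerCompletedRPCsView); err != nil {
        log.Fatalf("failed to register view: %v", err)
    }
}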

james-bebbington commented 4 years ago

While I understood the bit about cardinality, I totally missed the part about how reporting interval impacts bytes ingested. My mental model was that if (say) zero RPCs occurred, then no data would be generated, and hence there would be nothing to report. If I understand you correctly you're saying that data is still reported regardless (I haven't dug through the source but let me know if that's about right).

Yes. At each collection interval, all metrics export their current aggregated values (for each timeseries, i.e. each combination of label values, and for each distribution bucket) regardless of whether there was any change since the last collection interval.

Is there a downside to having a much higher reporting interval? (e.g., 5 minutes?)

Just that you will only be able to see data at a 5-minute granularity. This could be annoying when you're looking at recent data, since it can be up to 5 minutes out of date. Additionally, if you configure alerts based on latency or error rate, for example, you might have to wait longer before being notified of issues with your server.

If you only care about long-term analysis, this might not matter too much.
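
If you do raise it, a minimal sketch of how to change the reporting period on the view-exporter path (if I recall correctly, stackdriver.Options also has a ReportingInterval field when using the metrics-exporter path):

package main

import (
    "time"

    "go.opencensus.io/stats/view"
)

func main() {
    // Export aggregated view data every 5 minutes; fewer exports per day
    // means proportionally fewer ingested points.
    view.SetReportingPeriod(5 * time.Minute)
}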