census-instrumentation / opencensus-go

A stats collection and distributed tracing framework
http://opencensus.io
Apache License 2.0

Improve performance of stats recording #1265

Closed: howardjohn closed this 1 year ago

howardjohn commented 2 years ago

Is your feature request related to a problem? Please describe. Yes. Using opencensus-go to record metrics has substantial overhead. In a real-world application, we have seen OC account for well over 10% of our memory allocations (and we generate GBs of protobufs per minute, so metrics recording should be negligible by comparison). This has led us to do things we really shouldn't have to think about, like adding a caching layer on top of the library.

Describe the solution you'd like Improve performance of the library; in particular, reduce memory allocations.

Describe alternatives you've considered Adding a caching layer above the library, or using a different library.

Additional context

I wrote some benchmarks to compare against the Prometheus client. There are two variants for each library: one with a precomputed label/tag and one computed inside the loop:

import (
    "context"
    "testing"

    "github.com/prometheus/client_golang/prometheus"
    "go.opencensus.io/stats"
    "go.opencensus.io/stats/view"
    "go.opencensus.io/tag"
)

func BenchmarkMetrics(b *testing.B) {
    b.Run("oc", func(b *testing.B) {
        mLineLengths := stats.Float64("test", "my-benchmark", stats.UnitDimensionless)
        key := tag.MustNewKey("key")
        v := &view.View{
            Measure:     mLineLengths,
            TagKeys:     []tag.Key{key},
            Aggregation: view.Sum(),
        }
        if err := view.Register(v); err != nil {
            b.Fatal(err)
        }

        for n := 0; n < b.N; n++ {
            allTags := []tag.Mutator{tag.Upsert(key, "val")}
            if err := stats.RecordWithTags(context.Background(), allTags, mLineLengths.M(1)); err != nil {
                b.Fatal(err)
            }
        }
    })
    b.Run("oc-fixed", func(b *testing.B) {
        mLineLengths := stats.Float64("test", "my-benchmark", stats.UnitDimensionless)
        key := tag.MustNewKey("key")
        v := &view.View{
            Measure:     mLineLengths,
            TagKeys:     []tag.Key{key},
            Aggregation: view.Sum(),
        }
        if err := view.Register(v); err != nil {
            b.Fatal(err)
        }

        allTags := []tag.Mutator{tag.Upsert(key, "val")}
        for n := 0; n < b.N; n++ {
            if err := stats.RecordWithTags(context.Background(), allTags, mLineLengths.M(1)); err != nil {
                b.Fatal(err)
            }
        }
    })
    b.Run("prom", func(b *testing.B) {
        g := prometheus.NewGaugeVec(prometheus.GaugeOpts{
            Namespace: "tests",
            Name:      "test",
        }, []string{"key"})
        prometheus.Register(g)
        for n := 0; n < b.N; n++ {
            g.With(prometheus.Labels{"key": "value"}).Add(1)
        }
    })
    b.Run("prom-fixed", func(b *testing.B) {
        g := prometheus.NewGaugeVec(prometheus.GaugeOpts{
            Namespace: "tests",
            Name:      "test",
        }, []string{"key"})
        prometheus.Register(g)
        l := prometheus.Labels{"key": "value"}
        for n := 0; n < b.N; n++ {
            g.With(l).Add(1)
        }
    })
}

Results:

BenchmarkMetrics
BenchmarkMetrics/oc
BenchmarkMetrics/oc-6             864436              1234 ns/op             768 B/op         14 allocs/op
BenchmarkMetrics/oc-6            1208937              1011 ns/op             768 B/op         14 allocs/op
BenchmarkMetrics/oc-6            1000000              1016 ns/op             768 B/op         14 allocs/op
BenchmarkMetrics/oc-fixed
BenchmarkMetrics/oc-fixed-6      1264486               919.2 ns/op           680 B/op         11 allocs/op
BenchmarkMetrics/oc-fixed-6      1284253               956.4 ns/op           680 B/op         11 allocs/op
BenchmarkMetrics/oc-fixed-6      1279734               961.2 ns/op           680 B/op         11 allocs/op
BenchmarkMetrics/prom
BenchmarkMetrics/prom-6          3083409               371.8 ns/op           336 B/op          2 allocs/op
BenchmarkMetrics/prom-6          3202328               385.6 ns/op           336 B/op          2 allocs/op
BenchmarkMetrics/prom-6          3208323               388.9 ns/op           336 B/op          2 allocs/op
BenchmarkMetrics/prom-fixed
BenchmarkMetrics/prom-fixed-6   12074671                95.92 ns/op            0 B/op          0 allocs/op
BenchmarkMetrics/prom-fixed-6   12057554                89.15 ns/op            0 B/op          0 allocs/op
BenchmarkMetrics/prom-fixed-6   13738635                88.36 ns/op            0 B/op          0 allocs/op

So the Prometheus counterpart actually has zero allocs once the label is created, and it is also 10x faster. Not even considering GC overhead, which is substantial, that means (on the machine above) I can record 1M metrics/s with OC and 10M/s with Prometheus. Of course, in the real world metrics recording should be a tiny portion of the CPU used by the process.

howardjohn commented 2 years ago

Some potential optimization points I will be looking at:

Switching https://github.com/census-instrumentation/opencensus-go/blob/49838f207d61097fc0ebb8aeef306913388376ca/stats/view/collector.go#L69 to return a string directly, using concepts from strings.Builder, removes a call to slicebytetostring and drops allocs by 6.6%.
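
For illustration, a minimal sketch of the idea (the function name and the exact value encoding here are assumptions, not the actual collector.go internals): building the encoded key string in a strings.Builder lets the final String() call hand back the builder's buffer, instead of copying a []byte into a string at the end.

import (
    "encoding/binary"
    "strings"
)

// encodeTagValues is a hypothetical stand-in for the collector's encoder.
// It length-prefixes each value (varint, as an assumption) and accumulates
// the result in a strings.Builder, so String() returns the underlying
// buffer without the slicebytetostring copy that a []byte-to-string
// conversion would incur.
func encodeTagValues(values []string) string {
    var sb strings.Builder
    var buf [binary.MaxVarintLen64]byte
    for _, v := range values {
        n := binary.PutUvarint(buf[:], uint64(len(v)))
        sb.Write(buf[:n])
        sb.WriteString(v)
    }
    return sb.String()
}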

howardjohn commented 2 years ago

https://github.com/census-instrumentation/opencensus-go/blob/49838f207d61097fc0ebb8aeef306913388376ca/stats/record.go#L131-L135 can be optimized as well. At the very least, we are creating a context just to stick a map in it, then immediately extracting the map and discarding the context. Context creation is not free; it's pretty expensive.
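
A paraphrase of the pattern being criticized (a sketch, not the actual record.go source):

import (
    "context"

    "go.opencensus.io/stats"
    "go.opencensus.io/tag"
)

// recordWithTagsSketch paraphrases the hot path: the mutators are applied
// by constructing a whole new context via tag.New, and the tag map is then
// immediately extracted back out of that context, which is discarded.
func recordWithTagsSketch(ctx context.Context, mutators []tag.Mutator, ms ...stats.Measurement) error {
    ctx, err := tag.New(ctx, mutators...) // allocates a new tag.Map AND a new context
    if err != nil {
        return err
    }
    tm := tag.FromContext(ctx) // the context wrapper is thrown away right here
    _ = tm                     // tm (plus ms) is all the recorder actually needs
    return nil
}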

On top of this, there is no efficient way to pass in a set of tags. We cannot pass a simple map of tags like the prom library takes; instead we pass mutators. This means the library must allocate its own map without knowing how much data it will store (so it cannot size-hint the make call). newMap also always returns a pointer, which adds another allocation. Compare OC's map type:

type Map struct {
    m map[Key]tagContent
}

with the prometheus equivalent, type Labels map[string]string. The bare map type avoids an extra allocation just to wrap a map.
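
To illustrate the difference, a toy sketch (not either library's real code): a named map type can be passed around and pre-sized by the caller, with no wrapper struct or pointer in sight.

// Toy sketch: a named map type needs no wrapper struct or pointer,
// and the caller can pre-size it with make.
type Labels map[string]string

func record(ls Labels) { /* look up the time series for ls and add */ }

func main() {
    ls := make(Labels, 1) // caller knows the size, so it can hint make
    ls["key"] = "value"
    record(ls) // the map header is passed directly; no *Map wrapper alloc
}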

Some of these inefficiencies seem to stem from all Record() variants taking in a context and mutating the tags. When no mutation is needed, we pay a huge performance overhead for no reason. If I have a static set of labels, applying mutators like tag.Upsert on every record is a waste of time; I can compute the tag map once ahead of time.

For example, making a small tweak to the API to allow directly passing in a tag.Map cuts the alloc count in half and allocates 1/5 as much memory:

BenchmarkMetrics/oc
BenchmarkMetrics/oc-6            1000000              1030 ns/op             720 B/op         13 allocs/op
BenchmarkMetrics/oc-fixed
BenchmarkMetrics/oc-fixed-6      1315213               895.4 ns/op           632 B/op         10 allocs/op
BenchmarkMetrics/oc-fast
BenchmarkMetrics/oc-fast-6       1808168               617.7 ns/op           136 B/op          6 allocs/op

code:

func RecordWithTagsFast(m *tag.Map, ms ...Measurement) error {
    recorder := internal.DefaultRecorder
    if recorder == nil {
        return nil
    }
    record := false
    for _, meas := range ms {
        if meas.desc.subscribed() {
            record = true
            break
        }
    }
    if !record {
        return nil
    }
    recorder(m, ms, nil)
    return nil
}
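
Usage would look something like this (a sketch reusing the names from the benchmark above; it assumes RecordWithTagsFast is exported from the stats package):

// Build the tag map once via tag.New, pull it out of the context,
// and reuse it on every call.
ctx, err := tag.New(context.Background(), tag.Upsert(key, "val"))
if err != nil {
    b.Fatal(err)
}
tm := tag.FromContext(ctx) // precomputed *tag.Map, computed once

for n := 0; n < b.N; n++ {
    if err := stats.RecordWithTagsFast(tm, mLineLengths.M(1)); err != nil {
        b.Fatal(err)
    }
}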

howardjohn commented 2 years ago

So there is actually a way to reduce allocations with the existing API, if we can precompute tags, by using some different functions:

    b.Run("oc-ctx", func(b *testing.B) {
        mLineLengths := stats.Float64("test", "my-benchmark", stats.UnitDimensionless)
        key := tag.MustNewKey("key")
        v := &view.View{
            Measure:     mLineLengths,
            TagKeys:     []tag.Key{key},
            Aggregation: view.Sum(),
        }
        if err := view.Register(v); err != nil {
            b.Fatal(err)
        }
        ctx, err := tag.New(context.Background(), tag.Upsert(key, "val"))
        if err != nil {
            b.Fatal(err)
        }
        m := mLineLengths.M(1)
        for n := 0; n < b.N; n++ {
            stats.Record(ctx, m)
        }
    })

Compared to the fixed variant above, this saves 50% of the allocs:

BenchmarkMetrics
BenchmarkMetrics/oc
BenchmarkMetrics/oc-6            1000000              1021 ns/op             768 B/op         14 allocs/op
BenchmarkMetrics/oc-fixed
BenchmarkMetrics/oc-fixed-6      1308032               933.5 ns/op           680 B/op         12 allocs/op
BenchmarkMetrics/oc-ctx
BenchmarkMetrics/oc-ctx-6        1915941               606.5 ns/op           216 B/op          6 allocs/op

We can improve this more. Record always calls createRecordOption, but that's not required; rewriting it to

func Record(ctx context.Context, ms ...Measurement) {
    if len(ms) == 0 {
        return
    }
    recorder := internal.DefaultRecorder
    if recorder == nil {
        return
    }
    record := false
    for _, m := range ms {
        if m.desc.subscribed() {
            record = true
            break
        }
    }
    if !record {
        return
    }
    recorder(tag.FromContext(ctx), ms, nil)
}

drops us from 6->5 allocs (216B -> 136B).

We can save one more by computing the []stats.Measurement slice ahead of time:

        m := []stats.Measurement{mLineLengths.M(1)}
        for n := 0; n < b.N; n++ {
            stats.Record(ctx, m...)
        }

So we are down to 4 allocs.

howardjohn commented 2 years ago

Replacing func(tags *tag.Map, measurement interface{}, attachments map[string]interface{}) with func(tags *tag.Map, measurement []stats.Measurement, attachments map[string]interface{}) brings us to 3 allocs, since the measurements no longer have to be boxed into an interface{} on every call.
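
The saving comes from interface boxing: storing a slice into an interface{} parameter forces the slice header onto the heap. A standalone sketch (not the library's recorder; the names here are illustrative):

var boxed interface{} // forces the argument to escape, as a stored recorder callback would

// viaInterface boxes the slice header into an interface{}: one heap
// allocation per call once the value escapes.
func viaInterface(m interface{}) { boxed = m }

var kept []float64

// concrete takes the slice directly: no boxing, no extra allocation.
func concrete(ms []float64) { kept = ms }

func main() {
    ms := []float64{1, 2, 3}
    viaInterface(ms) // allocates: []float64 -> interface{} conversion
    concrete(ms)     // no allocation for the parameter pass
}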

So with all of those tweaks, we are down to 3 allocs per record.

howardjohn commented 1 year ago

Given that most of this has landed and OC is deprecated, I don't think there is anything left to do. OTel is much better here: https://blog.howardjohn.info/posts/zero-alloc-otel/