asynkron / protoactor-go

Proto Actor - Ultra fast distributed actors for Go, C# and Java/Kotlin
http://proto.actor
Apache License 2.0
5.08k stars 522 forks source link

Memory leak when actor metrics are enabled #745

Open antoniomacri opened 2 years ago

antoniomacri commented 2 years ago

Describe the bug Enabling actor metrics:

    system := actor.NewActorSystem(actor.WithMetricProviders(global.MeterProvider()))

causes a memory leak when actors are stopped.

To Reproduce Run the following main:

package main

import (
    "fmt"
    "github.com/asynkron/protoactor-go/actor"
    "github.com/google/uuid"
    "go.opentelemetry.io/otel/metric/global"
    "log"
    "math/rand"
    "net/http"
    _ "net/http/pprof"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    system := actor.NewActorSystem(actor.WithMetricProviders(global.MeterProvider()))

    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    for i := 0; i < 20_000; i++ {
        message := &MyData{id: uuid.New().String(), value: randomString(100_000)}
        props := actor.PropsFromProducer(func() actor.Actor {
            return &myActor{data: make(map[string]*MyData)}
        })
        pid, err := system.Root.SpawnNamed(props, message.id)
        if err != nil && err != actor.ErrNameExists {
            fmt.Printf("Error while sending message to actor=%v err=%v\n", pid, err)
        }
        system.Root.Send(pid, message)
        if err := system.Root.PoisonFuture(pid).Wait(); err != nil {
            fmt.Printf("Error while terminating actor=%v err=%v\n", pid, err)
        }
        time.Sleep(5 * time.Millisecond)
    }
    fmt.Println("-- Done.")

    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
    <-sigs
}

type myActor struct {
    data map[string]*MyData
}

type MyData struct {
    id    string
    value string
}

func (a *myActor) Receive(ctx actor.Context) {
    switch msg := ctx.Message().(type) {
    case *MyData:
        a.data[msg.id] = msg
    }
}

func randomString(length int) string {
    const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b := make([]byte, length)
    for i := range b {
        b[i] = letterBytes[rand.Intn(len(letterBytes))]
    }
    return string(b)
}

A few GB of RAM are retained even though the actors are stopped immediately after sending the data message.

By replacing

    system := actor.NewActorSystem(actor.WithMetricProviders(global.MeterProvider()))

with

    system := actor.NewActorSystem()

the memory leak disappears.

Expected behavior The memory is released when the actors are stopped.

Additional context In props.go the default spawner contains this code:

        ctx := newActorContext(actorSystem, props, parentContext.Self())
        mb := props.produceMailbox()

        // prepare the mailbox number counter
        if ctx.actorSystem.Config.MetricsProvider != nil {
            sysMetrics, ok := ctx.actorSystem.Extensions.Get(extensionId).(*Metrics)
            if ok && sysMetrics.enabled {
                if instruments := sysMetrics.metrics.Get(metrics.InternalActorMetrics); instruments != nil {
                    sysMetrics.PrepareMailboxLengthGauge()
                    meter := global.Meter(metrics.LibName)
                    if err := meter.RegisterCallback([]instrument.Asynchronous{instruments.ActorMailboxLength}, func(goCtx context.Context) {
                        instruments.ActorMailboxLength.Observe(goCtx, int64(mb.UserMessageCount()), sysMetrics.CommonLabels(ctx)...)
                    }); err != nil {
                        err = fmt.Errorf("failed to instrument Actor Mailbox, %w", err)
                        plog.Error(err.Error(), log.Error(err))
                    }
                }
            }
        }

Specifically the closurefunc(goCtx context.Context) captures mb and ctx, and is not unregistered after the actor is stopped.

This seems to be the root cause of the memory leak.

antoniomacri commented 2 years ago

Also, it's not clear to me how that metric (ActorMailboxLength) should work. Shouldn't it sum the mailbox length for all the living actors? It instead seems to replace the gauge value with the mailbox length of the last called back actor.

antoniomacri commented 1 year ago

I'm keen to help with this issue.

Any hint on how to fix the leak and correctly implement the mailbox metric?

rogeralsing commented 1 year ago

Thanks for the detailed report. I have to dive into this to see what the memory issue could be

antoniomacri commented 1 year ago

It should be due to the registered callback for each actor, which keeps references to its mailbox and context via the closure.

So changing the callback logic should also fix the leak.

I'm thinking about registering a single callback in the OpenTelemetry Meter. Go implementation of the Meter doesn't provide a way to unregister the callback (although this seems mandated by the spec...). Therefore this should be handled explicitly. WDYT?

Thanks for the response.