asynkron / protoactor-go

Proto Actor - Ultra fast distributed actors for Go, C# and Java/Kotlin
http://proto.actor
Apache License 2.0

Possible memory leak #1110

Open waeljammal opened 6 months ago

waeljammal commented 6 months ago

Hi,

We are using clustering with Consul and there seems to be a memory leak related to gossip. I started new instances, left them idle for a short while (about 10 minutes), and saw the following in Parca.

It keeps accumulating entries in ConcurrentMap and never releases them. I've left it running for up to an hour with the same result: memory just keeps increasing.

We do not send anything over gossip and do not subscribe to anything on the cluster gossip, so this is purely internal to clustering.

It does seem to flatten out eventually, but only after creating over 50K objects. I'm not sure what it is allocating here and have not looked into it in detail, but that ConcurrentMap keeps growing for some time and eats up a decent chunk of memory. I did have a quick look in a debugger, but all I could see were 32 entries, each containing 0 items and an RW mutex, so I could not figure out where these in-use allocations are going.
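In case it helps with reproducing this, heap snapshots can also be captured with the standard pprof endpoint instead of Parca. A minimal sketch, assuming the member process can expose an extra HTTP listener on localhost:6060 (the address and the bare-bones main are illustrative, not our real service):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
    )

    func main() {
        // Start the cluster member here as usual (omitted), then expose pprof
        // so heap profiles can be captured while the member sits idle.
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()
        select {} // keep the process alive while snapshots are taken
    }

Two snapshots taken a few minutes apart (for example `curl -o heap1.pb.gz http://localhost:6060/debug/pprof/heap`) can then be compared with `go tool pprof -base heap1.pb.gz heap2.pb.gz` to see which call sites keep accumulating in-use objects.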

We are initializing the cluster like so, basically defaults:

    // Remote endpoint and the address advertised to other members.
    config := remote.Configure(a.opts.Config.BindAddress, a.opts.Config.BindPort)
    config.AdvertisedHost = fmt.Sprintf("%s:%d", a.opts.Config.AdvertiseHost, a.opts.Config.BindPort)

    // Consul is the cluster membership provider.
    if provider, err = consul.NewWithConfig(&api.Config{
        Address: address,
    }); err != nil {
        return err
    }

    // Cluster configuration with our registered kinds; everything else is left at defaults.
    clusterConfig := cluster.Configure(a.opts.ClusterName, provider, lookup, config, cluster.WithKinds(a.kinds...))
    clusterConfig.RequestTimeoutTime = time.Second * 30

    // Create the cluster and join as a member.
    c := cluster.New(a.as, clusterConfig)
    a.cluster = c
    c.StartMember()

(Screenshot: Parca memory profile, 2024-05-07 10:42 am)

rogeralsing commented 6 months ago

By the looks of it from the screenshots, it seems like the Future processes are not cleared out of the local ProcessRegistry.

One thing that caught my eye is this line: clusterConfig.RequestTimeoutTime = time.Second * 30

Does it look the same if you set that to, say, 5 seconds? Does it flatten out earlier then?

One possible issue could be that futures are not cleared until the timeout expires, even if they completed successfully. I'm not saying this is the case, but if we have such a bug, it would likely manifest this way.
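To illustrate the shape of that hypothetical bug (this is not the actual Future/ProcessRegistry code, just a minimal sketch of the suspected pattern): if a registry entry is only removed when the timeout timer fires, completed requests keep occupying the map for the full RequestTimeoutTime, so a 30 second timeout holds on to every future created in the last 30 seconds.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // future stands in for a request/response future held by a registry.
    type future struct {
        done chan struct{}
    }

    // registry mimics a process registry whose entries are only removed on timeout.
    type registry struct {
        mu      sync.Mutex
        entries map[int]*future
    }

    func (r *registry) spawn(id int, timeout time.Duration) *future {
        f := &future{done: make(chan struct{})}
        r.mu.Lock()
        r.entries[id] = f
        r.mu.Unlock()

        // Suspected pattern: the entry is dropped only when the timeout fires,
        // even if the future completed long before that.
        time.AfterFunc(timeout, func() {
            r.mu.Lock()
            delete(r.entries, id)
            r.mu.Unlock()
        })
        return f
    }

    func (r *registry) size() int {
        r.mu.Lock()
        defer r.mu.Unlock()
        return len(r.entries)
    }

    func main() {
        r := &registry{entries: map[int]*future{}}
        for i := 0; i < 10000; i++ {
            f := r.spawn(i, 30*time.Second) // long RequestTimeoutTime
            close(f.done)                   // the request completes immediately
        }
        // Every request has completed, yet all the futures are still registered.
        fmt.Println("registered futures:", r.size())
    }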

Another possibility is that the ConcurrentMap keeps its already allocated size even when entries are removed.
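For that second possibility, here is a small standalone demo (a plain Go map standing in for one ConcurrentMap shard, not protoactor's actual code) showing that deleting every key does not return the bucket memory; the heap stays near its high-water mark until the map itself becomes unreachable:

    package main

    import (
        "fmt"
        "runtime"
    )

    // heapMB reports the live heap size in MB after forcing a GC.
    func heapMB() uint64 {
        runtime.GC()
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        return m.HeapAlloc / (1 << 20)
    }

    func main() {
        fmt.Println("baseline:", heapMB(), "MB")

        // One "shard": a map that grows to a million entries.
        shard := make(map[string][64]byte)
        for i := 0; i < 1000000; i++ {
            shard[fmt.Sprintf("key-%d", i)] = [64]byte{}
        }
        fmt.Println("after insert:", heapMB(), "MB")

        // Delete everything; the map is empty but its buckets are not freed.
        for k := range shard {
            delete(shard, k)
        }
        fmt.Println("after delete:", heapMB(), "MB, len =", len(shard))
    }

If that kind of retention is what the profile is showing, in-use memory would plateau at the high-water mark rather than grow forever, which would roughly match the "eventually flattens out" observation above.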

We will have to look deeper into all this. Any more data from your side would be much appreciated.

lrweck commented 6 months ago

I've seen the same behaviour. From what I could gather, it is the second option (ConcurrentMap keeps the already allocated size and does not shrink).

waeljammal commented 6 months ago

Hi, sorry for the late response, I've been away. I'll give your recommendation a try; it seems lrweck thinks this might have something to do with ConcurrentMap, but I'll give it a shot either way and report back.

waeljammal commented 6 months ago

It still happens after reducing RequestTimeoutTime to 5 seconds; memory usage keeps going up, same as before.

lrweck commented 4 months ago

@rogeralsing have you had the time to check if the memory increase is indeed from ConcurrentMap?