asynkron / protoactor-dotnet

Proto Actor - Ultra fast distributed actors for Go, C# and Java/Kotlin
http://proto.actor
Apache License 2.0

DefaultMailbox size constantly increasing #2109

Closed · guipalazzo closed this issue 3 months ago

guipalazzo commented 8 months ago

I have an application that basically collects data from different sources and stores it in CSV files every 30 seconds. For this, it uses around 5k actors. 50% of these are what I call 'front actors', which are in charge of receiving the data and storing it in a list. The other 50% are 'back actors', which are in charge of sinking the data into the CSV files. Whenever a 30-second cycle ends, 1) the one classified as a 'back actor' starts to receive new data, 2) the 'front actor' sinks the in-memory data to the CSV files, 3) sends itself a poison pill once concluded, and 4) a new 'back actor' is created. It is important to say that the data flow is quite steady - there are some peaks, but it does not vary that much in terms of MB/s. Also, no dead letters at all were observed.
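For clarity, here is a minimal sketch of what that cycle looks like (the actor, message, and file-handling names below are illustrative, not the actual code from my application):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Proto;

// Illustrative only: a collector that plays the 'front' role while receiving
// data and the 'back' role when the 30-second cycle ends.
public record DataRow(string Csv);
public record SinkToFile(string Path);

public class CollectorActor : IActor
{
    private readonly List<string> _rows = new();

    public async Task ReceiveAsync(IContext context)
    {
        switch (context.Message)
        {
            case DataRow row:
                // 'front' role: just buffer the incoming data in memory
                _rows.Add(row.Csv);
                break;

            case SinkToFile sink:
                // 'back' role: flush the buffered data to the CSV file...
                await File.AppendAllLinesAsync(sink.Path, _rows);
                // ...and then stop, the equivalent of sending itself a poison pill
                context.Poison(context.Self);
                break;
        }
    }
}

// The parent then spawns a fresh collector to take over the next cycle, e.g.:
// var next = context.Spawn(Props.FromProducer(() => new CollectorActor()));
```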

What happens, though, is that at some point (typically ~20 min after data collection starts), the LOH and POH memory start to ramp up, as per the screenshot below:

[screenshot: LOH and POH memory ramping up]

When I analyze it with dotMemory, I see that 99% of that is due to the size of 'ConcurrentQueueSegment+Slot':

[screenshot: dotMemory breakdown dominated by ConcurrentQueueSegment+Slot]

Screenshot taken when memory consumption was at ~1.3 GB:

[screenshot: memory at ~1.3 GB]

Screenshot taken when memory consumption was at ~4.4 GB:

[screenshot: memory at ~4.4 GB]

Some more snapshots taken:

[screenshot: additional dotMemory snapshots]

When I analyze those thousands of 'ConcurrentQueueSegment+Slot' instances, I see that they are all composed of a 'PingMessage' I created internally. These are recurring messages that are sent to and used by parent actors in the actor hierarchy at a certain TimeSpan interval (using 'ActorContext.Scheduler().SendRepeatedly(delay, interval, ActorPid, recurringMessage)'). So it has nothing to do with the front and back actors - they don't even receive it. The PingMessages are sent to 'self'. Put another way, the PingMessages don't flow through the application... So even though there are hundreds of actors being killed every minute or so, the PingMessages are always there, are never touched by this sinking process, and the only place they exist is within the actor itself.
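For reference, the recurring ping is wired up roughly like the sketch below, based on the SendRepeatedly call quoted above (the message name, intervals, and cancellation handling are illustrative):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Proto;
using Proto.Timers;

public record PingMessage;

public class ParentActor : IActor
{
    private CancellationTokenSource? _pingTimer;

    public Task ReceiveAsync(IContext context)
    {
        switch (context.Message)
        {
            case Started:
                // schedule a recurring PingMessage to self: (delay, interval, target, message)
                _pingTimer = context.Scheduler().SendRepeatedly(
                    TimeSpan.FromSeconds(5),
                    TimeSpan.FromSeconds(30),
                    context.Self,
                    new PingMessage());
                break;

            case PingMessage:
                // periodic housekeeping for this parent actor
                break;

            case Stopping:
                // cancel the timer when the actor shuts down
                _pingTimer?.Cancel();
                break;
        }

        return Task.CompletedTask;
    }
}
```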

I also find it strange that this always happens ~20 min after the application starts, even though by that point thousands of sinking routines (and therefore thousands of PoisonPills) have already run. So this doesn't seem to be related only to the recurring messages... If it were, the size of the 'ConcurrentQueueSegment+Slot' should have been increasing since the beginning of the application run.

[screenshot]

So, this issue looks somehow related to the fact that I kill and create hundreds of actors every minute or so, but at the same time 99.99% of the messages are of type 'PingMessage', which the actor classes being killed/created are not even aware of.

Could anyone assist, please?

rogeralsing commented 7 months ago

What happens if the application is left running? Will it ever release the memory? Depending on workstation or server GC, .NET can sometimes hold on to memory for a long time before releasing it, creating a sort of false positive: objects that look like they are in use when they are really just waiting to be released.

Usually when a mailbox grows, it's a sign that the actor is busy doing work and new messages are arriving faster than the actor can handle them. In such cases, some form of backpressure can mitigate this, e.g. reverse the flow and let the actor signal to the producer that it is ready to receive new work rather than just being force-fed work from the outside.
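As a rough sketch of that pull-based flow (the WorkRequest/WorkItem messages and actor names here are made up for illustration, not part of the Proto.Actor API):

```csharp
using System.Threading.Tasks;
using Proto;

public record WorkRequest(PID Worker);
public record WorkItem(string Payload);

// The worker pulls work: it only asks the producer for the next item once the
// previous one has been handled, so its mailbox cannot grow without bound.
public class WorkerActor : IActor
{
    private readonly PID _producer;

    public WorkerActor(PID producer) => _producer = producer;

    public Task ReceiveAsync(IContext context)
    {
        switch (context.Message)
        {
            case Started:
                // signal readiness instead of waiting to be force-fed
                context.Send(_producer, new WorkRequest(context.Self));
                break;

            case WorkItem item:
                Handle(item);
                // only request more work after finishing the current item
                context.Send(_producer, new WorkRequest(context.Self));
                break;
        }

        return Task.CompletedTask;
    }

    private static void Handle(WorkItem item)
    {
        // do the actual work here (e.g. write a CSV row)
    }
}

// The producer keeps its own buffer and only sends a WorkItem back when it
// receives a WorkRequest, e.g. context.Send(request.Worker, nextItem).
```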

Without more context, it's hard to say what is going on here specifically. If you could slim this down to a minimal reproducible example, that would make it a lot easier for us to dive in and see what is going on.

guipalazzo commented 7 months ago

I've been trying to debug this behavior for days. It's been hard, I should say. One thing that I haven't mentioned is that I have this same application using Akka.Net, and this behavior hasn't been observed after several months of usage. I basically built this version of the application based on the Akka.Net version by replacing the actor system (and the actor-specific code, of course).

This is what I observe:

If the application is left running, the memory usage keeps increasing until I run out of it (64 GB). I've already tried forcing GC.Collect to run at regular intervals, regardless of any performance impact, but that didn't work either.
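(For reference, a forced collection of this kind might look like the sketch below; the interval and GC options are illustrative only, not the exact code I used.)

```csharp
using System;
using System.Runtime;
using System.Threading.Tasks;

// Illustrative only: force a full, compacting collection (including the LOH)
// at a fixed interval. The 30-second interval is an arbitrary placeholder.
while (true)
{
    await Task.Delay(TimeSpan.FromSeconds(30));

    GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
    GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced, blocking: true, compacting: true);
    GC.WaitForPendingFinalizers();
}
```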

Let me also give you some more background on the application, so you get a better sense that it is not really that demanding performance-wise: the average size of each CSV file being sunk is ~30 KB, and the sinking routine runs every 30-60 seconds (that is, 30 seconds plus a random additional interval of up to 30 seconds, just to make sure they are not all sinking at the same time) using ~500 actors, so I doubt that messages are arriving faster than the actors can handle. There are peaks, which only occur in the first few seconds or so as the application captures a bit of historical information, but CPU usage rarely goes above 20-30% in those moments on an 8-core PC. Before the ramp-up, RAM usage rarely goes above 0.5 GB, even in the first minutes. The only moment I see the CPU peaking at ~50% or so is exactly before the memory consumption ramp-up starts. It is as if something is triggering it. From then on, memory consumption starts to ramp up fast and nothing seems to stop it.

Also, regarding your comment 'some form of backpressure can mitigate this. e.g. reverse the flow and let the actor signal to the producer that it is ready to receive new work rather than just being force-fed work from the outside', please note that I'm killing the 'back' actors whenever the sinking operation finishes and, while the data is being sunk, there is already a 'front' actor receiving the new messages. This is handled by ~100 parent actors.

What's really tricky to see is that:

1) The PingMessage class is really small, containing just an int enum. dotMemory indicates that an instance of this class takes 24 bytes at runtime;

2) There are a few hundred actors using it (and those that use it stay alive during the entire application lifecycle). So, if we do some simple math under the incorrect (but conservative) assumption that these 24-byte instances are never garbage collected for some reason, there would be no more than ~24 KB being instantiated every minute (24 bytes x 500 actors x 2 pings per minute), so nothing seems to explain this behavior from a reasonable perspective;

I think this one is important:

3) According to dotMemory, each one of the thousands of 'ConcurrentQueueSegment+Slot' instances that are causing the memory ramp-up contains EXACTLY 500 of the 24-byte PingMessages. So, even supposing this were the expected behavior (which it does not seem to be), and even though the capacity of each ConcurrentQueueSegment+Slot is ~17 MB, they don't even seem to be filled to their maximum capacity, and yet, for some reason, the UnboundedMailboxQueue makes its ConcurrentQueueSegment create another (nested) instance of itself (more on this just below);

4) As I mentioned, the longer the application runs after the ramp-up, the higher the number of 'ConcurrentQueueSegment' instances. Please note from the screenshot below that they are all nested / stacked up / piling up:

[screenshot: nested ConcurrentQueueSegment instances]

Would you know what could be creating such nested instances of the ConcurrentQueueSegment in the UnboundedMailboxQueue? Does the fact that all of its instances contain exactly 500 instances of the PingMessage indicate something?

guipalazzo commented 7 months ago

I'm doing an obvious test that I should have done before: I just let the application run without processing any data (that is, no CSV sinking, so no actor creation and no actor killing). I see the same behavior: the number of ConcurrentQueueSegments with 500 ping messages starts increasing fast after ~20 min, as per the screenshots below.

[screenshots: dotMemory snapshots showing the number of ConcurrentQueueSegments increasing]

I'll try to create a minimal example of it and send it over shortly.

guipalazzo commented 6 months ago

I was finally able to figure out what was happening, and the issue was due to a dumb mistake on my end that was rather difficult to catch. Sorry about that, and thanks for your support.