AxonFramework / AxonFramework

Framework for Evolutionary Message-Driven Microservices on the JVM
https://axoniq.io/
Apache License 2.0

When `EmbeddedEventStore` has `optimizeEventConsumption` enabled (the default), it sometimes causes a memory leak of `Node` instances #2954

Closed. yitizi closed this issue 3 months ago.

yitizi commented 10 months ago

Basic information

Steps to reproduce

Use the JdbcEventStorageEngine with tracking processors to consume events from the event bus, and start up two application instances.

  1. Initially, the event store contains some events, and the tracking processor processes them. Once it catches up to the head of the stream, it starts an EventProducer to fetch events instead of using its private stream.
  2. Insert a large number of events into the event store, about 10,000, which cannot be processed immediately.
  3. Release the tracking token that one instance holds, so the other instance takes it over.

This happens occasionally in our production environment. I reproduced it once, but by the time I took a heap dump, the leaked objects had already been reclaimed by a full GC. The production environment runs on arm64, by the way.

Expected behaviour

The application runs stably.

Actual behaviour

The oldest Node (no longer held by the EmbeddedEventStore, but still present in the G1 Old Gen) has no GC root, yet cannot be reclaimed, and survivor space usage keeps growing until it causes an OOM.

[MAT heap-dump screenshots]

smcvb commented 10 months ago

First off, thanks for filing this concern with us, @yitizi.

Secondly, perhaps an obvious pointer, but Axon Framework 4.4.5 is an old version (November 2020), as we are currently at release 4.9.1. Although we make exceptions from time to time, we aim to provide critical bug fixes about three minor releases back.

Hence, it would be very helpful if you could show the same predicament under a more recent release of Axon Framework. Not just for us, but also for you and your team when it comes to receiving a fix, of course.

Another point that interests me is your specific mention of the optimizeEventConsumption property being set to true in your environment. Does that mean you have shown the issue does not happen when optimizeEventConsumption is set to false?

Now let me comment a little on your explanation:

Initially, the event store contains some events, and the tracking processor processes them. Once it catches up to the head of the stream, it starts an EventProducer to fetch events instead of using its private stream.

The EmbeddedEventStore, when optimizeEventConsumption is enabled (which is the default), will aim to use a common EventConsumer among different StreamingEventProcessor instances as a means to optimize. If a given StreamingEventProcessor falls outside the cache window of (by default) 10_000 events, the StreamingEventProcessor will receive a custom EventConsumer as its message source.
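For clarity, here is a minimal sketch of how these settings are wired when constructing the EmbeddedEventStore; the storageEngine parameter stands in for your configured JdbcEventStorageEngine, and the values shown are the defaults described above:

```java
import org.axonframework.eventsourcing.eventstore.EmbeddedEventStore;
import org.axonframework.eventsourcing.eventstore.EventStorageEngine;
import org.axonframework.eventsourcing.eventstore.EventStore;

class EventStoreFactory {

    static EventStore eventStore(EventStorageEngine storageEngine) {
        return EmbeddedEventStore.builder()
                .storageEngine(storageEngine)
                .optimizeEventConsumption(true) // default: tailing processors share a common EventConsumer
                .cachedEvents(10_000)           // default cache window of Nodes kept in memory
                .build();
    }
}
```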

Insert a large number of events into the event store, about 10,000, which cannot be processed immediately.

This number you're inserting is practically identical to the default cache window of the EmbeddedEventStore. Have you, perchance, made any other configuration changes to the EmbeddedEventStore besides specifically setting optimizeEventConsumption to true? As stated earlier, when making this shift, your environment will consciously shift from a common EventConsumer for all tailing StreamingEventProcessors to unique EventConsumer instances per StreamingEventProcessor.

This raises another question: how many StreamingEventProcessors (as you're using version 4.4.5, only the TrackingEventProcessor applies) does your application run?

Release the tracking token that one instance holds, so the other instance takes it over.

How exactly are you telling application X to release "a" TrackingToken so that the other instance picks it up?


Concluding: let's get to the bottom of this, @yitizi :muscle:

yitizi commented 10 months ago

First off, thanks for filing this concern with us, @yitizi.

Thanks for the reply. You suggested that I update the Axon Framework to the newest version, but the application is already published in production, so the update cannot be done quickly. I also noticed that the EventProducer code hasn't changed between versions 4.4.5 and 4.9.1.

Secondly, perhaps an obvious pointer, but Axon Framework 4.4.5 is an old version (November 2020), as we are currently at release 4.9.1. Although we make exceptions from time to time, we aim to provide critical bug fixes about three minor releases back.

Hence, it would be very helpful if you could show the same predicament under a more recent release of Axon Framework. Not just for us, but also for you and your team when it comes to receiving a fix, of course.

Another point that interests me is your specific mention of the optimizeEventConsumption property being set to true in your environment. Does that mean you have shown the issue does not happen when optimizeEventConsumption is set to false?

Yes, only when optimizeEventConsumption is set to true will the EventProducer start, and the linked-list-like caching be activated. When the EventProducer opens an event stream with the JdbcEventStorageEngine, it fetches all the way to the end, regardless of whether any trailing consumers exist. Therefore, when I insert numerous events, they all get cached in memory, and the trimCache method resets the oldest node. However, these cached nodes still have to wait for a JVM garbage collection to be reclaimed.

Now let me comment a little on your explanation:

Initially, the event store contains some events, and the tracking processor processes them. Once it catches up to the head of the stream, it starts an EventProducer to fetch events instead of using its private stream.

The EmbeddedEventStore, when optimizeEventConsumption is enabled (which is the default), will aim to use a common EventConsumer among different StreamingEventProcessor instances as a means to optimize. If a given StreamingEventProcessor falls outside the cache window of (by default) 10_000 events, the StreamingEventProcessor will receive a custom EventConsumer as its message source.

I have no doubt about the capability to switch from a standard EventConsumer to a custom EventConsumer. It's a fantastic feature that can significantly decrease the usage of database connections.

Insert a large number of events into the event store, about 10,000, which cannot be processed immediately.

This number you're inserting is practically identical to the default cache window of the EmbeddedEventStore. Have you, perchance, made any other configuration changes to the EmbeddedEventStore besides specifically setting optimizeEventConsumption to true? As stated earlier, when making this shift, your environment will consciously shift from a common EventConsumer for all tailing StreamingEventProcessors to unique EventConsumer instances per StreamingEventProcessor.

This raises another question: how many StreamingEventProcessors (as you're using version 4.4.5, only the TrackingEventProcessor applies) does your application run?

About two. [image]

Release the tracking token that one instance holds, so the other instance takes it over.

How exactly are you telling application X to release "a" TrackingToken so that the other instance picks it up?

In the debug environment, I do so by calling the org.axonframework.eventhandling.TrackingEventProcessor#releaseSegment(int) method. In production, it sometimes happens when an error occurs. Searching the logs of the application I mentioned before, it seems the cleaner is triggered but never completes. I'm sure there are just two tracking processors; the logs show it. [screenshots]
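For reference, a minimal sketch of how we trigger the release in the debug environment; the Configuration parameter and the processor name "MySagaProcessor" are placeholders, not our actual setup:

```java
import org.axonframework.config.Configuration;
import org.axonframework.eventhandling.TrackingEventProcessor;

class SegmentReleaser {

    // Looks up the tracking processor by name and, if present, tells it to
    // release segment 0 so that another instance can claim the token.
    static void releaseSegmentZero(Configuration config) {
        config.eventProcessingConfiguration()
              .eventProcessor("MySagaProcessor", TrackingEventProcessor.class)
              .ifPresent(processor -> processor.releaseSegment(0));
    }
}
```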

Concluding: let's get to the bottom of this, @yitizi 💪

Through the EventProducer's fetching approach and the caching of nodes in a linked list, pressure is put on the garbage collector (GC). Eventually, this prevents the garbage collector from functioning properly. @smcvb

smcvb commented 10 months ago

Thanks for the reply. You suggested that I update the Axon Framework to the newest version, but the application is already published in production, so the update cannot be done quickly. I also noticed that the EventProducer code hasn't changed between versions 4.4.5 and 4.9.1.

Although I am aware that we have made practically no changes to the EmbeddedEventStore over the last several releases, the chances of a patch release on AF 4.4, 4.5, and 4.6 are very slim. So, although your input here is appreciated, if you want to use a fix once it is present, you will simply need to upgrade to a more recent version of Axon Framework.

Note that this choice is not something we've made lightly. It is purely in place to minimize the maintenance effort for the Axon Framework team. Given the team size, we are unfortunately required to make this request of you. My apologies for that.


With that said, I will reflect on your other comments.

Yes, only when optimizeEventConsumption is set to true will the EventProducer start, and the linked-list-like caching be activated. When the EventProducer opens an event stream with the JdbcEventStorageEngine, it fetches all the way to the end, regardless of whether any trailing consumers exist. Therefore, when I insert numerous events, they all get cached in memory, and the trimCache method resets the oldest node. However, these cached nodes still have to wait for a JVM garbage collection to be reclaimed.

Thanks for the investigation here, @yitizi :-) Do you perhaps also have an idea of how to adjust trimCache to release resources more quickly? To be frank, at this stage I am not entirely certain how to improve things there... hence the request.

About two. [image] (I would prefer to quote a helpful text, but in this case there's only an image I can work with.)

It seems like your environment only has two Streaming Event Processors in place.

It's a fantastic feature that can significantly decrease the usage of database connections.

Although true, if you just have two Streaming Event Processors with a single token each, the TrackingEventProcessor will only open a single event stream per instance to begin with. As such, the optimization is very limited, and apparently causes more harm in your environment than benefit.

Granted, if you were to increase the number of segments for your TrackingEventProcessor, you would have an event stream (and thus a database connection) per segment. This is in turn extremely easily covered by the PooledStreamingEventProcessor, which in most cases has better performance characteristics anyhow. However, the PooledStreamingEventProcessor was introduced in 4.5, again meriting an upgrade of your environment to a more recent release of Axon Framework.
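For illustration, a minimal sketch of registering a processor as the pooled variant instead; the EventProcessingConfigurer and the processor name "MySagaProcessor" are assumptions standing in for your actual configuration:

```java
import org.axonframework.config.EventProcessingConfigurer;

class ProcessorConfig {

    static void configure(EventProcessingConfigurer configurer) {
        // Registers the processor as a PooledStreamingEventProcessor rather
        // than the default TrackingEventProcessor (available from Axon 4.5).
        configurer.registerPooledStreamingEventProcessor("MySagaProcessor");
    }
}
```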

In the debug environment, I do so by calling the org.axonframework.eventhandling.TrackingEventProcessor#releaseSegment(int) method. In production, it sometimes happens when an error occurs. Searching the logs of the application I mentioned before, it seems the cleaner is triggered but never completes. I'm sure there are just two tracking processors; the logs show it.

Furthermore, they're both sagas, which I base on the processor names. Given how most users implement sagas, it is to be expected that one or both of your Streaming Event Processors lags behind the other.

The log messages provided by Axon Framework are set to WARN on purpose, as this typically shouldn't happen outside of bursts of events. And if you do see them, it would require either (1) disabling the optimizeEventConsumption optimization or (2) changing the cachedEvents value to a number you would expect to be feasible.
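To make both options concrete, a minimal sketch; the storageEngine parameter and the 25_000 cache size are placeholders for your actual setup:

```java
import org.axonframework.eventsourcing.eventstore.EmbeddedEventStore;
import org.axonframework.eventsourcing.eventstore.EventStorageEngine;
import org.axonframework.eventsourcing.eventstore.EventStore;

class ConsumptionTuning {

    // Option 1: disable the shared-consumer optimization entirely.
    static EventStore withoutOptimization(EventStorageEngine storageEngine) {
        return EmbeddedEventStore.builder()
                .storageEngine(storageEngine)
                .optimizeEventConsumption(false)
                .build();
    }

    // Option 2: keep the optimization, but size the cache for your event bursts.
    static EventStore withLargerCache(EventStorageEngine storageEngine) {
        return EmbeddedEventStore.builder()
                .storageEngine(storageEngine)
                .cachedEvents(25_000) // hypothetical value; pick one matching your bursts
                .build();
    }
}
```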

Through the EventProducer's fetching approach and the caching of nodes in a linked list, pressure is put on the garbage collector (GC). Eventually, this prevents the garbage collector from functioning properly.

If you pinpoint the issue to the linked list, that would mean references to Nodes are kept (too long). However, if that were the case, I would expect the issue you have encountered to happen for a lot more people, which I haven't noticed to be the case either :thinking: Just an idea, but perhaps it helps if you force your EmbeddedEventStore to clean out EventConsumers that fall behind more quickly. To achieve that, you can adjust the cleanupDelayMillis value, which defaults to 10_000.

Would you mind running a test with that value set to, e.g., 1_000, just to see whether the issue remains?
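To be concrete, a minimal sketch of that test configuration, again with storageEngine as a placeholder:

```java
import org.axonframework.eventsourcing.eventstore.EmbeddedEventStore;
import org.axonframework.eventsourcing.eventstore.EventStorageEngine;
import org.axonframework.eventsourcing.eventstore.EventStore;

class CleanupTuning {

    // Lowers the delay between Cleaner runs from the default 10_000 ms to
    // 1_000 ms, so EventConsumers that fall behind are detached sooner.
    static EventStore withFasterCleanup(EventStorageEngine storageEngine) {
        return EmbeddedEventStore.builder()
                .storageEngine(storageEngine)
                .cleanupDelayMillis(1_000)
                .build();
    }
}
```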

By the way, I have spent quite some time in the EmbeddedEventStore looking for a spot where the stored Node instances may keep hold of the Node#next field. As shown in your original comment, it is the Node#next field that hits the 4Gb mark.

The most reasonable culprits of this would be the Node oldest value of the overall EmbeddedEventStore, and the Node lastNode of an EventConsumer. The latter should be covered by the Cleaner task. The former sounds like the most reasonable culprit, but to be frank, I would expect this predicament to occur for virtually every user of Axon Framework if that were the case :thinking:


So, let me conclude this comment:

  1. Upgrading to a recent version of Axon Framework is what allows you to receive bug fixes at all. If you decide to stick to old versions, the effort lies on your shoulders instead.
  2. Having just two sagas does not necessarily merit the use of the optimizeEventConsumption property. Perhaps this will change in the future, but as you're still on 4.4.5, I anticipate there's not too much active development happening on this project.
  3. Having just one segment per Tracking Event Processor does not necessarily merit the use of the optimizeEventConsumption property.
  4. If you could validate in your debug environment whether lowering the cleanupDelayMillis value resolves the predicament, that would be valuable information.

Thanks again for opening this issue, @yitizi! :-)

yitizi commented 10 months ago

Thanks for your advice. We are in the process of updating Axon to version 4.7.6 to see if that resolves the issue. The problem has occurred, and we are still working to identify the root cause. Given that the relevant code appears to have undergone minimal changes between versions, we suspect it might recur. Additionally, this problem is not 100% reproducible.

smcvb commented 3 months ago

Closing this issue due to inactivity.

But, be sure to reach out if it occurs again, @yitizi! In that case, we will reopen the issue to deal with it further.