Eventuous / eventuous

Event Sourcing library for .NET
https://eventuous.dev
Apache License 2.0

On heavy load, the PartitioningFilter suddenly flips event processing order #174

ReinderReinders closed this issue 1 year ago

ReinderReinders commented 1 year ago

I have another nicely mysterious bug.

I have run an even bigger load test than last time. I have uploaded 100.000 Entities (one Create event each), with each Entity also having 10 Update events, for a total of 1.1 million events in the Event Store (a small calculation error; I intended to create a test set with 1 million events :) ). The events are in sequence per Entity (i.e. Aggregate), so events 1-11 are the Create and 10 Updates for Entity 1, events 12-22 are for Entity 2, and so on.
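
To make the layout concrete, here is a tiny stand-alone sketch (plain C#, not the actual import tooling) that prints how the global positions map onto Entities under this scheme:

    // Illustration only: 100,000 Entities, each with 1 Create followed by
    // 10 Updates, appended strictly in sequence, giving 1,100,000 events.
    var globalPosition = 0;

    for (var entityId = 1; entityId <= 100_000; entityId++) {
        Console.WriteLine($"{++globalPosition}: Create for Entity {entityId}");

        for (var update = 1; update <= 10; update++) {
            Console.WriteLine($"{++globalPosition}: Update {update} for Entity {entityId}");
        }
    }
    // Entity 1 occupies positions 1-11, Entity 2 occupies 12-22, and so on.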

I have activated the consuming application, which starts a new subscription that consumes all events with 1 Event Handler and a partition count of 20. The consuming application then receives a heavy bulk of events to process (the total batch took 1 hour and 15 minutes to complete, which is excellent performance; my compliments to Eventuous).

Early on in the batch the application consumes the way it is supposed to (i.e. partitioning by stream name and processing events in order):

[screenshot]

However, late in the batch it suddenly starts doing something very strange:

[screenshot]

All of a sudden, Eventuous processes the events concerning about 250 out of the 100.000 Entities in the wrong order. Remember that the events are stored in the stream (globalposition) in sequence per Entity, so for instance (in this screenshot) Entity 99311 has a Create event with globalposition 1092399 followed by Update events 1092400-1092409. However, as can be seen, for some reason they are processed in exactly the wrong order (backwards).

It makes no sense to me at all, especially since this behaviour occurs for some time and then processing returns to in-order again. A screenshot of the end of the logging shows reasonably 'garbled' output again:

[screenshot]

I have not yet been able to add the improved diagnostics/logging to this batchjob. Will post an update later. But for now I wanted to post this issue in the hopes that someone else has seen it before and has an explanation that would make this a little less mystical.

alexeyzimarev commented 1 year ago

I will make a more generic test SUT for subscriptions and then refactor the subscription tests to use it. The plan is to have a single test base, which makes it easier to add subscription tests for different infrastructures. Then, I will add some tests with a large number of events and many partitions.

Honestly, I doubt events can be processed out of order within a single partition. I hope you don't set the concurrency level in the subscription options, as it will break the ordering. It is hard to diagnose anything if you don't post the code that produces the result you provided.

ReinderReinders commented 1 year ago

The bootstrapping code for this application is still the generic subscription registration I created earlier:

        foreach (ExternalEntitySubscription subscription in eventSourcingSettings.ExternalEntitySubscriptions)
        {
            services.AddSubscription<PostgresAllStreamSubscription, PostgresAllStreamSubscriptionOptions>(
                subscription.SubscriptionName,
                builder => builder
                    .AddConsumeFilterLast(new MessageFilter(x =>
                    {
                        foreach (string filter in subscription.FiltersOnStreamName)
                        {
                            if (!x.Stream.ToString().Contains(filter))
                            {
                                return false;
                            }
                        }

                        return true;
                    }))
                    .AddEventHandler<ExternalEntityEventHandler>()
                    // Enables parallel processing. The default option uses the message stream name as partition key, see: https://eventuous.dev/docs/subscriptions/pipes/
                    .WithPartitioningByStream(subscription.PartitionsCount));
        }

The foreach has no effect in this test, since I only register a single subscription. There is also only a single filter, and all test data passes through it (nothing is ignored).

To make matters even worse, I left a batch job running over the weekend that does the same import (1.1 million events, in order) with a partition count of 1 (I made the same assumption you do: that this issue should be impossible when not partitioning). This should make the batch job slower (check) but rule out all issues arising from partitioning. However, I again have about 250 errors, and I again notice that at some points the event processing order is flipped:

[screenshot]

It also appears that some streams are skipped. In the end result I am missing 112 Entities in the receiving database:

[screenshot]

I can see in the logging that it sometimes just skips ahead a hundred or so positions, skipping some whole Streams:

[screenshot]

I am using a Postgres database as both the Stream store and the Checkpoint store, and an MSSQL database as the view model store.

edit: fixed the code snippet formatting

alexeyzimarev commented 1 year ago

Then it's one of the two:

A better test would be to check the position in the event handler per subscription id, and if the current event position is lower than the previous one, log it at Error level. Disable debug logging and let it run.
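
A minimal sketch of such a check, assuming the handler can get hold of the subscription id, stream name and global position of each consumed event (it is not tied to any specific Eventuous API; the handler just calls Check for every event):

    // Tracks the last seen global position per subscription and logs an Error
    // whenever a lower position arrives, i.e. an out-of-order delivery.
    using System.Collections.Concurrent;
    using Microsoft.Extensions.Logging;

    public class OrderingProbe {
        readonly ConcurrentDictionary<string, ulong> _lastSeen = new();
        readonly ILogger<OrderingProbe> _log;

        public OrderingProbe(ILogger<OrderingProbe> log) => _log = log;

        // Call from the event handler for every consumed event.
        public void Check(string subscriptionId, string stream, ulong globalPosition) {
            var previous = _lastSeen.GetOrAdd(subscriptionId, globalPosition);

            if (globalPosition < previous) {
                _log.LogError(
                    "Subscription {SubscriptionId} got position {Current} after {Previous} (stream {Stream})",
                    subscriptionId, globalPosition, previous, stream);
            }

            _lastSeen[subscriptionId] = globalPosition;
        }
    }

With partitioning enabled, keying on the stream name instead of the subscription id is the stricter check: partitions may legitimately interleave global positions while still staying in order per stream.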

ReinderReinders commented 1 year ago

I have had the same suggestion concerning logging from colleagues, but I am pretty sure the logging is happening in proper order. The log file, for most of its run, shows a completely synchronous sequence:

[screenshot]

[screenshot]

edit: added a second example

ReinderReinders commented 1 year ago

@alexeyzimarev I have followed your suggestion from above:

> A better test would be to check the position in the event handler per subscription id, and if the current event position is lower than the previous one, log it at Error level. Disable debug logging and let it run.

and have run the same batch with the same input data over the weekend. The log (I'm only showing the Errors here) again shows streams being played backwards near the end of the batch:

[screenshot]

What is most bizarre is that I am getting (in total) exactly the same number of NOK results as the previous week! This is the output of the same query I screenshotted above, but run this morning on the latest results:

[screenshot]

I am again missing exactly 112 Entities, and exactly 264 Entities were not transferred OK (either one or more events were executed out of order, the entire stream was processed backwards, or events were skipped; I have seen all three in the log). However, and this is the bizarre part, they are not the same Entities as in the screenshots from last week. I had hoped they would be, because that would imply that my test data is corrupt. As far as I can see it is not: I have tracked several streams and messages from the logging through the test set, and everything looks OK and in order, i.e. there are no out-of-sequence events in the test data.

I have also tried to pin down an 'error moment'. For instance, I suspected that something would go wrong the moment the 1 million threshold was crossed (perhaps an issue with too much data being held in memory), but this is not the case (as I would then expect the same Entities to be out of sync). As far as I can see, it is also not an 'exceeded capacity' issue (i.e. after a certain number of processed events nothing goes right anymore), since in my latest test the last Entity that goes wrong is Nr. 99434 (out of 100000), about 36 seconds before the end of the batch. All Entities after that one are processed correctly. This is (from memory) similar to the behaviour I observed last week (and I've run this batch test several times since then). (edit: and as far as I can remember I don't always get exactly the same end result; for instance, I remember seeing a different number of NOK Entities.)

For clarification I should add that I empty the 'receiving' table before each test run, so it's not like I'm accidentally looking at the same test results twice. This does make it harder to debug the issue, since I don't have the results from previous runs available. I have now created a second receiving database and am running the batch job with the same input Event Store against that new database (but now with 1 partition instead of 10), so that I can compare the result sets. Will update this post later.

edit: added some clarification

ReinderReinders commented 1 year ago

Update:

My previous statement was incorrect. I have compared the batch job that ran this weekend (10 partitions) with the batch job I have just completed (1 partition) - same source DB, different but identical destination DBs, same application but different configuration (only the partition count was changed) - and it turns out that exactly the same Entities get out of sync:

[screenshot]

It is hard to analyze precisely because a partitioned application logs asynchronously, but yes it appears that it is again the same exact operations that go wrong. For instance, it starts with a stream getting processed backwards:

[screenshot: log showing only the error lines; Entity 99074 is where the issues start]

Due to the size of the batch job, the logging, and the result set, it is hard to make an exact comparison. But yes, this seems to imply that the issue I am experiencing here has nothing to do with the partitions and the commit handler. (Also, the run with 10 partitions stopped updating its checkpoint somewhere around the 250000 mark, but that is an issue I have reported elsewhere: #165. That run predates the latest bugfix, which might resolve that issue. The run with 1 partition did not fail to update its checkpoint.)

What the above implies to me is that this is some kind of 'undulating' issue: at exactly the same positions (when some cache or memory is full?) an issue occurs that causes errors for a while, then resolves (after Entity 99434 the rest of the database syncs correctly). I also note that this is still an application in the cloud reading from one cloud database and writing to another cloud database - but the above results and the fact that they are the same imply to me that I am not currently experiencing some kind of latency issue - you would expect that to be more random.

Update: I have run exports on both result sets and compared them. The resulting data sets are exactly the same!

ReinderReinders commented 1 year ago

Upon further reflection I should add a few statements to the above 'confusion'.

I can of course no longer rule out some kind of issue with corrupt test data (events in the store), since I am now getting deterministic results. In some of my tests last week there were other, unrelated infrastructural issues (a process in the receiving database that sometimes caused a deadlock). These have been resolved, and the runs are now deterministic. So my previous observation that the number and kind of failures is different every time should be discounted. I have created another test set and will run another batch, but this will take a while (and since I've used the same tooling to create the new test set, it is no guarantee either).

However, as stated before, I have manually checked several of the streams and messages for incorrect data, and I can see no corrupt data in the Event Store. For instance, let's look at Entity 99434 (one of my failures):

[screenshot]

(There is an offset between the Entity ID in the stream_name and the stream_id (1-based instead of 0-based); this does not corrupt the data.)

[screenshot]

Stream position and global position look fine to me.

Also, when I first created the large test set, I had a consuming application online at the same time, which consumed the events in real time while the import was running. The result set was a 100% success (in fact, I wrote the OK/NOK query that I've been using this whole time after that first import). So real-time consumption of 1.1 million events causes no issue, but a catch-up subscription does. To me this points to a memory/caching issue.

The original bug report was not about 'I'm getting incorrect results' (it could still be corrupt input data, even though I can't figure out how - I will continue running tests with new data sets), but about the fact that eventually Streams start getting processed in the wrong order. This is something I keep observing, including in the latest test run with only a single Partition:

[screenshot]

This starts occurring suddenly near the end of the batch, when for some reason the globalposition skips some 300 positions ahead and processing then runs backwards:

[screenshot]

edit: I should also add that I am not observing this issue until after a million events have been processed (in memory, by a single running application). This could be relevant because I previously tested with smaller test sets (including one that stopped at a million events) and did not observe this issue.

ReinderReinders commented 1 year ago

edit: --deleted--

Sorry, I thought I had found something in the Eventuous source, but I was wrong.

ReinderReinders commented 1 year ago

Update: I've run a new test set with 200.000 Entities, each having 5 events (1 Create and 4 Updates). The result: 19424 errors in the log (it might be that I'm logging each error twice).

[screenshot]

Nearly 14.000 Entities were not synced at all, and nearly 3700 have an incorrect state at the end. This was a run with a partition count of 1. The first error again occurs near the end (globalposition 925824). The last error does not occur at the end of the stream (991050 is the last error recorded, so after that about 9000 more events are processed without a hitch; roughly speaking, the errors occur between 92% and 99% of the batch).

Why does a testset with double the streams (and half the events per stream) significantly increase the problem?

ReinderReinders commented 1 year ago

@alexeyzimarev Stop the clock! I have discussed this issue with my colleagues and they have (probably) found the cause. I am going to run another batch to test our solution, and then I will (likely) offer you a Pull Request.

ReinderReinders commented 1 year ago

@alexeyzimarev I would like to take the next step in my IT career and become an Open Source contributor :) by opening a PR that fixes this issue. It's my first time on GitHub... do you have the ability to give me permissions so I can push to this repository? (I have the changes locally but can't push.)

alexeyzimarev commented 1 year ago

You need to fork the repository and then make a PR from your fork. That's the common way to contribute on GitHub. You don't need repository permissions to contribute.

ReinderReinders commented 1 year ago

https://github.com/Eventuous/eventuous/pull/180 Here we go! You were correct: it was an issue in the paging query.
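
For anyone landing here with the same symptom: the actual change is in the PR linked above. Purely as an illustration of this class of bug (this is not the Eventuous query, and the table and column names are made up), a paged read of the all-stream has to order on the global position and page from the last position read, otherwise rows can come back in arbitrary order once the data set gets large:

    // Illustrative only - not the actual Eventuous SQL. The WHERE plus ORDER BY
    // on global_position is what keeps paged reads stable; paging with OFFSET,
    // or reading without an ORDER BY, lets the database return rows in whatever
    // order it likes, which shows up as streams replayed backwards or skipped.
    const string PagedReadSql = @"
        SELECT message_id, message_type, stream_name, global_position, payload
        FROM   messages
        WHERE  global_position > @lastGlobalPosition
        ORDER  BY global_position
        LIMIT  @pageSize;";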

ReinderReinders commented 1 year ago

Closing issue, this fix has resolved it.

alexeyzimarev commented 1 year ago

Good work 👍