EventStore / EventStore-Client-Dotnet

Dotnet Client SDK for the Event Store gRPC Client API written in C#

Regression: System.IndexOutOfRangeException since upgrading EventStore library #322

Open Skyppid opened 2 months ago

Skyppid commented 2 months ago

Describe the bug: After upgrading from 23.2.1 to 23.3.5, we suddenly saw thousands of errors in our service. Essentially every attempt to write a message to EventStore failed, resulting in a lot of data loss.

I tried to figure out the cause, since our code wasn't touched at all. Sentry shows that the exception originates in library-internal methods, so something likely broke there.

[Screenshot: Sentry error report]

Since EventStore unfortunately does not provide any symbols I cannot provide more information. Maybe you want to offer these for future releases to make things easier for all of us?

To Reproduce:

This is all the code we use to store the event. Worked like a charm ever since we developed our service months ago. Only broke after the package upgrade:

public async Task StoreEventAsync(ChangeEventMessage @event, CancellationToken cancellationToken)
{
    await client.AppendToStreamAsync(@event.SourceEntityId.ToString(), StreamState.Any,
        [new EventData(Uuid.NewUuid(), nameof(ChangeEventMessage), JsonSerializer.SerializeToUtf8Bytes(@event, jsonOptions.Value.JsonSerializerOptions))], null, null, null, cancellationToken);
}

Expected behavior: No exception caused by the EventStore library itself.

Actual behavior: IndexOutOfRangeException in a context that does not make sense outside the library.

Config/Logs/Screenshots: See screenshot above.

EventStore details

Additional context: I cannot provide an isolated project to reproduce this issue as of now. Interestingly, the error occurs on both the staging and production clusters, but I cannot reproduce it locally, even after importing a backup of the production event store. The platform might be the difference: the clusters run on Linux, while I develop on Windows.

Maybe you can identify the possible cause by looking at the diffs; something must have changed between these versions...

EDIT: I can confirm that it works again after downgrading to 23.2.1. This is not the first time an EventStore library upgrade has caused issues. Please consider working on your reliability. This should really not happen for enterprise-level software.

w1am commented 2 months ago

Hi @Skyppid

Thank you for submitting the issue. We will review it internally and get back to you.

YoEight commented 2 months ago

Hey @Skyppid

This is the system that I use:

OS: Ubuntu 24.04 LTS, 64-bit
ESDB server: v24.2.0 (Linux tarball)
ESDB client: v23.3.5

This is the code I use:

using System.Text.Json;
using System.Text.Json.Nodes;
using EventStore.Client;

var settings = EventStoreClientSettings.Create("esdb://127.0.0.1:2113?tls=false");
var client = new EventStoreClient(settings);

List<EventData> data = [new EventData(Uuid.NewUuid(), "foobar", JsonSerializer.SerializeToUtf8Bytes(
    new JsonObject {
        ["foobar"] = 42
    }))];

await client.AppendToStreamAsync("foobar", StreamState.Any, data, null, null, null, new CancellationToken());

I was unable to reproduce your issue. Did I miss something on my end? If not, could you try it in an isolated project to see whether you can still reproduce it?

Skyppid commented 2 months ago

@YoEight As usual, it's not easily reproducible. The problem with these scenarios is that by reproducing them in a small app you usually eliminate the probable causes, because you never use exactly the same data and often have different circumstances. Like I said, locally it did work for me, but on both clusters it failed with identical errors for each event written.

The data parameter is an IEnumerable, so the serialization is invoked through the library code and the actual error could lie there. I consider that unlikely, though, because every event fails regardless of its data, and downgrading the ES library fixes the problem.

The problem is that it's hard to trace. The stack trace says that AppendInternal is the culprit, but that method is not very small, so without symbols it's really just a guessing game what's going wrong. Symbols would help immensely in finding a more concrete reason why the exceptions are happening all of a sudden.

YoEight commented 2 months ago

@Skyppid we might have found the reason why you are experiencing that issue. The newest client added support for OpenTelemetry and we think the culprit might be here:

https://github.com/EventStore/EventStore-Client-Dotnet/blob/09fcaa64ea5dc042ef37f6f5e0f7f934b594830d/src/EventStore.Client/Common/Diagnostics/ActivityTagsCollectionExtensions.cs#L18

Is each cluster node configured so that it can be addressed by its domain name only (without explicitly mentioning the port it is listening on)?
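
For illustration, here is a minimal sketch of how a missing port could produce an IndexOutOfRangeException. It is hypothetical code, not taken from the client: it assumes that some code splits a "host:port" target on ':' and reads the second element. The names AuthorityParser, NaiveParse and SafeParse are made up for this example.

```csharp
using System;

// Hypothetical helper, NOT the EventStore client's code.
static class AuthorityParser
{
    // Assumes the input always has the form "host:port".
    // For "node1.example.com", Split returns a single-element array,
    // so parts[1] throws IndexOutOfRangeException.
    public static (string Host, int Port) NaiveParse(string authority)
    {
        var parts = authority.Split(':');
        return (parts[0], int.Parse(parts[1]));
    }

    // Tolerates a missing port by returning null for it.
    // Limitation: unbracketed IPv6 literals are not handled.
    public static (string Host, int? Port) SafeParse(string authority)
    {
        var idx = authority.LastIndexOf(':');
        if (idx > 0 && int.TryParse(authority.Substring(idx + 1), out var port))
            return (authority.Substring(0, idx), port);

        return (authority, null);
    }
}
```

With this sketch, NaiveParse("node1.example.com") throws while SafeParse returns the host with a null port, which is why the question about addressing nodes without an explicit port matters.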

Skyppid commented 2 months ago

That sounds good! We connect using the following basic connection string: esdb://{User}:{Password}@{Host}:{Port}.

var settings = EventStoreClientSettings.Create(
    builder.Configuration.GetConnectionString(CommonConnectionStringIdentifiers.EventStore)
    ?? throw new ConfigurationException("EventStore connection string is not configured.", "ConnectionStrings"));
EventStoreClient client = new(settings);

So it looks like that's not the cause, but maybe it's something similar?

YoEight commented 1 month ago

Hey @Skyppid,

Sorry for the delay. I was never able to reproduce the issue based on the theory I described before. I also looked at it from other angles and didn't find anything. Can you provide the exact connection string that you are using? You can of course replace the server details with mock values.