NEventStore / NEventStore.Persistence.SQL

SQL Persistence Engine for NEventStore
MIT License

Possible Azure SQL #14

Open adamfur opened 9 years ago

adamfur commented 9 years ago

Hi,

We have built a system that uses the PollingClient to feed around 25 commit-observing projection builders. Occasionally in production, several of our projection builders crash simultaneously, at the very same commit checkpoint. When we reset the checkpoint and feed the projection builders the entire event stream again, it passes perfectly. So it appears the polling client sometimes skips a few events! It only seems to happen during read+write-intensive periods, such as data imports.

SQL Azure's default transaction isolation level is different from vanilla MSSQL's.

If I write event+0 and event+1 to the database, is it possible that event+1 becomes visible to a reader before event+0 in SQL Azure?

In MSSQL we haven't been able to reproduce this result.

Any clues?

andreabalducci commented 9 years ago

I haven't worked with SQL Azure yet, but this kind of issue is possible on a distributed system. You should handle this "glitch" in your client: A) polling client -> sequencer -> projection, or B) modify the polling client to read with a few milliseconds' delay.

I will go for A
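For reference, a minimal sketch of the sequencer stage in option A. This is a hypothetical class, not part of NEventStore; it assumes the numeric, gap-free `CheckpointToken` of newer NEventStore versions (`ICommit` is the NEventStore commit interface):

```csharp
using System;
using System.Collections.Generic;
using NEventStore;

// Hypothetical sequencer: buffers out-of-order commits and releases them
// strictly in checkpoint order. Assumes numeric, gap-free checkpoints.
public class Sequencer
{
    private readonly SortedDictionary<long, ICommit> _buffer = new SortedDictionary<long, ICommit>();
    private readonly Action<ICommit> _dispatch;
    private long _nextExpected;

    public Sequencer(long startCheckpoint, Action<ICommit> dispatch)
    {
        _nextExpected = startCheckpoint;
        _dispatch = dispatch;
    }

    public void Handle(ICommit commit)
    {
        _buffer[commit.CheckpointToken] = commit;

        // Release every buffered commit that is now contiguous
        // with the last one dispatched.
        while (_buffer.TryGetValue(_nextExpected, out ICommit next))
        {
            _buffer.Remove(_nextExpected);
            _dispatch(next);
            _nextExpected++;
        }
    }
}
```

The catch is the "next expected" assumption: if checkpoints can have permanent holes, the sequencer stalls.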

adamfur commented 9 years ago

I've messed around a bit and received transaction exceptions in the polling client while using the EnlistInAmbientTransaction() call during the Wireup(). Not sure if it actually solves anything, but I'm going to try it out for a few days.
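For context, a minimal sketch of the wireup in question; the connection-string name "EventStore" is a placeholder, the rest are standard NEventStore wireup calls:

```csharp
using NEventStore;
using NEventStore.Persistence.Sql.SqlDialects;

public static class EventStoreBootstrap
{
    // Hypothetical wireup; "EventStore" is a placeholder connection-string name.
    public static IStoreEvents Build()
    {
        return Wireup.Init()
            .UsingSqlPersistence("EventStore")
            .WithDialect(new MsSqlDialect())
            .EnlistInAmbientTransaction() // the call that produced the transaction exceptions
            .UsingJsonSerialization()
            .Build();
    }
}
```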

Regarding options A and B: I think creating a sequencer is difficult, since we expect a lot of holes in the CheckpointNumber identity column. We are using several buckets, and SQL Azure also sometimes "randomly" bumps the identity by +10,000.

We implemented a version of B where we changed ObserveFrom*() to pass UtcNow - 300ms and ignore all commits newer than that, giving the infrastructure some time to catch up.
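Roughly, the filtering looks like this (a sketch, not our exact code; `CommitStamp` is NEventStore's commit timestamp on `ICommit`):

```csharp
using System;
using NEventStore;

// Sketch of the 300 ms settling delay: ignore commits younger than the delay
// so that slower, lower-checkpoint writes have time to become visible first.
public static class CommitSettling
{
    private static readonly TimeSpan SettleDelay = TimeSpan.FromMilliseconds(300);

    public static bool IsSafeToDispatch(ICommit commit)
    {
        return commit.CommitStamp <= DateTime.UtcNow - SettleDelay;
    }
}
```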

If we settle for opt B, I will eventually send a pull request with a SqlAzureDialect.

adamfur commented 9 years ago

Our issues:

The workaround:

Notes:

1. In case we stumble upon a gap, the clients have to tolerate a lag of 5 seconds before their projections are updated (this happens about twice a day).
2. An invalid sequence retrieval is treated as a transient error; we retry until the next commit has aged to 5 seconds or more, or until we receive the correct sequence.
3. In our logs we can see it has taken almost 0.5s before we ultimately receive the expected sequence number.
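A sketch of the policy in the notes above (the class and method names are hypothetical, and it assumes a numeric `CheckpointToken`; the 5-second window matches note 2):

```csharp
using System;
using NEventStore;

// Sketch of the gap-handling policy: a missing checkpoint is treated as a
// transient error and retried until the out-of-order commit has aged past
// the tolerance window or the expected sequence finally shows up.
public static class GapPolicy
{
    private static readonly TimeSpan GapTolerance = TimeSpan.FromSeconds(5);

    public static bool ShouldDispatch(ICommit commit, long expectedCheckpoint)
    {
        if (commit.CheckpointToken == expectedCheckpoint)
        {
            return true; // the expected sequence arrived; dispatch immediately
        }

        // A gap: keep retrying (re-polling) until the newer commit is older
        // than the tolerance window, then accept the gap as permanent.
        return commit.CommitStamp <= DateTime.UtcNow - GapTolerance;
    }
}
```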

fschmied commented 8 years ago

I believe this to be caused by READ COMMITTED SNAPSHOT, which seems to be on in Azure SQL by default (https://blogs.msdn.microsoft.com/sqlcat/2013/12/26/be-aware-of-the-difference-in-isolation-levels-if-porting-an-application-from-windows-azure-sql-db-to-sql-server-in-windows-azure-virtual-machine/) and is incompatible with NES.

I wonder if creating an AzureSqlDialect using READCOMMITTEDLOCK would have resolved the issue (if that works on Azure SQL).

fschmied commented 6 years ago

I just did a bit of experimentation showing that adding the WITH (READCOMMITTEDLOCK) table hint to NEventStore's queries would probably solve the problem @adamfur observed: it reintroduces the blocking behavior of the polling client that is normally seen under SQL Server but lost under Azure SQL.

fschmied commented 5 years ago

We've created a subclass of MsSqlDialect that adds the READCOMMITTEDLOCK table hint for Azure SQL, and it seems to fix the main problem.
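For anyone searching later, the idea looks roughly like this. It's a sketch under assumptions: that your NEventStore version exposes the polling query as an overridable GetCommitsFromCheckpoint property and uses the default Commits table name; adjust to the statement properties your version actually has:

```csharp
using NEventStore.Persistence.Sql.SqlDialects;

// Sketch: force lock-based READ COMMITTED on the polling query so Azure SQL
// behaves like on-premises SQL Server. The property name and table name are
// assumptions; check the statements your NEventStore version actually uses.
public class AzureSqlDialect : MsSqlDialect
{
    public override string GetCommitsFromCheckpoint
    {
        get
        {
            return base.GetCommitsFromCheckpoint
                .Replace("FROM Commits", "FROM Commits WITH (READCOMMITTEDLOCK)");
        }
    }
}
```

Wiring it in is then just a matter of passing the subclass to the wireup, e.g. `.WithDialect(new AzureSqlDialect())`.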

What remains is the very low likelihood of https://github.com/NEventStore/NEventStore.Persistence.SQL/issues/21 occurring, but I don't think anyone has actually ever seen that in production.