Arkatufus closed this issue 3 years ago.
@ismaelhamed yes, I think I know why, I'm still investigating it. It'll be a separate issue if I'm sure I've pinpointed the problem.
@Aaronontheweb, @ismaelhamed Need to draft the fix for this issue:
The problem with current implementation: It does the batching as an atomic operation, but does not treat success reporting as an atomic operation, and it does not report any failures.
What is a batch? Is it just a group of similar operations that are executed by a single command, or does it have to be atomic?
So the batching journal cheats, and lumps together multiple operations from different persistent actors into the same "atomic batch" in order to achieve higher overall throughput.
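To make the "lumping" concrete, here is a minimal illustrative sketch (not the actual Akka.NET code; the queue shape, `MAX_BATCH` limit, and operation tuples are all assumptions) of how a batching journal might drain requests from many persistent actors into one shared chunk:

```python
# Hypothetical sketch of batch draining: operations from *different*
# persistent actors end up in the same chunk, which the batching journal
# then executes as one "atomic batch" to reduce round-trips.
from collections import deque

MAX_BATCH = 4  # assumed chunk-size limit


def drain_chunk(queue: deque) -> list:
    """Take up to MAX_BATCH pending operations, regardless of which
    persistent actor submitted them."""
    chunk = []
    while queue and len(chunk) < MAX_BATCH:
        chunk.append(queue.popleft())
    return chunk


pending = deque([
    ("actor-a", "write", 1),
    ("actor-b", "write", 1),
    ("actor-a", "delete", 1),
    ("actor-c", "read", 1),
    ("actor-b", "write", 2),
])

# The first chunk mixes operations from actors a, b, and c; actor-b's
# second write waits for the next chunk.
first = drain_chunk(pending)
```

The throughput win comes purely from amortizing per-transaction overhead across many actors' requests, which is exactly why the isolation questions below matter.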
What I think is kind of garbage about the BatchingSqlJournal is how it groups operations together:
From looking at this, it's not clear whether deletes, reads, and writes can all be included in the same SQL transaction. It certainly looks possible based on how that syntax is written, but whether those types of operations would actually appear inside the same chunk together isn't clear. If reads and writes can all appear in the same transaction, then that is total garbage. Those operations should at least be decoupled by operation type. There's no reason a recovery should fail because someone violated a unique key constraint on a write.
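The decoupling suggested above can be sketched as follows (a hedged illustration with assumed operation shapes, not the real BatchingSqlJournal internals): split a mixed chunk by operation type first, so each type can get its own transaction.

```python
# Illustrative sketch: partition a mixed chunk so that reads, writes,
# and deletes never share a transaction. A failed write can then no
# longer poison an unrelated recovery read batched alongside it.
def split_by_kind(chunk: list) -> dict:
    groups: dict = {}
    for op in chunk:
        groups.setdefault(op["kind"], []).append(op)
    return groups


mixed = [
    {"kind": "write", "pid": "a", "seq": 1},
    {"kind": "read", "pid": "b"},
    {"kind": "write", "pid": "c", "seq": 1},
    {"kind": "delete", "pid": "a", "to_seq": 1},
]

# Each group would then be executed in its own transaction.
by_kind = split_by_kind(mixed)
```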
As for grouping "like-kinded" operations together:
If it is just a group of operations, then a transaction should be applied per operation, and each success/failure should also be batch-sent to each interested party at the end of the batch.
Everything in the transaction should succeed or fail together. Otherwise there's no point to having "batching" in the first place: the vanilla SQL journal already does the "right" thing in terms of transaction isolation. The point of the BatchingSqlJournal is to break that to try to get better performance. You're trading isolation guarantees in exchange for fewer total outstanding transactions per ActorSystem and better throughput.
So I'd modify the BatchingSqlJournal to break the different operation types out from each other, because the way it's designed now is nonsense if reads, writes, and deletes can be commingled. But I'd make sure that each individual AtomicWrite that is lumped into a single "batch" transaction gets failed / rejected / succeeded as a group.
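The "succeed or fail as a group" behavior can be sketched with a single SQL transaction. This is a minimal illustration using SQLite and an assumed `journal` schema, not the actual Akka.Persistence tables or API:

```python
# Hedged sketch: apply every write in a batch inside one transaction and
# report one uniform outcome for every request in that batch.
import sqlite3


def execute_batch(conn: sqlite3.Connection, writes: list) -> list:
    """Either all requests commit, or the transaction rolls back and
    all of them are reported as failed."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.executemany(
                "INSERT INTO journal (pid, seq) VALUES (?, ?)",
                [(w["pid"], w["seq"]) for w in writes],
            )
    except sqlite3.DatabaseError as exc:
        return [("failure", w["pid"], w["seq"], str(exc)) for w in writes]
    return [("success", w["pid"], w["seq"], None) for w in writes]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE journal (pid TEXT, seq INTEGER, UNIQUE (pid, seq))")

ok = execute_batch(conn, [{"pid": "a", "seq": 1}, {"pid": "b", "seq": 1}])

# The second batch violates the unique key on ("a", 1): every request in
# it fails together, and even the valid ("c", 1) row is rolled back.
bad = execute_batch(conn, [{"pid": "c", "seq": 1}, {"pid": "a", "seq": 1}])
```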
Yes, in the current implementation, write, read (replay / query) and delete can happen in a single transaction.
Yeah, that's crazy and won't even help with performance. Batching reads and writes together into the same transaction where the same table is being modified seems goofy at best.
Although I should note: I may be surprised here if the benchmarks bear out something different, but I don't think reads and writes should be coupled together like this.
Oh, if you start running persistent actors with aggressive journal sweeping (think an AtLeastOnceDelivery actor where you're snapshotting every 50 messages and only keeping 2 snapshots' worth of journal entries), things get nasty pretty quickly on SQL Server. The risk of deadlock contention when a delete fires goes up under heavy writes, and with the batching journal this leads to writes failing; and, per the original title of this issue, yeah, lots of haywire journal states can happen.
But agreed, Akka.Persistence.Linq2Db only batches writes (well, it technically batches reads as well, but given the way SqlJournal does reads it doesn't really need to as badly), and that's the sane thing to do.
I'm guessing you'll see a perf boost on recovery with such a change: at minimum some improvement from not opening transactions, and at best the parallel recovery tests will see some gains as well (the batching journal does suffer on those). As for the premature WriteMessageSuccess: when that is fixed, I expect the single-Persist numbers of BatchingSqlJournal to drop down a good bit as a result; multi-Persist possibly not quite so much.
This was fixed in v1.4.19.
In the current BatchingSqlJournal implementation, a batch chunk write is processed inside the ExecuteChunk method, which reads each entry inside the chunk and applies a handler method to each of the entries. For the WriteMessages event, this handler method is HandleWriteMessage which, unfortunately, treats each write request inside the WriteMessages event as a standalone write request and fires the WriteMessageSuccess event for each one, while the transaction itself has not been committed yet. HandleWriteMessage should be changed so that it does an "all or nothing" write of all of the requests. Any status events should also be fired in an "all or nothing" fashion at the end of the operation, and any exception should be re-thrown to the calling ExecuteChunk method so that the transaction can be properly rolled back.
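The fix described above can be sketched as follows. This is a hedged illustration, not the actual C# code: the function names, `notify` callback, and SQLite schema are all assumptions standing in for the real ExecuteChunk/HandleWriteMessage machinery. The key idea is that success events are buffered, not fired, until the transaction commits.

```python
# Sketch of the proposed fix: buffer WriteMessageSuccess events per request
# and deliver them only after COMMIT; on error, roll back and deliver
# failures for the whole chunk instead of the buffered successes.
import sqlite3


def execute_chunk(conn: sqlite3.Connection, chunk: list, notify) -> None:
    buffered = []  # success events held back until after the commit
    try:
        with conn:  # commits on success, rolls back on exception
            for req in chunk:
                conn.execute(
                    "INSERT INTO journal (pid, seq) VALUES (?, ?)",
                    (req["pid"], req["seq"]),
                )
                # The premature version would call notify(...) right here,
                # before the transaction is committed.
                buffered.append(("WriteMessageSuccess", req["pid"], req["seq"]))
    except sqlite3.DatabaseError as exc:
        # Transaction rolled back: discard the buffered successes and
        # fail the entire chunk as a group.
        for req in chunk:
            notify(("WriteMessageFailure", req["pid"], req["seq"], str(exc)))
        return
    for event in buffered:
        notify(event)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE journal (pid TEXT, seq INTEGER, UNIQUE (pid, seq))")
events = []

execute_chunk(conn, [{"pid": "a", "seq": 1}], events.append)
# Duplicate ("a", 1) makes the whole second chunk fail: no partial
# successes are ever reported, and ("a", 2) is rolled back with it.
execute_chunk(conn, [{"pid": "a", "seq": 1}, {"pid": "a", "seq": 2}], events.append)
```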