DataIntellectTech / kdb-chronicle-queue

Java adaptor from Chronicle queue to kdb+

Reliability issue during failures to write to KDB #10

Open jarkaxi opened 1 month ago

jarkaxi commented 1 month ago

Hi,

I've been reading through your code to see how it works. I did spot an issue with the reliability of the process and wanted to highlight this.

In https://github.com/DataIntellectTech/kdb-chronicle-queue/blob/main/Adapter/src/main/java/uk/co/aquaq/kdb/adapter/chronicle/ChronicleToKdbAdapter.java#L215 the adapter does the following:

  1. read messages from the Chronicle queue and accumulate them in a kdb envelope
  2. try to send the envelope to KDB
  3. if the send fails, reset the queue state back to where it was before reading
  4. if the application crashes between steps 2 and 3, you now have a gap in the messages and no way of detecting that gap (see the sketch below)
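
To make the crash window concrete, here is a rough sketch of that flow. This is not the adapter's actual code; the KdbSender interface and the field name "msg" are stand-ins for whatever the adapter really uses:

```java
import net.openhft.chronicle.queue.ExcerptTailer;
import net.openhft.chronicle.wire.DocumentContext;

import java.util.ArrayList;
import java.util.List;

public class BatchForwarderSketch {

    /** Stand-in for whatever actually pushes a batch of rows to kdb+. */
    public interface KdbSender {
        boolean trySend(List<String> batch);
    }

    public static void forwardOneBatch(ExcerptTailer tailer, KdbSender sender, int batchSize) {
        long startIndex = tailer.index();            // queue position before reading
        List<String> envelope = new ArrayList<>();

        // 1. read messages from the Chronicle queue and accumulate them in the envelope
        while (envelope.size() < batchSize) {
            try (DocumentContext dc = tailer.readingDocument()) {
                if (!dc.isPresent()) {
                    break;                           // queue drained
                }
                envelope.add(dc.wire().read("msg").text());
            }
        }

        // 2. try to send the accumulated batch to KDB
        boolean sent = sender.trySend(envelope);

        // <-- if the send fails and the process dies here, before step 3 runs,
        //     the tailer is already past messages KDB never received: a gap that
        //     nothing can detect on restart.

        // 3. if the send failed, reset the queue state to where it was before reading
        if (!sent) {
            tailer.moveToIndex(startIndex);
        }
    }
}
```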

Best,

Jark

jarkaxi commented 1 month ago

To add feedback from Chronicle on the ideal setup:

The mechanism I prefer is one of the following. A common use case is for a service to read from a queue and integrate with an external system, e.g. a tick DB. For this purpose, you should ensure that the AbstractEvent.eventId or eventTime is written to the tick DB. When the integration service is restarted, it can then query the tick DB for the last eventTime it has stored and start replaying from there, filtering out events before that point.
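
A minimal sketch of how that first mechanism could look, assuming each queued event carries an eventTime and the tick DB can be asked for the last value it stored (the lookup interface below stands in for a q query along the lines of exec max eventTime from the target table):

```java
import net.openhft.chronicle.queue.ExcerptTailer;
import net.openhft.chronicle.wire.DocumentContext;

public class ReplayFromLastEventTimeSketch {

    /** Stand-in for a kdb+ lookup of the newest eventTime already stored. */
    public interface LastEventTimeLookup {
        long lastStoredEventTimeNanos();
    }

    public static void replay(ExcerptTailer tailer, LastEventTimeLookup kdb) {
        long lastStored = kdb.lastStoredEventTimeNanos();

        while (true) {
            try (DocumentContext dc = tailer.readingDocument()) {
                if (!dc.isPresent()) {
                    break;                                   // nothing left to replay
                }
                long eventTime = dc.wire().read("eventTime").int64();
                if (eventTime <= lastStored) {
                    continue;                                // already in the tick DB, skip
                }
                // ... forward this event to kdb+ as normal ...
            }
        }
    }
}
```

Note that filtering on eventTime alone can skip distinct events that share the last stored timestamp, which is why the feedback above also mentions eventId as an alternative key.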

Alternatively, you can use a Chronicle Queue named tailer (StartFromStrategy=NAMED) to record how far you had read into the queue. However, this only works if you can guarantee that events have been committed to the external system whenever your method reader returns without an exception - see below.
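
For reference, the open-source Chronicle Queue equivalent is a named tailer created via createTailer(String); the queue path and tailer name below are arbitrary, and this is only a sketch of the idea:

```java
import net.openhft.chronicle.queue.ChronicleQueue;
import net.openhft.chronicle.queue.ExcerptTailer;

public class NamedTailerSketch {
    public static void main(String[] args) {
        try (ChronicleQueue queue = ChronicleQueue.singleBuilder("queue-dir").build()) {
            // A named tailer persists its read position inside the queue directory,
            // so a restarted process resumes from the last message it had read.
            ExcerptTailer tailer = queue.createTailer("kdb-adapter");
            // ... read and forward to kdb+ as usual ...
        }
    }
}
```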

If your integration service is reading from a Chronicle Queue using either try (DocumentContext dc = tailer.readingDocument()) { … or a method reader, then throwing an exception from inside the try block, or from the method reader method, will roll back the read from the queue. The next time you read from the tailer you will get the same message. This provides a simple retry mechanism if the external service is transiently unavailable.
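
A minimal sketch of that retry pattern, where forwardToKdb is a hypothetical single-row send that throws if kdb+ is unavailable:

```java
import net.openhft.chronicle.queue.ExcerptTailer;
import net.openhft.chronicle.wire.DocumentContext;

public class RollbackOnExceptionSketch {

    /** Returns false when the queue has no further messages. */
    public static boolean readOne(ExcerptTailer tailer) {
        try (DocumentContext dc = tailer.readingDocument()) {
            if (!dc.isPresent()) {
                return false;
            }
            String row = dc.wire().read("msg").text();
            // If this throws, the exception leaves the try block and the read is
            // rolled back, so the next call sees the same message again.
            forwardToKdb(row);
            return true;
        }
    }

    private static void forwardToKdb(String row) {
        // hypothetical single-row send; throws on transient kdb+ failure
        throw new UnsupportedOperationException("not implemented in this sketch");
    }
}
```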

The first mechanism could be used to fix this adapter, but you would have to add eventTime to your KDB rows. The second works well if you send each row individually to kdb, which is not super-efficient but can work fine.

BGillenDI commented 4 weeks ago

Hi @jarkaxi,

We'll be picking this up from today and will assign someone to look into it immediately; hopefully we'll have an update for you soon.

Thanks for spotting this, and for the legwork with Chronicle as well.

Brien

BGillenDI commented 2 weeks ago

Hi @jarkaxi, Tin merged a fix for this last week. I've left the issue open in the hope that you get a chance to take a look before we resolve it.

jarkaxi commented 1 week ago

Hi @BGillenDI,

That looks like it should work. It might be worth noting that you've changed the messaging guarantee from "exactly once" to "at least once" - e.g. KDB could potentially receive the same message more than once. I'm assuming that doesn't normally have any implications?

This is the case where trySend successfully delivers the message to KDB, but the adapter crashes straight after KDB has received and confirmed reception of the message.

Tin-Pui commented 1 week ago

Hi @jarkaxi, thanks for pointing this out. You are correct: the fix effectively removes a high chance (higher with larger kdb.envelope.size) that a crash causes dropped messages, but it introduces a low chance (lower with larger kdb.envelope.size) that a crash causes duplicate messages in KDB. If that would be preferable, we could optionally include the index from the Chronicle Queue in what is sent to KDB, controlled by a configuration option, to help detect this. It would require a schema change in KDB for it to work.
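
For illustration only, a sketch of what optionally attaching the queue index to each row might look like. The "chronicleIndex" column, the includeIndex flag, and the holder class are hypothetical, and the kdb+ table would need the corresponding extra column:

```java
import net.openhft.chronicle.queue.ExcerptTailer;
import net.openhft.chronicle.wire.DocumentContext;

public class IndexedRowSketch {

    /** Simple holder; in kdb+ the index would become an extra column, e.g. chronicleIndex. */
    public static final class IndexedRow {
        public final long chronicleIndex;
        public final String payload;

        public IndexedRow(long chronicleIndex, String payload) {
            this.chronicleIndex = chronicleIndex;
            this.payload = payload;
        }
    }

    public static IndexedRow readOne(ExcerptTailer tailer, boolean includeIndex) {
        try (DocumentContext dc = tailer.readingDocument()) {
            if (!dc.isPresent()) {
                return null;
            }
            String payload = dc.wire().read("msg").text();
            // The Chronicle Queue index uniquely identifies the excerpt, so a
            // duplicate delivery after a crash shows up as a repeated index in KDB.
            long index = includeIndex ? dc.index() : -1L;
            return new IndexedRow(index, payload);
        }
    }
}
```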