elexon-data / iris-clients

Clients for connecting to the Insights Real-time Information Service

Message lock timeout for large data - Data is duplicated and Python client dies #16

Open · adamwhitneysoftwire opened this issue 1 year ago

adamwhitneysoftwire commented 1 year ago

TL;DR:

Potential solutions to the primary, secondary and additional issues are at the bottom of this post.

This issue encompasses and expands on the wider issue causing #9.

Background

The default message lock timeout for the IRIS queues is 1 minute, after which the queue assumes processing failed and unlocks the message. The message is later retried, and this process repeats until the message is successfully set as complete, or 5 retries have been attempted, at which point the message is dead-lettered.

Large datasets (for example UOU2T3YW) can be slow to parse as JSON, as a single file can be over 11MB in size. This can cause the processing to take longer than the lock timeout.

The file should still be saved successfully, as the content has already been downloaded at the point of delivery. The error occurs when setting the message as complete, and in the unwanted side effects that follow.
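
For illustration, a small diagnostic sketch for the Python client (not part of any of the clients) that compares how long the JSON parse takes against the time left on the message lock. It assumes azure-servicebus 7.x, where a received message exposes locked_until_utc as a timezone-aware datetime and str(msg) returns the body as a string:

import json
import time
from datetime import datetime, timezone

def log_parse_time_vs_lock(msg):
    """Diagnostic only: compare JSON parse time with the remaining lock time."""
    remaining = (msg.locked_until_utc - datetime.now(timezone.utc)).total_seconds()

    start = time.monotonic()
    json.loads(str(msg))  # the slow step for payloads of 10+ MB
    elapsed = time.monotonic() - start

    print(f'Parse took {elapsed:.1f}s with {remaining:.1f}s of lock remaining')
    if elapsed > remaining:
        print('Lock expired mid-parse, so complete_message would fail')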

Primary Issue - Client errors

The exact effect depends on the client, but ultimately all three fail to set the message as complete.

.NET Client

The .NET client attempts to complete the message after processing, which errors because the lock has already expired.

  1. This is caught and logged.
  2. The log is misleading, starting with "Unable to create file" (despite that part of the code working fine).
  3. The client continues to the next message.

Relevant code section (MessageProcessors.cs from line 35):

try
{
    await FileDownloadHelpers.WriteFile(path, filename, jsonString);
    await args.CompleteMessageAsync(args.Message);
}
catch (Exception e)
{
    Console.Error.WriteLine($"Unable to create file \"{filename}\":\n{e}");
}

NodeJS Client

The NodeJS client passes all of this responsibility to the SDK, which automatically sets messages as completed or abandoned accordingly.

  1. The completion of the message will still fail because the lock has expired.
  2. The error is passed to the defined error handler, which simply logs the error as reported.
  3. The client continues to the next file.

Relevant code sections (processors/processError.js line 1; client.js from line 37):

const processError = async (error) => console.error(error);

receiver.subscribe({
    processMessage,  // Normal message processing
    processError,  // Error handling incl. complete and abandon not working
});

Python Client

The issue is most noticeable with the Python client, as the complete and abandon actions are performed explicitly.

  1. Attempting to complete the message raises an exception.
  2. The Python client attempts to abandon the message upon any exception being raised during processing.
  3. A secondary, uncaught exception is raised (abandon also fails for lock timeout).
  4. This kills the client and no further messages are processed until it is restarted.

Relevant code section (client.py from line 67):

for msg in receiver:
    try:
        save_message(msg, download_directory)
        receiver.complete_message(msg)
        logging.debug(f'Successfully processed message.')
    except Exception as e:
        receiver.abandon_message(msg)
        logging.error(f'Unable to process message. Reason: {e}')

Detailed logs of this scenario can be seen on #9.

Secondary Issue - Data duplication

This leads to a secondary issue for all of the clients, which can occur after any failure, not just the lock timeout: redelivered messages are not de-duplicated. Because each client names the output file using the current time (see the code sections below), every redelivery of the same message is saved as a new file.

Relevant code sections

.NET Client, MessageProcessors.cs from line 33:

var time = DateTime.Now.ToString("yyyy-MM-ddTHH-mm-ss_fff");
var filename = $"{dataset}_{time}.json";

NodeJS Client, processors/processMessage.js from line 7:

const now = new Date();
const fileName = `${dataset}_${formatDateForFileName(now)}.json`;

Python Client, client.py line 39:

file_name = f'{dataset}_{datetime.now().strftime("%y%m%dT%H%M%S_%f")}.json'

Potential Solutions

Primary Issue

The primary issue is the lock expiration. Two possible solutions are:

  1. Longer default lock timeout.

    • ⚠ The maximum is 5 minutes, so we will need to investigate whether this is long enough for all possible datasets.
  2. Renewal of the lock before it times out (see the Python sketch after this list).

    • ⚠ This might not be as easy as just turning on an auto-renewer, as the large datasets that take a long time to parse may hog the thread and prevent task-switching to the auto-renewer mid-parse.
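
As a sketch of option 2 for the Python client only (an illustration, not a confirmed fix): azure-servicebus 7.x provides AutoLockRenewer, which renews registered message locks from a background thread. The caveat above still applies, because a long CPU-bound json.loads holds the GIL and can starve the renewer thread. Names such as receiver, save_message and download_directory are reused from client.py above.

import logging

from azure.servicebus import AutoLockRenewer

# Renews registered message locks from a background thread.
renewer = AutoLockRenewer()

for msg in receiver:
    # Register before the slow work so renewal starts straight away,
    # allowing up to 5 minutes of renewal for this message.
    renewer.register(receiver, msg, max_lock_renewal_duration=300)
    try:
        save_message(msg, download_directory)  # slow JSON parse happens in here
        receiver.complete_message(msg)
        logging.debug('Successfully processed message.')
    except Exception as e:
        logging.error(f'Unable to process message. Reason: {e}')

renewer.close()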

Secondary Issue

The secondary issue is duplicate data caused by message redelivery: because each file name is derived from the current time, a retried message is saved again under a new name.
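
One possible mitigation, sketched here for the Python client (an illustration, not something any of the clients currently do), is to derive the file name from the message itself rather than from the processing time, so that a redelivered message overwrites the earlier copy instead of creating a duplicate. This assumes the Service Bus message_id stays the same across redeliveries of a given message:

# Hypothetical replacement for the timestamp-based name in client.py:
# message_id is fixed when the message is sent, so every redelivery
# produces the same file name and overwrites rather than duplicates.
file_name = f'{dataset}_{msg.message_id}.json'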

Additional Issues

The Python client abandon logic is unprotected (a possible guard is sketched below).

The .NET client error logging is misleading.
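
For the Python abandon logic, a minimal sketch of one way to guard it (reusing the names from client.py above; an illustration, not an agreed fix): put the abandon call in its own try/except so that a second lock-related failure is logged instead of escaping the loop and killing the client.

for msg in receiver:
    try:
        save_message(msg, download_directory)
        receiver.complete_message(msg)
        logging.debug('Successfully processed message.')
    except Exception as e:
        logging.error(f'Unable to process message. Reason: {e}')
        try:
            receiver.abandon_message(msg)
        except Exception as abandon_error:
            # If the lock has already expired, abandoning also fails;
            # log it and continue rather than letting the exception
            # escape the loop and stop the client.
            logging.error(f'Unable to abandon message. Reason: {abandon_error}')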

s17b2-voroneckij commented 10 months ago

@robin-leach, is it maybe possible on your side to break up the S0491 messages, which are 17-25 MB in size, into smaller messages? We are still facing this issue and it is critical for us, as not only do we get errors, but these messages may also delay other messages.

robin-leach commented 10 months ago

Hi @s17b2-voroneckij,

We are aware of the issues around the size of S0491, and so on Friday 3rd November we deployed a change to no longer send these files out over IRIS. They will still be available over the API. Hopefully that helps with the delays you are experiencing.

s17b2-voroneckij commented 10 months ago

@robin-leach, thanks for your answer. Indeed, we haven't faced any issues since Friday. However, I opened an issue about this problem in the Azure Java SDK repository: https://github.com/Azure/azure-sdk-for-java/issues/37497. It seems the issue doesn't happen when the transport is set to TransportType.AmqpOverWebsocket. Not sure why.

robin-leach commented 10 months ago

@s17b2-voroneckij Thanks for the information, that's very useful to know! Our plan here is generally to approach this by minimizing the message size as much as possible, thus removing the need for other mitigation. Currently we don't expect messages to get much larger than around 3MB, and if there is a need for future large datasets to go over IRIS we will consider how these can be best divided in order to avoid these kinds of issues.

MarcinKonowalczyk commented 7 months ago

I've come across exactly the same problem and it remains as yet unsolved for me. I'm still receiving the UOU2T3YW messages from the service bus.

I've modified the Python example to follow the example from the azure-servicebus SDK for Python. The client then successfully recovers from a timed-out message, and the message goes out for re-delivery. However, one stalled message stalls the messages behind it, causing all of them not to be properly acknowledged and therefore to be put out for re-delivery, even though the client has already received them.