adamwhitneysoftwire opened 1 year ago
@robin-leach, is it maybe possible on your side to break the S0491 messages, which have the size of 17-25 MBs, into smaller messages? We are still facing this issue and it is very critical, as not only do we get errors, but also these messages may delay other messages.
Hi @s17b2-voroneckij,
We are aware of the issues around the size of S0491, and so on Friday 3rd November we deployed a change to no longer send these files out over IRIS. They will still be available over the API. Hopefully that helps with the delays you are experiencing.
@robin-leach, thanks for your answer. We indeed haven't faced any issues since Friday. However, I opened an issue about this problem in the repository of the Azure Java SDK: https://github.com/Azure/azure-sdk-for-java/issues/37497. It seems the issue doesn't happen when the transport is set to TransportType.AmqpOverWebsocket; I'm not sure why.
@s17b2-voroneckij Thanks for the information, that's very useful to know! Our plan here is generally to approach this by minimizing the message size as much as possible, thus removing the need for other mitigation. Currently we don't expect messages to get much larger than around 3MB, and if there is a need for future large datasets to go over IRIS we will consider how these can be best divided in order to avoid these kinds of issues.
I've come across exactly the same problem, and it remains unsolved for me as yet. I'm still receiving the UOU2T3YW messages from the service bus.
I've modified the Python example to follow the example from the azure-servicebus SDK for Python. The client then successfully recovers from a timed-out message, and the message goes out for redelivery. However, one stalled message also stalls the messages behind it, so none of them is properly acknowledged and all of them go out for redelivery, even though the client has already received them.
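The failure mode described in this thread can be reproduced without a live Service Bus namespace. Below is a minimal sketch, not the repo's actual client: a stub receiver stands in for the azure-servicebus SDK, and the names (StubReceiver, LOCK_TIMEOUT, MAX_DELIVERY) are illustrative. It shows how processing that outlasts the lock makes every completion attempt fail, until the delivery count is exhausted.

```python
# Simulated peek-lock lifecycle: slow processing -> lock expiry ->
# failed completion -> redelivery, up to the max delivery count.
import time

LOCK_TIMEOUT = 0.2   # stand-in for the 1-minute IRIS lock duration
MAX_DELIVERY = 5     # retries before the message is dead-lettered

class LockLostError(Exception):
    """Raised when a message is settled after its lock expired."""

class StubMessage:
    def __init__(self, body: bytes):
        self.body = body
        self.delivery_count = 0
        self.locked_until = 0.0

class StubReceiver:
    def lock(self, msg: StubMessage) -> None:
        msg.delivery_count += 1
        msg.locked_until = time.monotonic() + LOCK_TIMEOUT

    def complete_message(self, msg: StubMessage) -> None:
        if time.monotonic() > msg.locked_until:
            raise LockLostError("lock expired before completion")

def process(msg: StubMessage, parse_seconds: float) -> None:
    time.sleep(parse_seconds)  # stands in for slow JSON parsing

receiver = StubReceiver()
msg = StubMessage(b"large dataset")
outcomes = []
while msg.delivery_count < MAX_DELIVERY:
    receiver.lock(msg)
    process(msg, parse_seconds=0.3)  # longer than the lock timeout
    try:
        receiver.complete_message(msg)
        outcomes.append("completed")
        break
    except LockLostError:
        outcomes.append("lock lost")  # broker will redeliver the message
```

Every iteration ends in "lock lost", so the client handles the same payload five times before the (simulated) broker would dead-letter it.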
TL;DR:
Potential solutions to the primary, secondary and additional issues are at the bottom of this post.
This issue encompasses and expands on the wider issue causing #9.
Background
The default message lock timeout for the IRIS queues is 1 minute, after which the queue assumes processing failed and unlocks the message. The message is later retried, and this process repeats until the message is successfully set as complete, or 5 retries have been attempted, at which point the message is dead-lettered.
Large datasets (for example UOU2T3YW) can be slow to parse as JSON, as a single file can be over 11MB in size. This can cause the processing to take longer than the lock timeout.
The file should be saved successfully, as the content is already downloaded at the point of delivery. The error is in setting the message as complete, and the unwanted side-effects as a result.
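To give a feel for the parsing cost, here is a quick self-contained measurement. The roughly 10 MB payload is synthetic, chosen to approximate the UOU2T3YW file size mentioned above; absolute timings are machine-dependent, and the real clients do more per message than a bare json.loads (validation, disk writes), which is what pushes total processing past the lock timeout.

```python
# Measure raw JSON parse time for a payload comparable in size to UOU2T3YW.
import json
import time

# Build a list of 150,000 small records; serialized, this is ~10 MB.
payload = json.dumps([{"id": i, "value": "x" * 40} for i in range(150_000)])
size_mb = len(payload) / 1_000_000

start = time.perf_counter()
data = json.loads(payload)
elapsed = time.perf_counter() - start
print(f"{size_mb:.1f} MB parsed in {elapsed:.2f}s")
```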
Primary Issue - Client errors
The exact effect depends on the client, but ultimately all three clients fail to set the message as complete.
.NET Client
The .NET client attempts to set the message complete after processing, which errors because the lock has already expired.
Relevant code section (MessageProcessors.cs, from line 35):
NodeJS Client
The NodeJS client passes all of this responsibility to the SDK, which automatically sets messages as completed or abandoned accordingly.
Relevant code sections (processors/processError.js, line 1; client.js, from line 37):
Python Client
The issue is most noticeable with the Python client as the actions are performed explicitly.
Relevant code section (client.py, from line 67):
Detailed logs of this scenario can be seen on #9.
Secondary Issue - Data duplication
This leads to a secondary issue affecting all of the clients, which can occur after any failure, not just a lock timeout: duplicate messages are not de-duplicated.
Currently, all three clients save messages using the dataset name and the current time as reported by the local computer.
When the same message is received again, a new unique filename is generated each time, so the duplicate is saved as a new file.
The client will receive up to 5 copies of this data before the message gets dead-lettered or successfully set as complete.
Relevant code sections:
.NET Client, MessageProcessors.cs, from line 33
NodeJS Client, processors/processMessage.js, from line 7
Python Client, client.py, line 39
Potential Solutions
Primary Issue
The primary issue is the lock expiration. Two possible solutions are:
Longer default lock timeout.
Renewal of the lock before it times out.
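The second option can be sketched without a live namespace. In the real Python SDK (azure-servicebus v7) this is what AutoLockRenewer, or manual calls to receiver.renew_message_lock, provide; here a stub receiver and a background thread illustrate the mechanism, and all names are illustrative rather than taken from the clients.

```python
# Keep a (simulated) message lock alive from a background thread while
# slow processing runs, so completion still succeeds afterwards.
import threading
import time

LOCK_TIMEOUT = 0.3

class StubMessage:
    def __init__(self):
        self.locked_until = time.monotonic() + LOCK_TIMEOUT

class StubReceiver:
    def renew_message_lock(self, msg: StubMessage) -> None:
        msg.locked_until = time.monotonic() + LOCK_TIMEOUT

    def complete_message(self, msg: StubMessage) -> str:
        if time.monotonic() > msg.locked_until:
            raise RuntimeError("lock expired")
        return "completed"

def keep_lock_alive(receiver, msg, stop):
    # Renew at half the lock interval, comparable to what the SDK's
    # AutoLockRenewer automates for real receivers.
    while not stop.wait(LOCK_TIMEOUT / 2):
        receiver.renew_message_lock(msg)

receiver, msg = StubReceiver(), StubMessage()
stop = threading.Event()
t = threading.Thread(target=keep_lock_alive, args=(receiver, msg, stop), daemon=True)
t.start()

time.sleep(0.9)  # "processing" well past the original lock timeout
result = receiver.complete_message(msg)
stop.set()
t.join()
```

Without the renewal thread, the complete_message call here would raise, exactly as in the primary issue above.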
Secondary Issue
The secondary issue is duplicate data due to the message retry.
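One way to address this is to derive the filename from a stable property of the message (its message ID, or a hash of its content) rather than the local wall-clock time, so a redelivery maps to the same file. This is a sketch of the idea only; the function and file-naming scheme below are illustrative, not what the clients currently do.

```python
# De-duplicate retried deliveries by keying the saved file on a stable
# message identifier instead of the receive timestamp.
import hashlib
import tempfile
from pathlib import Path

def save_once(dataset: str, message_id: str, body: bytes, out_dir: Path) -> Path:
    # Same message_id -> same filename, so a redelivery is skipped
    # rather than piling up as another timestamped copy.
    digest = hashlib.sha256(message_id.encode()).hexdigest()[:12]
    path = out_dir / f"{dataset}_{digest}.json"
    if not path.exists():
        path.write_bytes(body)
    return path

out_dir = Path(tempfile.mkdtemp())
first = save_once("UOU2T3YW", "msg-123", b"{}", out_dir)
second = save_once("UOU2T3YW", "msg-123", b"{}", out_dir)  # simulated retry
```

With up to 5 redeliveries per failed message, this caps the on-disk result at one file instead of five.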
Additional Issues
The Python client's abandon logic is unprotected.
The .NET client's error logging is misleading.
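On the Python abandon issue above: the abandon call itself can raise (for example, when the lock has already been lost by the time the client tries to abandon), so it needs its own guard. A sketch with a stub receiver; in the real azure-servicebus SDK the analogous failure surfaces as a lock-lost error from abandon_message.

```python
# Guard the abandon call so a settlement failure cannot crash the client.
class LockLostError(Exception):
    pass

class FlakyReceiver:
    def abandon_message(self, msg) -> None:
        raise LockLostError("cannot abandon: lock already expired")

def safe_abandon(receiver, msg) -> bool:
    try:
        receiver.abandon_message(msg)
        return True
    except LockLostError:
        # The broker redelivers on its own once the lock lapses,
        # so a failed abandon is safe to log and swallow.
        return False

ok = safe_abandon(FlakyReceiver(), object())
```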