Azure / iotedge

The IoT Edge OSS project
MIT License

edgeHub failing due to 'sdk hanging' #6578

Closed DarthLowen closed 2 years ago

DarthLowen commented 2 years ago

Expected Behavior

Our setup is the following: industrial equipment sends out 2 UDP packets per second, and an IoT Edge server acts as the gateway between these machines and the cloud. In larger networks with, for example, 200 machines, edgeHub uses a large amount of processing power and after some time fails with an error explaining that the SDK is hanging (see logs below). We have several containers running: 1 container receives the UDP packets (400/sec; through logging we confirmed that we do in fact receive and parse all of them) and forwards them to edgeHub (using ModuleClient.SendEventAsync), and routes are in place to forward these packets to 2 other containers, generating roughly 1200 messages/sec in total. This runs on an industrial PC with Ubuntu 18.04, a quad-core Intel Celeron processor and 4 GB RAM. I would expect this to be sufficient for 1200 messages/second. All code in our modules is written in .NET Core 3.1.
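For illustration only, here is a minimal sketch of the kind of forwarder module described above (the UDP port and output name are placeholders, not our actual code):

```csharp
using System;
using System.Net.Sockets;
using System.Threading.Tasks;
using Microsoft.Azure.Devices.Client;
using Microsoft.Azure.Devices.Client.Transport.Mqtt;

class UdpForwarder
{
    // Hypothetical port and output name; the real module's values are not in this issue.
    const int UdpPort = 5000;
    const string OutputName = "machineData";

    static async Task Main()
    {
        var mqttSetting = new MqttTransportSettings(TransportType.Mqtt_Tcp_Only);
        ITransportSettings[] settings = { mqttSetting };
        ModuleClient moduleClient = await ModuleClient.CreateFromEnvironmentAsync(settings);
        await moduleClient.OpenAsync();

        using var udp = new UdpClient(UdpPort);
        while (true)
        {
            // Receive one UDP datagram from the industrial equipment...
            UdpReceiveResult datagram = await udp.ReceiveAsync();

            // ...and forward its payload to edgeHub, where routes fan it out to other modules.
            using var message = new Message(datagram.Buffer);
            await moduleClient.SendEventAsync(OutputName, message);
        }
    }
}
```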

Current Behavior

The container that sends the messages takes up all the memory after a few hours, the edgeHub container uses 200% CPU (according to top), and after a while the edgeHub module fails with mentions of 'ProtocolGatewayException', 'Error due to sdk hanging and upstream call timed out', 'EdgeHubCloudSDKException', ... and mayhem ensues. Also, the container that sends the messages happily keeps sending data (and taking up memory); it does not seem to detect that the ModuleClient connection is down.

Steps to Reproduce

Right now, this occurs at our customers' sites and in testing environments; there is no easy way to reproduce it. Run the test solution provided in the comment below.

Context (Environment)

Output of iotedge check

iotedge check does not yield any warnings nor errors

Device Information

Runtime Versions

Note: when using Windows containers on Windows, run docker -H npipe:////./pipe/iotedge_moby_engine version instead

Logs

aziot-edged logs: [iotedged-log.txt](https://github.com/Azure/iotedge/files/9287539/iotedged-log.txt)
edgeAgent logs: [edgeAgent-log.txt](https://github.com/Azure/iotedge/files/9287550/edgeAgent-log.txt)
edgeHub logs: [edgeHub-log.txt](https://github.com/Azure/iotedge/files/9287547/edgeHub-log.txt)

Additional Information

I removed most of the irrelevant lines from the edgeHub log. The interesting messages start at 11:44:26, when edgeHub starts reauthenticating; it doesn't seem to recover from that.

DarthLowen commented 2 years ago

PS: limiting the memory of the container that sends the messages does not cause garbage collection to occur sooner; it just kills the process because it is out of memory. That is a nice fail-safe, but not the intended behaviour.

DarthLowen commented 2 years ago

I have created a test solution with a sender module that sends 400 messages per second (256 bytes each) and 2 receiver modules (2 containers based on the same image) that receive them. I added some Console.WriteLines showing that the system is not capable of sending/receiving this amount of messages. This was tested in a simulator (Ubuntu 20.04 VMware image): MessageFloodSim.tgz

Can you confirm that I'm reaching the limit of this setup? Is 400 messages/sec too much?
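In case it helps, a rough sketch (not the attached solution itself) of what the receiver side of such a module looks like with the .NET SDK; the input name and logging interval are placeholders:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Devices.Client;
using Microsoft.Azure.Devices.Client.Transport.Mqtt;

class Receiver
{
    static long _received;

    static async Task Main()
    {
        ITransportSettings[] settings = { new MqttTransportSettings(TransportType.Mqtt_Tcp_Only) };
        ModuleClient moduleClient = await ModuleClient.CreateFromEnvironmentAsync(settings);
        await moduleClient.OpenAsync();

        // "input1" is a placeholder; it must match the route target in the deployment manifest.
        await moduleClient.SetInputMessageHandlerAsync("input1", OnMessage, moduleClient);

        await Task.Delay(-1); // keep the module alive
    }

    static Task<MessageResponse> OnMessage(Message message, object userContext)
    {
        // Count incoming messages and log every 400th one to estimate the actual receive rate.
        long count = Interlocked.Increment(ref _received);
        if (count % 400 == 0)
            Console.WriteLine($"{DateTime.UtcNow:O} received {count} messages");
        return Task.FromResult(MessageResponse.Completed);
    }
}
```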

nyanzebra commented 2 years ago

Hello @DarthLowen

The 'sdk hanging' error message comes from a workaround for an SDK error (edge uses the same C# SDK to communicate with IoT Hub as you use from your modules). As you can see from the logs, you use MQTT to talk to edgeHub but AMQP to communicate upstream (between edge and IoT Hub). That is usually fine, but the 'sdk hanging' error appears more often with AMQP.

Is it possible to switch to MQTT upstream and see how it behaves? That way you may be able to avoid running into the SDK problem.

Regarding the 400 msg/sec: it is higher than what people usually try with edge. Performance wasn't edge's main purpose when it was designed (it was built to mirror IoT Hub for offline scenarios and to do message preprocessing with modules). Still, 400 msg/sec should be OK on a normal platform. Let us take a look at your test.

nyanzebra commented 2 years ago

So, I tried to replicate this and things seem to be working. Let me know if my setup differs from yours; below is my info from testing:

Version: iotedge 1.3.0

Status, specs, and top output: (screenshots in the original comment)

DarthLowen commented 2 years ago

Hi @nyanzebra,

Thanks for testing; however, I think the SDK issue is indeed not present in this demo. The modules I created were more for the missing-packets/performance issue. I agree it's confusing: I'm starting off with an SDK exception and then supplying modules to test something else. To me these items have always seemed linked (although, as I now understand it, they're not, because the hang would be an AMQP-related thing?). I'll see if I can add a route to $upstream and send some packets to the IoT Hub (we have 1 packet every 3 s going upstream); according to your theory, that should then trigger the SDK exception? Perhaps you would like me to open a separate issue for the missing packets/performance? Let me know.

EDIT: I updated the sources; Receiver1 now sends messages upstream. This is running right now, I'll keep you posted. MessageFloodSimWithUpstream.tgz

What these modules do is:

This leads me to believe that edgeHub can't handle this message load (which I initially thought was the cause of the SDK hang issue).

Could you perhaps share the logs of the Receiver modules? Here are mine (screenshot in the original comment). As you can see, the sender module is not even capable of sending 400 messages every second; it takes around 5 seconds!

For reference, my lscpu and top output: (screenshots in the original comment)

nyanzebra commented 2 years ago

@DarthLowen,

Yes, the exceptions you see are likely related to AMQP. Would you also mind testing the module-to-module message rate with MQTT instead of AMQP, if you aren't already? That might yield faster message transfer.

In the meantime, I will test out your new example.

DarthLowen commented 2 years ago

@nyanzebra, the modules already use MQTT; see the init code below. Only the upstream connection uses AMQP. By the way, this small test project actually failed in the meantime: I just came back from holiday and found 22 mentions of the error in the edgeHub log.

This is the init code:

```csharp
MqttTransportSettings mqttSetting = new MqttTransportSettings(TransportType.Mqtt_Tcp_Only);
ITransportSettings[] settings = { mqttSetting };
// Open a connection to the Edge runtime
ioTHubModuleClient = await ModuleClient.CreateFromEnvironmentAsync(settings);
```

I am planning to completely refactor our code and either send messages in bulk instead of one by one, OR set up a completely separate communication path between our custom modules, bypassing edgeHub. We're currently loading up edgeHub with messages that it has nothing to do with.
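For context, a rough sketch of what I mean by sending in bulk: packing multiple payloads into one edgeHub message. The names and the batch size here are made up for illustration, not our actual code:

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.Azure.Devices.Client;

static class Batching
{
    // Hypothetical batch size; choose it so the serialized batch stays well under
    // the message size that edgeHub / IoT Hub will accept.
    const int BatchSize = 100;

    public static async Task SendBatchedAsync(ModuleClient moduleClient, IReadOnlyList<byte[]> readings)
    {
        var batch = new List<string>(BatchSize);
        foreach (byte[] reading in readings)
        {
            batch.Add(Convert.ToBase64String(reading));
            if (batch.Count == BatchSize)
            {
                await SendAsync(moduleClient, batch);
                batch.Clear();
            }
        }
        if (batch.Count > 0)
            await SendAsync(moduleClient, batch);
    }

    static async Task SendAsync(ModuleClient moduleClient, List<string> batch)
    {
        // One edgeHub message carrying many readings instead of one message per reading.
        byte[] body = Encoding.UTF8.GetBytes(JsonSerializer.Serialize(batch));
        using var message = new Message(body);
        await moduleClient.SendEventAsync("batchedOutput", message);
    }
}
```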

nyanzebra commented 2 years ago

@DarthLowen, yes, batching messages might help with the rate of message transfer. If I recall correctly, the maximum payload of an MQTT publish is 256 MiB.

nyanzebra commented 2 years ago

@DarthLowen, did batching and using MQTT for upstream resolve your problems?

DarthLowen commented 2 years ago

@nyanzebra, I'm confident it will; we haven't had the time to experiment with that yet, as we're really short on staff right now. The plan is to move our inter-module communication away from edgeHub but keep the 'real-time' behaviour of the packets (so no batching), since we do rely on it for some very limited functionality. I'll keep MQTT in the back of my mind in case we encounter the hang issue again once the refactor is complete. The small project I made can trigger the SDK hang, so maybe you can use that to find & fix the problem, but for now this is not a pressing issue for us anymore. Do you propose we close this ticket?

nyanzebra commented 2 years ago

@DarthLowen, yes, if batching and using MQTT solved your problem we can close the ticket. Please reopen it if the problem persists :)

DarthLowen commented 2 years ago

Just an FYI: I implemented RabbitMQ as a message broker between our own modules, resulting in 80% less CPU usage, 50% less memory usage and no dropped packets. Bottom line: don't use edgeHub as a message broker for messages that have nothing to do with IoT Hub or leaf devices...
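For anyone landing here later, a minimal sketch of what publishing module-to-module traffic through RabbitMQ instead of edgeHub can look like with the RabbitMQ.Client package; the host name, queue name and payload are placeholders, not our actual setup:

```csharp
using System.Text;
using RabbitMQ.Client;

class RabbitPublisher
{
    static void Main()
    {
        // "rabbitmq" would be the broker container reachable on the edge device's Docker network.
        var factory = new ConnectionFactory { HostName = "rabbitmq" };
        using IConnection connection = factory.CreateConnection();
        using IModel channel = connection.CreateModel();

        // Placeholder queue; each consuming module reads from its own queue/exchange binding.
        channel.QueueDeclare(queue: "machine-data", durable: false, exclusive: false,
                             autoDelete: false, arguments: null);

        // Publish one message; edgeHub and IoT Hub are not involved at all.
        byte[] payload = Encoding.UTF8.GetBytes("{\"machineId\":1,\"value\":42}");
        channel.BasicPublish(exchange: "", routingKey: "machine-data",
                             basicProperties: null, body: payload);
    }
}
```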