Azure / iotedge

The IoT Edge OSS project
MIT License

IoT Edge produces too much traffic with default configuration, it is not conducive for billable networks such as cellular. #7252

Closed uriel-kluk closed 1 month ago

uriel-kluk commented 3 months ago

Hello IoT Edge team,

I hope you can help us and the community fine-tune our IoT Edge design for billable networks. To give you perspective, our IoT device sends compressed and batched telemetry periodically and on event changes. This application, running stand-alone using MQTT and the device SDK, consumes less than 1 MB a day.

We had good reasons to add containers and more features, so we ported the application to a constrained Linux device running IoT Edge with a cellular modem. But now the bill from the telco is 12x.

After some exploration, we identified some possible optimizations, but we would love to get best practices and recommendations from you:

  1. Amqp appears to be the recommended protocol in the documentation because of its $upstream multiplexing capability. But the ModuleClient in the custom modules runs thinner using Mqtt. Is there a comparison that illustrates when to mix and match transport protocols? And if we use marketplace modules, is there a way to define the channel, or at least read it?

  2. Could you help us figure out which variable controls the following traces and how they affect traffic?
     2.1. Done refreshing device scope identities cache. Waiting for 60 minutes.
     2.2. Entering periodic task to reauthenticate connected clients. (This shows every 5 minutes.)
     2.3. Started task to cleanup processed and stale messages for endpoint test_

  3. I am surprised to see messages like the following, but this might be our code and unrelated to traffic: Cleaned up 516 messages from queue for endpoint test_<route> and 516 messages from message store.

  4. Could you explain how ConfigRefreshFrequencySecs works? Why is it set to every hour by default if the edge agent subscribes to property changes?

  5. What is the advantage of setting CacheTokens to true?

  6. Do you have any other advice to reduce downlink traffic if the application is mainly a telemetry pump?

Here are more configuration variables:

    "edgeAgent": {
      "env": {
        "SendRuntimeQualityTelemetry": { "value": false },
        "ConfigRefreshFrequencySecs": { "value": 86400 },
        "DisableDeviceAnalyticsMetadata": { "value": true },
        "MetricsEnabled": { "value": false },
        "UpstreamProtocol": { "value": "Mqtt" }
      }
    }

    "edgeHub": {
      "env": {
        "OptimizeForPerformance": { "value": false },
        "HttpSettingsEnabled": { "value": false },
        "CloudConnectionIdleTimeoutSecs": { "value": 7200 },
        "ConfigRefreshFrequencySecs": { "value": 86400 },
        "RuntimeLogLevel": { "value": "info" },
        "CacheTokens": { "value": true },
        "UpstreamProtocol": { "value": "Mqtt" },
        "AmqpSettingsEnabled": { "value": false }
      }
    }

nyanzebra commented 3 months ago

@varunpuranik / @veyalla do we have any customer facing documentation on the performance breakdowns of mqtt vs amqp?

  1. This might be marketplace module specific, do you have a specific set of modules in mind? For example, the SimulatedTemperatureSensor can have a custom route defined.
  2. Believe those messages correspond to device token refreshes for connections, are you saying you want control over that frequency?
  3. The queue cleanup should be after messages are sent and doing a batch cleanup of messages from underlying store. Can double check this, but think this happens every so often to clean up state from the underlying database.

For 4,5,6 @vipeller or @varunpuranik any guidance on these?

uriel-kluk commented 3 months ago

Thanks @nyanzebra,

Believe those messages correspond to device token refreshes for connections, are you saying you want control over that frequency?

My only motivation for asking about token refreshes is to reduce traffic. I see a periodic trace every 5 minutes reauthenticating connected clients. I hope this is not creating cellular traffic.

But the main question remains the same: what is causing downloads if the application is just pumping telemetry?

image

nyanzebra commented 3 months ago

@uriel-kluk is the graph for network traffic to/from device? There should be occasional twin data sent to device to see if any changes in modules need to happen as well as some connectivity checks, but would be surprised if it is significant. I think the next steps are to dig deeper.

If you have metrics (https://learn.microsoft.com/en-us/azure/iot-edge/how-to-access-built-in-metrics?view=iotedge-1.4) it will be interesting to see what the following say:

Additionally, if you can provide a support bundle that will be useful.

uriel-kluk commented 3 months ago

Hi Robert,

Here is a chart that shows the Docker network stats by container.

image

Clearly the traffic comes from edgeHub. In this illustration, SimulatorModule is just reporting telemetry periodically, and you can see the line follows along and the delta is almost flat.

But edgeAgent had a spike of 35K.

I am going to move all the heads to AMQP to see if I gain anything. But I am almost sure it has to do with authentication.

Note: The edge device uses X.509 to authenticate with IoT Hub, and the modules use SAS tokens.

 var mqttSetting = new MqttTransportSettings(TransportType.Mqtt_Tcp_Only)
 {
     KeepAliveInSeconds = 5 * 60
 };

 var options = new ClientOptions
 {
     SasTokenTimeToLive = TimeSpan.FromHours(31),
     SasTokenRenewalBuffer = 50,
     ModelId = ModelId,
 };

 ITransportSettings[] settings = { mqttSetting };

 // Create the module client from the Edge runtime environment
 _client = await ModuleClient.CreateFromEnvironmentAsync(settings, options);

 if (_client == null)
 {
     Log.LogError("Failed to connect to IoT Hub");
     return;
 }
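As a rough check on how often a module with these options should renew its SAS token, the sketch below computes the renewal interval. It assumes that SasTokenRenewalBuffer is a percentage of the time to live and that renewal happens once that percentage of the lifetime remains; the function name is hypothetical and not part of the SDK.

```python
from datetime import timedelta


def sas_renewal_interval(ttl: timedelta, renewal_buffer_percent: int) -> timedelta:
    """Time between token renewals (hypothetical helper, not an SDK call).

    Assumes renewal fires when `renewal_buffer_percent` of the token's
    time to live is still remaining.
    """
    if not 0 <= renewal_buffer_percent <= 100:
        raise ValueError("renewal_buffer_percent must be between 0 and 100")
    return ttl * (100 - renewal_buffer_percent) / 100


# Values from the ClientOptions above: 31-hour TTL, 50 percent buffer.
interval = sas_renewal_interval(timedelta(hours=31), 50)
print(interval)  # 15:30:00
```

Under that assumption the module's own token would be renewed roughly every 15.5 hours, so a reauthentication seen every hour would have to come from somewhere else.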

uriel-kluk commented 3 months ago

Hi @nyanzebra ,

Attached support bundle.

Please look at this chart:

image

It is clear that the edgeHub is renegotiating credentials every hour. Is there a way to control the frequency? Identity attestation uses X.509 in this illustration.

Also, the custom module authenticates every hour. This is less impactful, but it still creates traffic.

Support bundle: edge-device-support-bundle.tar.gz

This is the code I am using to connect the custom module with the edgeHub:

 ITransportSettings[] GetTransportSettings()
 {
     switch (transportType)
     {
         case TransportType.Mqtt:
         case TransportType.Mqtt_Tcp_Only:
             return new ITransportSettings[] { new MqttTransportSettings(TransportType.Mqtt_Tcp_Only) };
         case TransportType.Mqtt_WebSocket_Only:
             return new ITransportSettings[] { new MqttTransportSettings(TransportType.Mqtt_WebSocket_Only) };
         case TransportType.Amqp_WebSocket_Only:
             return new ITransportSettings[] { new AmqpTransportSettings(TransportType.Amqp_WebSocket_Only) };
         default:
             return new ITransportSettings[] { new AmqpTransportSettings(TransportType.Amqp_Tcp_Only) };
     }
 }

 var settings = GetTransportSettings();
 var options = new ClientOptions
 {
     ModelId = ModelId,
     SasTokenTimeToLive = TimeSpan.FromDays(1),
 };

 // OpenAsync a connection to the Edge runtime
 _moduleClient = await ModuleClient.CreateFromEnvironmentAsync(settings, options);

nyanzebra commented 3 months ago

@uriel-kluk would you mind trying to set DeviceScopeCacheRefreshRateSecs to a much larger number (the default is 3600 s, or 1 hr)? I am hoping this reduces the download quite a bit, and it shouldn't really affect edgeHub.
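To put the suggested change in perspective, the following sketch counts how many times a day a periodic task runs for a given period. The 3600 s default and the 5-minute reauthentication trace come from this thread; 86400 s is just one possible "much larger" value, and the helper name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def runs_per_day(period_secs: int) -> float:
    """Number of times a task with the given period fires in 24 hours."""
    if period_secs <= 0:
        raise ValueError("period_secs must be positive")
    return SECONDS_PER_DAY / period_secs


# Periods mentioned in the thread, plus one candidate replacement value.
periods = {
    "DeviceScopeCacheRefreshRateSecs (default 3600)": 3600,
    "DeviceScopeCacheRefreshRateSecs (candidate 86400)": 86_400,
    "reauthentication trace (every 5 minutes)": 300,
}
for name, period in periods.items():
    print(f"{name}: {runs_per_day(period):g} runs/day")
```

The count says nothing about how many bytes each run costs on the wire; that still has to be measured.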

uriel-kluk commented 3 months ago

Thank you @nyanzebra. I tried almost every parameter, but I am getting old and lazy, and because I saw the leaf description, I dismissed it. Fingers crossed, this solves the issue. I should be able to provide an update in about an hour :)

Last few days with the default one hour:

image

david-emakenemi commented 3 months ago

@uriel-kluk did the solution that Robert provided solve the issue?

claya75 commented 3 months ago

@david-emakenemi @nyanzebra Unfortunately, no. Thanks for all the great input on keeping our IoT Edge traffic lean on cellular. I've been digging into this with @uriel-kluk and with Wireshark I found some spots where we're using more bandwidth than we'd like or expect. I've attached some captures and charts to help illustrate.

We would love to get your take on why we are getting so much Rx traffic on edgeHub, as shown below, and/or how to reduce it.

image (2024-04-08 12 14 44)

This screenshot is of a Wireshark capture with all Tx traffic filtered out, so this is only Rx traffic coming into our edge device, viewed both as bytes downloaded over time and as the delta time between packets over time.

image (Rx only--bytes and delta time)

This is a typical 24-hour period reported with our SuperSIM from Twilio.

image

Capture files (.pcapng), both covering 2.5 hrs, one filtered to TCP and TLS only: captures.zip
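For anyone who wants to total the Rx bytes per remote host from captures like these, here is a minimal sketch. It assumes the capture has been exported from Wireshark as CSV with the default Source, Destination and Length columns; the device address 10.0.0.2 and the sample rows are made up for illustration, and the function name is hypothetical.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def rx_bytes_by_source(csv_file: TextIO, device_ip: str) -> Counter:
    """Sum frame lengths per source address for packets sent to device_ip."""
    totals: Counter = Counter()
    for row in csv.DictReader(csv_file):
        if row["Destination"] == device_ip:
            totals[row["Source"]] += int(row["Length"])
    return totals


# Made-up sample in the shape of a Wireshark CSV export.
sample = io.StringIO(
    '"No.","Time","Source","Destination","Protocol","Length","Info"\n'
    '"1","0.000","20.49.109.144","10.0.0.2","TLSv1.2","135","Application Data"\n'
    '"2","0.050","10.0.0.2","20.49.109.144","TCP","54","ACK"\n'
    '"3","60.000","20.49.109.144","10.0.0.2","TLSv1.2","135","Application Data"\n'
)
for source, total in rx_bytes_by_source(sample, "10.0.0.2").most_common():
    print(f"{source}\t{total} bytes")
```

With a real export, replace the in-memory sample with `open(path, newline="")` and pass the device's own address.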

nyanzebra commented 3 months ago

@claya75 & @uriel-kluk

Looking at the attached pcap, it looks like packets regarding certs are few and, while large, aren't the majority of downloads.

Looking further, there is consistent traffic from 20.49.109.144, which I assume to be your IoT Hub, with a length of about 135 bytes (TLS record length 64). Given the size, these are likely not PUBACKs within MQTT.
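Whether packets this small can add up to a noticeable bill depends on how often they arrive, which the comment above does not state. The sketch below sizes the daily downlink for a recurring 135-byte frame at a few assumed intervals; the intervals are placeholders to be replaced with the delta times seen in the capture, and the helper name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def daily_rx_bytes(frame_len_bytes: int, interval_secs: float) -> float:
    """Bytes received per day from one frame of the given size every interval."""
    if interval_secs <= 0:
        raise ValueError("interval_secs must be positive")
    return frame_len_bytes * SECONDS_PER_DAY / interval_secs


# 135 bytes is the frame length reported above; the intervals are assumptions.
for interval in (1, 10, 60):
    total = daily_rx_bytes(135, interval)
    print(f"every {interval:>2} s: {total / 1_000_000:.2f} MB/day")
```

This counts only the frames themselves; the TCP ACKs sent in response add uplink traffic on top.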

I will try to reproduce this by setting up a local device and monitoring traffic without TLS to see if I can get some answers. I will also ask the IoT Hub team if they have any ideas.

uriel-kluk commented 3 months ago

Thank you, Robert, for looking deeper.

We are changing the runtime settings as follows:

    "systemModules": {
      "edgeAgent": {
        "type": "docker",
        "env": {
          "SendRuntimeQualityTelemetry": { "value": false },
          "ConfigRefreshFrequencySecs": { "value": 86400 },
          "DisableDeviceAnalyticsMetadata": { "value": true },
          "MetricsEnabled": { "value": false },
          "RuntimeLogLevel": { "value": "info" },
          "UpstreamProtocol": { "value": "Mqtt" }
        },
        "settings": {
          "image": "mcr.microsoft.com/azureiotedge-agent:DEFAULT_RT_IMAGE",
          "createOptions": {
            "LogConfig": {
              "Type": "json-file",
              "Config": { "max-size": "1m", "max-file": "3" }
            }
          }
        }
      },
      "edgeHub": {
        "type": "docker",
        "status": "running",
        "restartPolicy": "always",
        "env": {
          "UpstreamProtocol": { "value": "Mqtt" },
          "OptimizeForPerformance": { "value": true },
          "HttpSettingsEnabled": { "value": false },
          "AmqpSettingsEnabled": { "value": false },
          "CloudConnectionIdleTimeoutSecs": { "value": 86400 },
          "ConfigRefreshFrequencySecs": { "value": 86400 },
          "RuntimeLogLevel": { "value": "info" },
          "DeviceScopeCacheRefreshRateSecs": { "value": 86400 },
          "MinTwinSyncPeriodSecs": { "value": 86400 },
          "ConnectivityCheckFrequencySecs": { "value": 300 }
        },
        "settings": {
          "image": "mcr.microsoft.com/azureiotedge-hub:DEFAULT_RT_IMAGE",
          "createOptions": {
            "HostConfig": {
              "PortBindings": {
                "5671/tcp": [ { "HostPort": "5671" } ],
                "8883/tcp": [ { "HostPort": "8883" } ],
                "443/tcp": [ { "HostPort": "443" } ]
              }
            }
          }
        }
      }
    },


ryanwinter commented 1 month ago

@uriel-kluk is this still an outstanding issue?

@nyanzebra do you have any update here?

uriel-kluk commented 1 month ago

We opened a support ticket and we can close the current issue. Thanks