Closed david-fitzgerald-re closed 11 months ago
Having found this comment https://github.com/Azure/iot-identity-service/issues/542#issuecomment-1741029534, I tried setting prefer_module_identity_cache = true
in /etc/aziot/config.toml
and applying the config with both sudo azioctl config apply
and iotedge config apply
.
This did not have any effect on the duration of the get_twin
call. But it did decrease the duration of the create_from_edge_environment
call when I pull the ethernet plug.
Func create_from_edge_environment took 0.037818665005033836
Func get_twin took 65.94500249600969
This doesn't solve my problem but I thought it was interesting. This does kind of beg more questions though, why there is a difference in duration of instantiating the client
object with both approaches (the iptables
approach and the unplugging WAN ethernet approach)
Btw the get_twin
call taking 65 seconds is (I think) because I did not immediately run my python
test code. The get_twin
call seems to take about 80seconds from the moment I physically unplug the WAN connection.
Ah, intersting, there is another configuration parameter in /etc/aziot/config.toml
called cloud_timeout_sec
but cannot be reduced below 70...
sudo aziotctl config apply
Caused by:
TOML parse error at line 4, column 21
|
4 | cloud_timeout_sec = 10
| ^^
cloud_timeout_sec must be at least 70 seconds
Is there really no way to use a cached twin after 10 seconds?
Hey @david-fitzgerald-re thanks for reporting this. I'm looking into it and will update you shortly.
Hey @vadim-kovalyov, any updates? 😄
Hey @david-fitzgerald-re sorry for slow responses, it was holiday week here in US. I'm trying to repro this issue and will update you soon.
No probs @vadim-kovalyov.
If you need anything more from me to help repro please reach out.
Hey @david-fitzgerald-re I took some time for me to setup new env and python module (sorry, I'm not a python guy), but after all I was able to repro the issue with your steps, with exception that I used ip route add prohibit <hub_id>/32
to simulate network down in my box. Using AMQP as upstream protocol I did see the initial delay of 60+ sec for get_twin()
if it coincides with the time when EdgeHub starts to notice disconnect. I'm not yet sure where that delay is coming from, trying to find out. But once EdgeHub "goes offline", the subsequent calls are timed out with CloudOperationTimeoutSecs
interval, and if I update that env var I get faster timeouts (tried with 5 sec and it worked).
I brought this issue up the chain to see if this is something we can fix. But for now, I'd like to learn more about your use case to maybe suggest a workaround. Why do you call get_twin() on disconnect and why long timeout creates a problem? you should still have a previous twin (even if a bit outdated) that you can use, and fetch the new in the background.
Hey @vadim-kovalyov, thanks very much for taking the time to repro! Good to know about the ip route add prohibit
I might have to try that.
So with my use case, the problem is a bit of an edge case but I can describe it like so:
edgeHub
to "realise" it is offline and retrieve the cached twinFor example, imagine that our device restarts whilst it has no internet, edgeHub will wait to see if it can connect to the internet for 60 seconds.
We want our device to be robust and quick to react to offline status including circumstances when the device restarts whilst offline.
This also becomes a problem in testing, because it causes the tests to wait up to 60 seconds. I had written some e2e tests to cut the net connection and see how our modules behave but realised it was blocking when retrieving the twin. I was trying to avoid mocking as much as possible, especially since these are e2e tests. It might be I have to make some allowances in this situation.
The behaviour I would like the client to exhibit could expressed like so:
twin = client.get_twin(cached=True)
or
twin = client.get_twin(timeout=5)
where, in the first example, we instantly get our cached twin, we can then start our app up instantly with that and then in the background make a blocking networked call to get_twin
to the cloud whenever the net connection comes up.
and, in the second example, it would behave similarly but with some small amount of initial blocking before giving up on the net call. I think the former would be more favourable in our use case.
On the topic, do you know if it is possible (or advisable) for us to query the RocksDB ourselves? If we had an authorised query we could just get the twin ourself, though I suspect this is probably neither possible nor advisable.
I see. My quick answer to that problem would be to have a persistent volume for your custom module container and save your representation of twin there, so it can survive the module restart.
We can't recommend querying RocksDB directly.
As I mention in my previous comment, I observed 60+ delay only once right after EdgeHub loses connection, the subsequent calls were bounded by CloudOperationTimeoutSecs
and updating it reduced the timeout. Is this the same as you observed? if so, you can reduce CloudOperationTimeoutSecs
to 5 sec. But for initial 60+ second delay I don't have answer yet.
@vadim-kovalyov thanks for the continued investigation, yes that is the same as we have observed.
I think we will persist our representation of the twin like you suggested. Thanks very much for your help, I think I'm happy to close this issue if you are, we will be able to work something out that fits our use case on our side.
Thanks @david-fitzgerald-re, I'll close this issue for now, but we will keep internal ticket open and continue investigating the 60s delay.
I am trying to improve my module's network resilience. If we have no internet connection, I wish for the cached module twin to be retrieved quickly.
Expected Behavior
At the moment (with no internet connection) it always takes at least a minute for
edgeHub
to give up trying to connect to IoTHub such that it will return a cached module twin.I believe it is related to this line in the
edgeHub
logs:I wish for this to be configurable, say, if you can't get the twin from IoTHub within ~10 seconds, return your cached twin.
I have attempted to configure all of the time-related environment variables possible defined within this file but sadly none of them are causing the desired effect of a shorter fallback to using the cached twin.
We should be able to use environment variables to configure how long it takes for a
client
(apython
IoTHubModuleClient
specifically) to timeout such that aget_twin
call will fallback to a local version of the twin whichedgeHub
caches.Basically, I would like the results to look like this:
Current Behavior
Without an internet connection, it takes this long to retrieve the (cached) twin:
With an internet connection it takes this long to retrieve the (non-cached) twin:
Steps to Reproduce
edgeHub
andedgeAgent
. Additionally add a custom module to the manifest - it can basically be empty, just with python installed.In the custom module, open an
ipython
terminal (python
itself is probably fine)On the docker host, simulate a network outage with the following
iptables
command (inspired by this but adapted with this in mind)Note: this is run on the docker host (on the device) not on the custom module
Immediately after, run this code in the
ipython
shell:Context (Environment)
azure-iot-device 2.12.0
on my moduleOutput of
iotedge check
Click here
``` REDACTED-user@REDACTED-device:~$ sudo iotedge check --iothub-hostname <<Device Information
aarch64
Runtime Versions
1.4.20
1.4.14
1.4.14
24.0.7
Logs
edge-agent logs
``` I decided to remove the `edgeAgent` logs because the GH issue description exceeded the maximum character limit. I have got it saved and can post if requested. ```edge-hub logs
``` <6> 2023-11-13 17:32:41.017 +00:00 [INF] - Successfully generated identity for clientId REDACTED-device/REDACTED-module and username REDACTED-host/REDACTED-device/REDACTED-module/?api-version=2019-10-01&DeviceClientType=azure-iot-device-iothub-py%2F2.12.0%283.8.10%3BLinux%20%231%20SMP%20PREEMPT%20Tue%20Jan%2024%2015%3A09%3A44%20PST%202023%3Baarch64%REDACTED-module-net-test <6> 2023-11-13 17:32:41.017 +00:00 [INF] - "ClientAuthenticated, REDACTED-device/REDACTED-module, 53ce12cb" <6> 2023-11-13 17:32:41.018 +00:00 [INF] - New device connection for device REDACTED-device/REDACTED-module <6> 2023-11-13 17:32:41.018 +00:00 [INF] - Client REDACTED-device/REDACTED-module connected to edgeHub, processing existing subscriptions. <6> 2023-11-13 17:32:41.018 +00:00 [INF] - Bind device proxy for device REDACTED-device/REDACTED-module <6> 2023-11-13 17:32:41.018 +00:00 [INF] - Binding message channel for device Id REDACTED-device/REDACTED-module <6> 2023-11-13 17:32:41.018 +00:00 [INF] - Processing pending subscriptions for REDACTED-device/REDACTED-module <6> 2023-11-13 17:32:41.019 +00:00 [INF] - Processing subscriptions TwinResponse for client REDACTED-device/REDACTED-module. <6> 2023-11-13 17:32:41.019 +00:00 [INF] - Processing pending subscriptions for REDACTED-device/REDACTED-module <6> 2023-11-13 17:32:41.020 +00:00 [INF] - Set subscriptions from session state for REDACTED-device/REDACTED-module <6> 2023-11-13 17:32:50.265 +00:00 [INF] - Error due to sdk hanging and upstream call timed out, closing cloud proxy for REDACTED-device/$edgeHub. Microsoft.Azure.Devices.Edge.Hub.Core.EdgeHubCloudSDKException: Operation timed out due to SDK hanging at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudProxy.<>c.<.cctor>b__43_0() in /mnt/vss/_work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudProxy.cs:line 51 at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160 at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudProxy.UpdateReportedPropertiesAsync(IMessage reportedPropertiesMessage) in /mnt/vss/_work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudProxy.cs:line 229 <6> 2023-11-13 17:32:50.266 +00:00 [INF] - Exiting connected state <6> 2023-11-13 17:32:50.266 +00:00 [INF] - Entering unreachable state <6> 2023-11-13 17:32:50.267 +00:00 [INF] - Entering disconnected state <6> 2023-11-13 17:32:50.270 +00:00 [INF] - Cloud connection for REDACTED-device/$edgeHub is False <6> 2023-11-13 17:32:50.270 +00:00 [INF] - Connection status for REDACTED-device/$edgeHub changed to Disconnected <6> 2023-11-13 17:32:50.270 +00:00 [INF] - Cloud connection for REDACTED-device/$edgeHub is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/$edgeHub changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/$edgeHub is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/$edgeHub changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/$edgeHub is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/$edgeHub changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/$edgeHub is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/$edgeHub changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/$edgeHub is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/$edgeHub changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/$edgeHub is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/$edgeHub changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/$edgeHub is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/$edgeHub changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/$edgeHub is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/$edgeHub changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/$edgeHub is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/$edgeHub changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/$edgeHub is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/$edgeHub changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/$edgeHub is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/$edgeHub changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/$edgeHub is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/$edgeHub changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/$edgeHub is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/$edgeHub changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/$edgeHub is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/$edgeHub changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/$edgeHub is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/$edgeHub changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/REDACTED-module is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/REDACTED-module changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/$edgeHub is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/$edgeHub changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/REDACTED-module is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/REDACTED-module changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/$edgeHub is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/$edgeHub changed to Disconnected <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Cloud connection for REDACTED-device/REDACTED-module is False <6> 2023-11-13 17:32:50.271 +00:00 [INF] - Connection status for REDACTED-device/REDACTED-module changed to Disconnected <6> 2023-11-13 17:32:50.272 +00:00 [INF] - Retrying cloud proxy operation UpdateReportedPropertiesAsync for REDACTED-device/$edgeHub. System.TimeoutException: The operation timed out. ---> Microsoft.Azure.Devices.Client.Exceptions.IotHubCommunicationException: Transient network error occurred, please retry. ---> System.TimeoutException: The operation has timed out. at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpTransportHandler.RoundTripTwinMessageAsync(AmqpTwinMessageType amqpTwinMessageType, TwinCollection reportedProperties, CancellationToken cancellationToken) at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpTransportHandler.SendTwinPatchAsync(TwinCollection reportedProperties, CancellationToken cancellationToken) at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.<>c__DisplayClass27_0.<Additional Information
edgeHub
logs I can see the following line of which I would like to make the00:01:00
configurable:iptables
rules I get a slightly different result, instantiating theclient
takes longer but theget_twin
call takes about the same amount of time:on_connection_state_change
callback and that appears to show that the cached twin is only retrieved when the client is marked as disconnected:How can I trigger the module client to get marked as disconnected sooner than 1min?