Azure / iotedge

The IoT Edge OSS project
MIT License
1.46k stars 461 forks source link

edgehub does not reconnect through AMQPS after being offline for a few days #6421

Open emilm opened 2 years ago

emilm commented 2 years ago

Expected Behavior

It should reauthenticate through TPM when booting

Current Behavior

It fails to get new token in edgeHub

Steps to Reproduce

Provide a detailed set of steps to reproduce the bug.

  1. boot up and let it run
  2. take it offline by powering it off
  3. wait 3+ days / artificially adjust the date to in the past
  4. boot it up and wait

Context (Environment)

Embedded

Output of iotedge check

Click here ``` Configuration checks -------------------- √ config.yaml is well-formed - OK √ config.yaml has well-formed connection string - OK √ container engine is installed and functional - OK × config.yaml has correct hostname - Error config.yaml has hostname but device reports hostname . Hostname in config.yaml must either be identical to the device hostname or be a fully-qualified domain name that has the device hostname as the first component. √ config.yaml has correct URIs for daemon mgmt endpoint - OK √ latest security daemon - OK √ host time is close to real time - OK √ container time is close to host time - OK √ DNS server - OK √ production readiness: certificates - OK ‼ production readiness: logs policy - Warning Container engine is not configured to rotate module logs which may cause it run out of disk space. Please see https://aka.ms/iotedge-prod-checklist-logs for best practices. You can ignore this warning if you are setting log policy per module in the Edge deployment. √ production readiness: Edge Agent's storage directory is persisted on the host filesystem - OK √ production readiness: Edge Hub's storage directory is persisted on the host filesystem - OK Connectivity checks ------------------- √ host can connect to and perform TLS handshake with DPS endpoint - OK √ host can connect to and perform TLS handshake with IoT Hub AMQP port - OK √ host can connect to and perform TLS handshake with IoT Hub HTTPS / WebSockets port - OK √ host can connect to and perform TLS handshake with IoT Hub MQTT port - OK √ container on the default network can connect to IoT Hub AMQP port - OK √ container on the default network can connect to IoT Hub HTTPS / WebSockets port - OK √ container on the default network can connect to IoT Hub MQTT port - OK √ container on the IoT Edge module network can connect to IoT Hub AMQP port - OK √ container on the IoT Edge module network can connect to IoT Hub HTTPS / WebSockets port - OK √ container on the IoT Edge module network can connect to IoT Hub MQTT port - OK 21 check(s) succeeded. 1 check(s) raised warnings. 1 check(s) raised errors. ```

Device Information

Runtime Versions

Server: Engine: Version: 19.03.15 API version: 1.40 (minimum version 1.12) Go version: go1.14.15 Git commit: 420b1d36250f9cfdc561f086f25a213ecb669b6f Built: Wed Jul 21 14:23:45 2021 OS/Arch: linux/arm Experimental: true containerd: Version: v1.2.14-3-g3b3e9d5f6.m GitCommit: 3b3e9d5f62a114153829f9fbe2781d27b0a2ddac.m runc: Version: 1.0.0-rc8 GitCommit: 425e105d5a03fabd737a126ad93d62a9eeede87f-dirty docker-init: Version: 0.18.0 GitCommit: fec3683-dirty


## Logs

<details>
<summary>aziot-edged logs</summary>

Jun 07 13:14:04 iotedged[657]: 2022-06-07T13:14:04Z [ERR!] - server connection error: (unknown) Jun 07 13:14:04 iotedged[657]: 2022-06-07T13:14:04Z [ERR!] - error writing a body to connection: Broken pipe (os error 32) Jun 07 13:14:14 iotedged[657]: 2022-06-07T13:14:14Z [INFO] - [mgmt] - - - [2022-06-07 13:14:14.652387057 UTC] "GET /modules?api-version=2020-07-07 HTTP/1.1" 200 OK 5263 "-" "-" auth_id(-) Jun 07 13:14:17 iotedged[657]: 2022-06-07T13:14:17Z [INFO] - [work] - - - [2022-06-07 13:14:17.904619514 UTC] "POST /modules/%24edgeAgent/genid/637714546228305949/sign?api-version=2018-06-28 HTTP/1.1" 200 OK 57 "-" "-" auth_id(-) Jun 07 13:14:18 iotedged[657]: 2022-06-07T13:14:18Z [INFO] - [mgmt] - - - [2022-06-07 13:14:18.231480021 UTC] "POST /device/reprovision?api-version=2020-07-07 HTTP/1.1" 200 OK - "-" "-" auth_id(-) Jun 07 13:14:18 iotedged[657]: 2022-06-07T13:14:18Z [INFO] - [work] - - - [2022-06-07 13:14:18.795788386 UTC] "POST /modules/%24edgeAgent/genid/637714546228305949/sign?api-version=2018-06-28 HTTP/1.1" 200 OK 57 "-" "-" auth_id(-) Jun 07 13:14:19 iotedged[657]: 2022-06-07T13:14:19Z [INFO] - [mgmt] - - - [2022-06-07 13:14:19.120883185 UTC] "POST /device/reprovision?api-version=2020-07-07 HTTP/1.1" 200 OK - "-" "-" auth_id(-) Jun 07 13:14:24 iotedged[657]: 2022-06-07T13:14:24Z [INFO] - [mgmt] - - - [2022-06-07 13:14:24.690852211 UTC] "GET /modules?api-version=2020-07-07 HTTP/1.1" 200 OK 5263 "-" "-" auth_id(-) Jun 07 13:14:34 iotedged[657]: 2022-06-07T13:14:34Z [INFO] - [mgmt] - - - [2022-06-07 13:14:34.729423980 UTC] "GET /modules?api-version=2020-07-07 HTTP/1.1" 200 OK 5263 "-" "-" auth_id(-) Jun 07 13:14:44 iotedged[657]: 2022-06-07T13:14:44Z [INFO] - [mgmt] - - - [2022-06-07 13:14:44.770494473 UTC] "GET /modules?api-version=2020-07-07 HTTP/1.1" 200 OK 5263 "-" "-" auth_id(-)

</details>

<details>
<summary>edge-agent logs</summary>

2022-06-07T13:13:04,343Z [.business.LoadReporterImpl-0] ERROR .utils.TimedExecutor - Unhandled exception in runnable: .business.LoadReporterImpl java.lang.IllegalStateException: Cannot send event from an IoT Hub client that is closed. at com.microsoft.azure.sdk.iot.device.DeviceIO.sendEventAsync(DeviceIO.java:308) at com.microsoft.azure.sdk.iot.device.InternalClient.sendEventAsync(InternalClient.java:254) at com.microsoft.azure.sdk.iot.device.ModuleClient.sendEventAsync(ModuleClient.java:416) at .backend.EdgeControllerService.sendMessageAsync(EdgeControllerService.java:420) at .backend.EdgeControllerService.sendMessageAsync(EdgeControllerService.java:375) at .business.LoadReporterImpl.reportLoad(LoadReporterImpl.java:83) at .business.LoadReporterImpl.run(LoadReporterImpl.java:41) ---> Microsoft.Azure.Amqp.AmqpException: Put token failed. status-code: 401, status-description: The specified SAS token has an invalid signature. It does not match either the primary or secondary key.. ---> Microsoft.Azure.Amqp.AmqpException: Put token failed. status-code: 401, status-description: The specified SAS token has an invalid signature. It does not match either the primary or secondary key.. at Microsoft.Azure.Amqp.AmqpCbsLink.<>cDisplayClass4_0.b1(IAsyncResult a) at System.Threading.Tasks.TaskFactory1.FromAsyncCoreLogic(IAsyncResult iar, Func2 endFunction, Action1 endAction, Task1 promise, Boolean requiresSynchronization) --- End of stack trace from previous location where exception was thrown --- at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTCbsLink.SendTokenAsync(ICbsTokenProvider tokenProvider, Uri namespaceAddress, String audience, String resource, String[] requiredClaims, TimeSpan timeout) --- End of inner exception stack trace --- at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTCbsLink.SendTokenAsync(ICbsTokenProvider tokenProvider, Uri namespaceAddress, String audience, String resource, String[] requiredClaims, TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpAuthenticationRefresher.InitLoopAsync(TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTConnection.CreateRefresherAsync(DeviceIdentity deviceIdentity, TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpConnectionHolder.CreateRefresherAsync(DeviceIdentity deviceIdentity, TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.EnsureSessionAsync(TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.OpenAsync(TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpTransportHandler.OpenAsync(CancellationToken cancellationToken) at Microsoft.Azure.Devices.Client.Transport.ProtocolRoutingDelegatingHandler.OpenAsync(CancellationToken cancellationToken) at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.<>cDisplayClass23_0.<b0>d.MoveNext() --- End of stack trace from previous location where exception was thrown --- at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.ExecuteWithErrorHandlingAsync[T](Func1 asyncOperation) at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.<>c__DisplayClass33_0.<<OpenInternalAsync>b__0>d.MoveNext() --- End of stack trace from previous location where exception was thrown --- at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.EnsureOpenedAsync(CancellationToken cancellationToken) at Microsoft.Azure.Devices.Client.InternalClient.OpenAsync() at Microsoft.Azure.Devices.Edge.Agent.IoTHub.ModuleClientProvider.CreateAndOpenSdkModuleClient(UpstreamProtocol upstreamProtocol, ConnectionStatusChangesHandler statusChangedHandler) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azur e.Devices.Edge.Agent.IoTHub/ModuleClientProvider.cs:line 221 at Microsoft.Azure.Devices.Edge.Agent.IoTHub.ModuleClientProvider.<>c__DisplayClass19_1.<<CreateSdkModuleClient>b__2>d.MoveNext() in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.IoTHub/ModuleClientProvider.cs:l ine 185 --- End of stack trace from previous location where exception was thrown --- at Microsoft.Azure.Devices.Edge.Util.Fallback.ExecuteAsync[T](Func1[] options) --- End of inner exception stack trace --- ---> (Inner Exception #1) Microsoft.Azure.Devices.Client.Exceptions.UnauthorizedException: error(condition:amqp:unauthorized-access,description:Put token failed. status-code: 401, status-description: The specified SAS token has an inval id signature. It does not match either the primary or secondary key..) ---> Microsoft.Azure.Amqp.AmqpException: Put token failed. status-code: 401, status-description: The specified SAS token has an invalid signature. It does not match either the primary or secondary key.. at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result) at Microsoft.Azure.Amqp.AmqpCbsLink.<>cDisplayClass4_0.b1(IAsyncResult a) at System.Threading.Tasks.TaskFactory1.FromAsyncCoreLogic(IAsyncResult iar, Func2 endFunction, Action1 endAction, Task1 promise, Boolean requiresSynchronization) --- End of stack trace from previous location where exception was thrown --- at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTCbsLink.SendTokenAsync(ICbsTokenProvider tokenProvider, Uri namespaceAddress, String audience, String resource, String[] requiredClaims, TimeSpan timeout) --- End of inner exception stack trace --- at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTCbsLink.SendTokenAsync(ICbsTokenProvider tokenProvider, Uri namespaceAddress, String audience, String resource, String[] requiredClaims, TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpAuthenticationRefresher.InitLoopAsync(TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTConnection.CreateRefresherAsync(DeviceIdentity deviceIdentity, TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpConnectionHolder.CreateRefresherAsync(DeviceIdentity deviceIdentity, TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.EnsureSessionAsync(TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.OpenAsync(TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpTransportHandler.OpenAsync(CancellationToken cancellationToken) at Microsoft.Azure.Devices.Client.Transport.ProtocolRoutingDelegatingHandler.OpenAsync(CancellationToken cancellationToken) at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.<>cDisplayClass23_0.<b0>d.MoveNext() --- End of stack trace from previous location where exception was thrown --- at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.ExecuteWithErrorHandlingAsync[T](Func1 asyncOperation) at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.<>c__DisplayClass33_0.<<OpenInternalAsync>b__0>d.MoveNext() --- End of stack trace from previous location where exception was thrown --- at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.EnsureOpenedAsync(CancellationToken cancellationToken) at Microsoft.Azure.Devices.Client.InternalClient.OpenAsync() at Microsoft.Azure.Devices.Edge.Agent.IoTHub.ModuleClientProvider.CreateAndOpenSdkModuleClient(UpstreamProtocol upstreamProtocol, ConnectionStatusChangesHandler statusChangedHandler) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azur e.Devices.Edge.Agent.IoTHub/ModuleClientProvider.cs:line 221 at Microsoft.Azure.Devices.Edge.Agent.IoTHub.ModuleClientProvider.<>c__DisplayClass19_1.<<CreateSdkModuleClient>b__3>d.MoveNext() in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.IoTHub/ModuleClientProvider.cs:l ine 191 --- End of stack trace from previous location where exception was thrown --- at Microsoft.Azure.Devices.Edge.Util.Fallback.ExecuteAsync[T](Func1[] options)<---

<4> 2022-06-07 13:15:29.898 +00:00 [WRN] - Empty edge agent config was received. Attempting to read config from backup (//iotedge/edgeAgent/backup.json) instead <4> 2022-06-07 13:15:39.938 +00:00 [WRN] - Empty edge agent config was received. Attempting to read config from backup (//iotedge/edgeAgent/backup.json) instead <4> 2022-06-07 13:15:49.987 +00:00 [WRN] - Empty edge agent config was received. Attempting to read config from backup (//iotedge/edgeAgent/backup.json) instead ```
edge-hub logs ``` <6> 2022-06-07 13:16:59.296 +00:00 [INF] - Error getting cloud connection for device /$edgeHub Microsoft.Azure.Devices.Client.Exceptions.UnauthorizedException: error(condition:amqp:unauthorized-access,description:Put token failed. status-code: 401, status-description: The specified SAS token has an invalid signature. It does not m atch either the primary or secondary key..) ---> Microsoft.Azure.Amqp.AmqpException: Put token failed. status-code: 401, status-description: The specified SAS token has an invalid signature. It does not match either the primary or secondary key.. at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result) at Microsoft.Azure.Amqp.AmqpCbsLink.<>c__DisplayClass4_0.b__1(IAsyncResult a) at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization) --- End of stack trace from previous location where exception was thrown --- at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTCbsLink.SendTokenAsync(ICbsTokenProvider tokenProvider, Uri namespaceAddress, String audience, String resource, String[] requiredClaims, TimeSpan timeout) --- End of inner exception stack trace --- at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTCbsLink.SendTokenAsync(ICbsTokenProvider tokenProvider, Uri namespaceAddress, String audience, String resource, String[] requiredClaims, TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpAuthenticationRefresher.InitLoopAsync(TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTConnection.CreateRefresherAsync(DeviceIdentity deviceIdentity, TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpConnectionHolder.CreateRefresherAsync(DeviceIdentity deviceIdentity, TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.EnsureSessionAsync(TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.OpenAsync(TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpTransportHandler.OpenAsync(CancellationToken cancellationToken) at Microsoft.Azure.Devices.Client.Transport.ProtocolRoutingDelegatingHandler.OpenAsync(CancellationToken cancellationToken) at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.<>c__DisplayClass23_0.<b__0>d.MoveNext() --- End of stack trace from previous location where exception was thrown --- at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.ExecuteWithErrorHandlingAsync[T](Func`1 asyncOperation) at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.<>c__DisplayClass33_0.<b__0>d.MoveNext() --- End of stack trace from previous location where exception was thrown --- at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.EnsureOpenedAsync(CancellationToken cancellationToken) at Microsoft.Azure.Devices.Client.InternalClient.OpenAsync() at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 144 at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ModuleClientWrapper.OpenAsync() in /mnt/vss/_work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ModuleClientWrapper.cs:line 63 at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ConnectivityAwareClient.<>c__DisplayClass30_0.<b__0>d.MoveNext() in /mnt/vss/_work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ConnectivityAwareClient.cs:line 195 --- End of stack trace from previous location where exception was thrown --- at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ConnectivityAwareClient.InvokeFunc[T](Func`1 func, String operation, Boolean useForConnectivityCheck) in /mnt/vss/_work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/Connec tivityAwareClient.cs:line 188 at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ConnectivityAwareClient.InvokeFunc[T](Func`1 func, String operation, Boolean useForConnectivityCheck) in /mnt/vss/_work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/Connec tivityAwareClient.cs:line 188 at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ConnectivityAwareClient.OpenAsync() in /mnt/vss/_work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ConnectivityAwareClient.cs:line 64 at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection.ConnectToIoTHub(ITokenProvider newTokenProvider) in /mnt/vss/_work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnection.cs:line 134 at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection.CreateNewCloudProxyAsync(ITokenProvider newTokenProvider) in /mnt/vss/_work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnection.cs:line 117 at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection.Create(IIdentity identity, Action`2 connectionStatusChangedHandler, ITransportSettings[] transportSettings, IMessageConverterProvider messageConverterProvider, IClientProv ider clientProvider, ICloudListener cloudListener, ITokenProvider tokenProvider, TimeSpan idleTimeout, Boolean closeOnIdleTimeout, TimeSpan operationTimeout, String productInfo, Option`1 modelId) in /mnt/vss/_work/1/s/edge-hub/src/Micros oft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnection.cs:line 99 at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnectionProvider.Connect(IClientCredentials clientCredentials, Action`2 connectionStatusChangedHandler) in /mnt/vss/_work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudPro xy/CloudConnectionProvider.cs:line 135 <6> 2022-06-07 13:17:03.487 +00:00 [INF] - Entering periodic task to reauthenticate connected clients ```
## Additional Information This has happened between several iotedge releases - it fails to reconnect after being offline for a few days. I am not sure what I am doing wrong. I use iot-device-client 1.34.3 from Maven calling open() until it succeeds... I have persistent storage of the container states Not using RTC, but save the date regularily to file and load it at the earliest point possible after boot. But it would still be days behind the today's date until systemd-timesyncd syncs the time. Could it be that it gets the token before time is synced? Shouldn't it try to get token again if it fails? I am 99% sure this is happening since I am able to reproduce with the clock put in the past. It should seriously recover from this in my opinion.
Deployment config for edgeHub / edgeAgent ``` "systemModules": { "edgeAgent": { "settings": { "image": "mcr.microsoft.com/azureiotedge-agent:1.1.13-linux-arm32v7", "createOptions": "{\"HostConfig\":{\"Memory\":104857600,\"MemorySwap\":104857600,\"Binds\":[\"//iotedge/://iotedge/\"]}}" }, "type": "docker", "env": { "storageFolder": { "value": "//iotedge" } } }, "edgeHub": { "settings": { "image": "mcr.microsoft.com/azureiotedge-hub:1.1.13-linux-arm32v7", "createOptions": "{\"HostConfig\":{\"Binds\":[\"//iotedge/://iotedge/\"],\"PortBindings\":{\"443/tcp\":[{\"HostPort\":\"443\"}],\"5671/tcp\":[{\"HostPort\":\"5671\"}],\"8883/tcp\":[{\"HostPort\":\"8883\"}]}}}" }, "type": "docker", "env": { "OptimizeForPerformance": { "value": false }, "httpSettings__enabled": { "value": false }, "mqttSettings__enabled": { "value": false }, "storageFolder": { "value": "//iotedge" } }, "status": "running", "restartPolicy": "always" } } } }, "$edgeHub": { "properties.desired": { "routes": { "route": "FROM /messages/* INTO $upstream" }, "schemaVersion": "1.1", "storeAndForwardConfiguration": { "timeToLiveSecs": 7200 } } } } }, ```
vipeller commented 2 years ago

@emilm Thank you for reporting this and the repro steps. You are right that the scenario when the local clock gets synchronized by some delay should work. I am opening a bug ticket for this and will be fixed.

emilm commented 2 years ago

@emilm Thank you for reporting this and the repro steps. You are right that the scenario when the local clock gets synchronized by some delay should work. I am opening a bug ticket for this and will be fixed.

Thanks! @vipeller ! Do you have a suggestion for a workaround in the meantime that can be solved in code? It works when iotedged is restarted (not hub or agent, or any of my services) . It could be a little difficult to restart a systemd service from within a container. If not, do you have an ETA for when this fix is in place and in what version? Thanks! :)

vipeller commented 2 years ago

@emlim sorry, I have not dug into the code yet. I need to understand the root cause to suggest something for workaround in the interim. I'll come back with some more information later.

vipeller commented 2 years ago

Hi @emilm, just an update that I started working on this. So far I have checked the source code, which seems to be written the way that when the connection fails (for any reason, including the invalid token), the does regenerate the token. I tried it from visual studio manipulating the time and it recovers after the correct time is set. My first idea was that then maybe the container time somehow got stuck in the past, but the log you pasted looks like it has the time stamp of the day you created this github issue - so I assume that it was after the time got synced and apparently the container also picked up the correct time? Anyway, trying this out with containers to see what can cause the issue

vipeller commented 2 years ago

some more updates: I tried to repro it using ubuntu 20.04 - one difference is that I use the 1.2 services (as I had that already installed). but 1.14 agent and hub

It works differently: if I start the edge with time back (by a week), I see the errors you see. But for some reason when I set the time (using timedatectl set-ntp 1), then the containers get recreated immediately (and therefore later everything works)

Keep trying with the repro

emilm commented 2 years ago

Hello! Thanks for trying to reproduce. I use 1.1.13. https://github.com/Azure/azure-iotedge/releases/download/1.1.13/iotedge-1.1.13.tar.gz and mcr.microsoft.com/azureiotedge-hub:1.1.13-linux-arm32v7 + mcr.microsoft.com/azureiotedge-hub:1.1.13-linux-arm32v7.

Set the date to 5+ days behind before starting iotedged and any container, then start them.

I would like to try 1.1.14 but there seems to be something different with this release so it doesn't contain the same structure

rahulrana-XOM commented 2 years ago

Hey @emilm, did you manage to make any headway with this?

emilm commented 2 years ago

Hello @rahulrana-XOM ! No, I still have the same issue, 1.1.14 is only container updates AFAIK, and wouldn't affect this I think, because the issue is in iotedged itself. I know this, because it starts on the old time, and the containers start in the newly synced time. So I have no chance to detect the time differences myself to make a workaround. I would then have to make a very broad watchdog as a workaround that would reboot it after x minutes / hours of being offline which is very far from optimal. OR I would have to make modifications to the OS to save the date on startup and compare that date to the date n minutes after startup and check if the delta is too big, and then restart.

As you can see these are quite clunky and dirty solutions, and time consuming and costly to implement.

btw: How long does the clock drift have to be for it to be invalid?

vipeller commented 2 years ago

@emilm I've been trying to repro it - last time I used a raspberry for 32 bit, but it seems working the way I test it, or it work differently than you described.

on a pi I tested two ways:

first step in all case is to change the time back 5 days.

then If I restart the device, what happens is that edge daemon tries to connect iothub, which will fail with a token problem and at the end no modules will be started. If I understand, this does not happen to you

if I don't restart the entire device, only I stop the containers, then they come up, showing the same error message you have. But when I turn the time sync on and the time gets synced, both recover.

Best would be a call so we can look closely how you do your stuff, maybe that would help me to repro it on my side.

If you shoot me a mail, the address is like my github id and the domain is the MS domain - i can set a call up then

emilm commented 2 years ago

Thanks a lot @vipeller ! I am definitely not ruling out that I am doing something wrong. It could be similar symptoms but the problem might be different. But I can consistently reproduce it. I will shoot you a mail! thanks!

github-actions[bot] commented 2 years ago

This issue is being marked as stale because it has been open for 30 days with no activity.