Open emilm opened 2 years ago
@emilm Thank you for reporting this and the repro steps. You are right that the scenario when the local clock gets synchronized by some delay should work. I am opening a bug ticket for this and will be fixed.
@emilm Thank you for reporting this and the repro steps. You are right that the scenario when the local clock gets synchronized by some delay should work. I am opening a bug ticket for this and will be fixed.
Thanks! @vipeller ! Do you have a suggestion for a workaround in the meantime that can be solved in code? It works when iotedged is restarted (not hub or agent, or any of my services) . It could be a little difficult to restart a systemd service from within a container. If not, do you have an ETA for when this fix is in place and in what version? Thanks! :)
@emlim sorry, I have not dug into the code yet. I need to understand the root cause to suggest something for workaround in the interim. I'll come back with some more information later.
Hi @emilm, just an update that I started working on this. So far I have checked the source code, which seems to be written the way that when the connection fails (for any reason, including the invalid token), the does regenerate the token. I tried it from visual studio manipulating the time and it recovers after the correct time is set. My first idea was that then maybe the container time somehow got stuck in the past, but the log you pasted looks like it has the time stamp of the day you created this github issue - so I assume that it was after the time got synced and apparently the container also picked up the correct time? Anyway, trying this out with containers to see what can cause the issue
some more updates: I tried to repro it using ubuntu 20.04 - one difference is that I use the 1.2 services (as I had that already installed). but 1.14 agent and hub
It works differently: if I start the edge with time back (by a week), I see the errors you see. But for some reason when I set the time (using timedatectl set-ntp 1), then the containers get recreated immediately (and therefore later everything works)
Keep trying with the repro
Hello! Thanks for trying to reproduce. I use 1.1.13. https://github.com/Azure/azure-iotedge/releases/download/1.1.13/iotedge-1.1.13.tar.gz and mcr.microsoft.com/azureiotedge-hub:1.1.13-linux-arm32v7 + mcr.microsoft.com/azureiotedge-hub:1.1.13-linux-arm32v7.
Set the date to 5+ days behind before starting iotedged and any container, then start them.
I would like to try 1.1.14 but there seems to be something different with this release so it doesn't contain the same structure
Hey @emilm, did you manage to make any headway with this?
Hello @rahulrana-XOM ! No, I still have the same issue, 1.1.14 is only container updates AFAIK, and wouldn't affect this I think, because the issue is in iotedged itself. I know this, because it starts on the old time, and the containers start in the newly synced time. So I have no chance to detect the time differences myself to make a workaround. I would then have to make a very broad watchdog as a workaround that would reboot it after x minutes / hours of being offline which is very far from optimal. OR I would have to make modifications to the OS to save the date on startup and compare that date to the date n minutes after startup and check if the delta is too big, and then restart.
As you can see these are quite clunky and dirty solutions, and time consuming and costly to implement.
btw: How long does the clock drift have to be for it to be invalid?
@emilm I've been trying to repro it - last time I used a raspberry for 32 bit, but it seems working the way I test it, or it work differently than you described.
on a pi I tested two ways:
first step in all case is to change the time back 5 days.
then If I restart the device, what happens is that edge daemon tries to connect iothub, which will fail with a token problem and at the end no modules will be started. If I understand, this does not happen to you
if I don't restart the entire device, only I stop the containers, then they come up, showing the same error message you have. But when I turn the time sync on and the time gets synced, both recover.
Best would be a call so we can look closely how you do your stuff, maybe that would help me to repro it on my side.
If you shoot me a mail, the address is like my github id and the domain is the MS domain - i can set a call up then
Thanks a lot @vipeller ! I am definitely not ruling out that I am doing something wrong. It could be similar symptoms but the problem might be different. But I can consistently reproduce it. I will shoot you a mail! thanks!
This issue is being marked as stale because it has been open for 30 days with no activity.
Expected Behavior
It should reauthenticate through TPM when booting
Current Behavior
It fails to get new token in edgeHub
Steps to Reproduce
Provide a detailed set of steps to reproduce the bug.
Context (Environment)
Embedded
Output of
iotedge check
Click here
``` Configuration checks -------------------- √ config.yaml is well-formed - OK √ config.yaml has well-formed connection string - OK √ container engine is installed and functional - OK × config.yaml has correct hostname - Error config.yaml has hostnameDevice Information
Runtime Versions
Server: Engine: Version: 19.03.15 API version: 1.40 (minimum version 1.12) Go version: go1.14.15 Git commit: 420b1d36250f9cfdc561f086f25a213ecb669b6f Built: Wed Jul 21 14:23:45 2021 OS/Arch: linux/arm Experimental: true containerd: Version: v1.2.14-3-g3b3e9d5f6.m GitCommit: 3b3e9d5f62a114153829f9fbe2781d27b0a2ddac.m runc: Version: 1.0.0-rc8 GitCommit: 425e105d5a03fabd737a126ad93d62a9eeede87f-dirty docker-init: Version: 0.18.0 GitCommit: fec3683-dirty
Jun 07 13:14:04 iotedged[657]: 2022-06-07T13:14:04Z [ERR!] - server connection error: (unknown)
Jun 07 13:14:04 iotedged[657]: 2022-06-07T13:14:04Z [ERR!] - error writing a body to connection: Broken pipe (os error 32)
Jun 07 13:14:14 iotedged[657]: 2022-06-07T13:14:14Z [INFO] - [mgmt] - - - [2022-06-07 13:14:14.652387057 UTC] "GET /modules?api-version=2020-07-07 HTTP/1.1" 200 OK 5263 "-" "-" auth_id(-)
Jun 07 13:14:17 iotedged[657]: 2022-06-07T13:14:17Z [INFO] - [work] - - - [2022-06-07 13:14:17.904619514 UTC] "POST /modules/%24edgeAgent/genid/637714546228305949/sign?api-version=2018-06-28 HTTP/1.1" 200 OK 57 "-" "-" auth_id(-)
Jun 07 13:14:18 iotedged[657]: 2022-06-07T13:14:18Z [INFO] - [mgmt] - - - [2022-06-07 13:14:18.231480021 UTC] "POST /device/reprovision?api-version=2020-07-07 HTTP/1.1" 200 OK - "-" "-" auth_id(-)
Jun 07 13:14:18 iotedged[657]: 2022-06-07T13:14:18Z [INFO] - [work] - - - [2022-06-07 13:14:18.795788386 UTC] "POST /modules/%24edgeAgent/genid/637714546228305949/sign?api-version=2018-06-28 HTTP/1.1" 200 OK 57 "-" "-" auth_id(-)
Jun 07 13:14:19 iotedged[657]: 2022-06-07T13:14:19Z [INFO] - [mgmt] - - - [2022-06-07 13:14:19.120883185 UTC] "POST /device/reprovision?api-version=2020-07-07 HTTP/1.1" 200 OK - "-" "-" auth_id(-)
Jun 07 13:14:24 iotedged[657]: 2022-06-07T13:14:24Z [INFO] - [mgmt] - - - [2022-06-07 13:14:24.690852211 UTC] "GET /modules?api-version=2020-07-07 HTTP/1.1" 200 OK 5263 "-" "-" auth_id(-)
Jun 07 13:14:34 iotedged[657]: 2022-06-07T13:14:34Z [INFO] - [mgmt] - - - [2022-06-07 13:14:34.729423980 UTC] "GET /modules?api-version=2020-07-07 HTTP/1.1" 200 OK 5263 "-" "-" auth_id(-)
Jun 07 13:14:44 iotedged[657]: 2022-06-07T13:14:44Z [INFO] - [mgmt] - - - [2022-06-07 13:14:44.770494473 UTC] "GET /modules?api-version=2020-07-07 HTTP/1.1" 200 OK 5263 "-" "-" auth_id(-)
2022-06-07T13:13:04,343Z [.business.LoadReporterImpl-0] ERROR .utils.TimedExecutor - Unhandled exception in runnable: .business.LoadReporterImpl
java.lang.IllegalStateException: Cannot send event from an IoT Hub client that is closed.
at com.microsoft.azure.sdk.iot.device.DeviceIO.sendEventAsync(DeviceIO.java:308)
at com.microsoft.azure.sdk.iot.device.InternalClient.sendEventAsync(InternalClient.java:254)
at com.microsoft.azure.sdk.iot.device.ModuleClient.sendEventAsync(ModuleClient.java:416)
at .backend.EdgeControllerService.sendMessageAsync(EdgeControllerService.java:420)
at .backend.EdgeControllerService.sendMessageAsync(EdgeControllerService.java:375)
at .business.LoadReporterImpl.reportLoad(LoadReporterImpl.java:83)
at .business.LoadReporterImpl.run(LoadReporterImpl.java:41)
---> Microsoft.Azure.Amqp.AmqpException: Put token failed. status-code: 401, status-description: The specified SAS token has an invalid signature. It does not match either the primary or secondary key..
---> Microsoft.Azure.Amqp.AmqpException: Put token failed. status-code: 401, status-description: The specified SAS token has an invalid signature. It does not match either the primary or secondary key..
at Microsoft.Azure.Amqp.AmqpCbsLink.<>cDisplayClass4_0.b1(IAsyncResult a)
at System.Threading.Tasks.TaskFactoryb 0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.ExecuteWithErrorHandlingAsync[T](Funcb1(IAsyncResult a)
at System.Threading.Tasks.TaskFactoryb 0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.ExecuteWithErrorHandlingAsync[T](Func
<4> 2022-06-07 13:15:29.898 +00:00 [WRN] - Empty edge agent config was received. Attempting to read config from backup (/1.FromAsyncCoreLogic(IAsyncResult iar, Func
2 endFunction, Action1 endAction, Task
1 promise, Boolean requiresSynchronization) --- End of stack trace from previous location where exception was thrown --- at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTCbsLink.SendTokenAsync(ICbsTokenProvider tokenProvider, Uri namespaceAddress, String audience, String resource, String[] requiredClaims, TimeSpan timeout) --- End of inner exception stack trace --- at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTCbsLink.SendTokenAsync(ICbsTokenProvider tokenProvider, Uri namespaceAddress, String audience, String resource, String[] requiredClaims, TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpAuthenticationRefresher.InitLoopAsync(TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTConnection.CreateRefresherAsync(DeviceIdentity deviceIdentity, TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpConnectionHolder.CreateRefresherAsync(DeviceIdentity deviceIdentity, TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.EnsureSessionAsync(TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.OpenAsync(TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpTransportHandler.OpenAsync(CancellationToken cancellationToken) at Microsoft.Azure.Devices.Client.Transport.ProtocolRoutingDelegatingHandler.OpenAsync(CancellationToken cancellationToken) at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.<>cDisplayClass23_0.<1 asyncOperation) at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.<>c__DisplayClass33_0.<<OpenInternalAsync>b__0>d.MoveNext() --- End of stack trace from previous location where exception was thrown --- at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.EnsureOpenedAsync(CancellationToken cancellationToken) at Microsoft.Azure.Devices.Client.InternalClient.OpenAsync() at Microsoft.Azure.Devices.Edge.Agent.IoTHub.ModuleClientProvider.CreateAndOpenSdkModuleClient(UpstreamProtocol upstreamProtocol, ConnectionStatusChangesHandler statusChangedHandler) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azur e.Devices.Edge.Agent.IoTHub/ModuleClientProvider.cs:line 221 at Microsoft.Azure.Devices.Edge.Agent.IoTHub.ModuleClientProvider.<>c__DisplayClass19_1.<<CreateSdkModuleClient>b__2>d.MoveNext() in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.IoTHub/ModuleClientProvider.cs:l ine 185 --- End of stack trace from previous location where exception was thrown --- at Microsoft.Azure.Devices.Edge.Util.Fallback.ExecuteAsync[T](Func
1[] options) --- End of inner exception stack trace --- ---> (Inner Exception #1) Microsoft.Azure.Devices.Client.Exceptions.UnauthorizedException: error(condition:amqp:unauthorized-access,description:Put token failed. status-code: 401, status-description: The specified SAS token has an inval id signature. It does not match either the primary or secondary key..) ---> Microsoft.Azure.Amqp.AmqpException: Put token failed. status-code: 401, status-description: The specified SAS token has an invalid signature. It does not match either the primary or secondary key.. at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result) at Microsoft.Azure.Amqp.AmqpCbsLink.<>cDisplayClass4_0.1.FromAsyncCoreLogic(IAsyncResult iar, Func
2 endFunction, Action1 endAction, Task
1 promise, Boolean requiresSynchronization) --- End of stack trace from previous location where exception was thrown --- at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTCbsLink.SendTokenAsync(ICbsTokenProvider tokenProvider, Uri namespaceAddress, String audience, String resource, String[] requiredClaims, TimeSpan timeout) --- End of inner exception stack trace --- at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTCbsLink.SendTokenAsync(ICbsTokenProvider tokenProvider, Uri namespaceAddress, String audience, String resource, String[] requiredClaims, TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpAuthenticationRefresher.InitLoopAsync(TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTConnection.CreateRefresherAsync(DeviceIdentity deviceIdentity, TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpConnectionHolder.CreateRefresherAsync(DeviceIdentity deviceIdentity, TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.EnsureSessionAsync(TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.OpenAsync(TimeSpan timeout) at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpTransportHandler.OpenAsync(CancellationToken cancellationToken) at Microsoft.Azure.Devices.Client.Transport.ProtocolRoutingDelegatingHandler.OpenAsync(CancellationToken cancellationToken) at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.<>cDisplayClass23_0.<1 asyncOperation) at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.<>c__DisplayClass33_0.<<OpenInternalAsync>b__0>d.MoveNext() --- End of stack trace from previous location where exception was thrown --- at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.EnsureOpenedAsync(CancellationToken cancellationToken) at Microsoft.Azure.Devices.Client.InternalClient.OpenAsync() at Microsoft.Azure.Devices.Edge.Agent.IoTHub.ModuleClientProvider.CreateAndOpenSdkModuleClient(UpstreamProtocol upstreamProtocol, ConnectionStatusChangesHandler statusChangedHandler) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azur e.Devices.Edge.Agent.IoTHub/ModuleClientProvider.cs:line 221 at Microsoft.Azure.Devices.Edge.Agent.IoTHub.ModuleClientProvider.<>c__DisplayClass19_1.<<CreateSdkModuleClient>b__3>d.MoveNext() in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.IoTHub/ModuleClientProvider.cs:l ine 191 --- End of stack trace from previous location where exception was thrown --- at Microsoft.Azure.Devices.Edge.Util.Fallback.ExecuteAsync[T](Func
1[] options)<---edge-hub logs
``` <6> 2022-06-07 13:16:59.296 +00:00 [INF] - Error getting cloud connection for deviceDeployment config for edgeHub / edgeAgent
``` "systemModules": { "edgeAgent": { "settings": { "image": "mcr.microsoft.com/azureiotedge-agent:1.1.13-linux-arm32v7", "createOptions": "{\"HostConfig\":{\"Memory\":104857600,\"MemorySwap\":104857600,\"Binds\":[\"/