Closed: ericvruder closed this issue 3 years ago
@ericvruder Thanks for the report. We are investigating.
This happened once again, but this time it seems like things are a bit different.
Once again, the last message in the workspace is
<6> 2021-06-16 07:42:15.729 +00:00 [INF] - Entering periodic task to reauthenticate connected clients
But the logs on the edge agent are considerably different:
Also, the iotedge check command returns something quite different from last time:
sudo iotedge check
Configuration checks
--------------------
√ config.yaml is well-formed - OK
√ config.yaml has well-formed connection string - OK
√ container engine is installed and functional - OK
√ config.yaml has correct hostname - OK
√ config.yaml has correct URIs for daemon mgmt endpoint - OK
‼ latest security daemon - Warning
Installed IoT Edge daemon has version 1.1.1 but 1.1.3 is the latest stable version available.
Please see https://aka.ms/iotedge-update-runtime for update instructions.
‼ host time is close to real time - Warning
Could not query NTP server
caused by: could not receive NTP server response: Resource temporarily unavailable (os error 11)
√ container time is close to host time - OK
‼ DNS server - Warning
Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
You can ignore this warning if you are setting DNS server per module in the Edge deployment.
‼ production readiness: certificates - Warning
The Edge device is using self-signed automatically-generated development certificates.
They will expire in 75 days (at 2021-08-31 11:13:08 UTC) causing module-to-module and downstream device communication to fail on an active deployment.
After the certs have expired, restarting the IoT Edge daemon will trigger it to generate new development certs.
Please consider using production certificates instead. See https://aka.ms/iotedge-prod-checklist-certs for best practices.
√ production readiness: container engine - OK
√ production readiness: logs policy - OK
‼ production readiness: Edge Agent's storage directory is persisted on the host filesystem - Warning
The edgeAgent module is not configured to persist its /tmp/edgeAgent directory on the host filesystem.
Data might be lost if the module is deleted or updated.
Please see https://aka.ms/iotedge-storage-host for best practices.
× production readiness: Edge Hub's storage directory is persisted on the host filesystem - Error
Could not check current state of edgeHub container
caused by: docker returned exit code: 1, stderr = Error: No such object: edgeHub
Connectivity checks
-------------------
√ host can connect to and perform TLS handshake with DPS endpoint - OK
√ host can connect to and perform TLS handshake with IoT Hub AMQP port - OK
√ host can connect to and perform TLS handshake with IoT Hub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with IoT Hub MQTT port - OK
× container on the default network can connect to IoT Hub AMQP port - Error
Container on the default network could not connect to iot-connectivityservices-dev-westeurope.azure-devices.net:5671
caused by: docker returned exit code: 1, stderr = One or more errors occurred. (Resource temporarily unavailable)
× container on the default network can connect to IoT Hub HTTPS / WebSockets port - Error
Container on the default network could not connect to iot-connectivityservices-dev-westeurope.azure-devices.net:443
caused by: docker returned exit code: 1, stderr = One or more errors occurred. (Resource temporarily unavailable)
× container on the default network can connect to IoT Hub MQTT port - Error
Container on the default network could not connect to iot-connectivityservices-dev-westeurope.azure-devices.net:8883
caused by: docker returned exit code: 1, stderr = One or more errors occurred. (Resource temporarily unavailable)
√ container on the IoT Edge module network can connect to IoT Hub AMQP port - OK
√ container on the IoT Edge module network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to IoT Hub MQTT port - OK
15 check(s) succeeded.
5 check(s) raised warnings.
4 check(s) raised errors.
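(Regarding the DNS warning above: we don't configure a DNS server on the container engine. If we were to, my understanding is that it would be a daemon.json entry along these lines - an untested sketch, and the address is only an example:)

```
{
  "dns": ["8.8.8.8"]
}
```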
I don't understand why it is now, all of a sudden, claiming that there is no bind from the container to the host. There were no changes made whatsoever; this happened by itself.
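For reference on what that bind would look like: my understanding (an illustrative sketch only - the paths and image tag are placeholders, not taken from our deployment) is that edgeHub persistence is declared with a storageFolder environment variable plus a matching Binds entry in the module's createOptions:

```
"edgeHub": {
  "type": "docker",
  "env": {
    "storageFolder": { "value": "/iotedge/storage" }
  },
  "settings": {
    "image": "mcr.microsoft.com/azureiotedge-hub:1.1",
    "createOptions": "{\"HostConfig\":{\"Binds\":[\"/srv/edgehub:/iotedge/storage\"]}}"
  }
}
```

The storageFolder value inside the container has to match the container-side path in the Binds entry.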
EDIT:
Also, this time around, I can't even restart the modules:
iotedge restart edgeAgent && iotedge restart edgeHub
This gives me the following error:
A module runtime error occurred
caused by: Could not restart module edgeAgent
caused by: Client error
I was forced to delete the edge gateway and reinstall it completely to get it working this time around.
@ericvruder Let me check what's happening. Thanks.
@ericvruder Could you give more details about your environment? I have run some tests using VMs and development certificates and I can't reproduce the problem. Is it a production or test environment?
This is in a production environment - I think that the last message, "entering reauthentication step", just so happens to be the last one because there is no corresponding "successfully reauthenticated" message, and because it is the message that is logged most often in our scenario.
After reviewing the event logs in Event Viewer, I was able to determine that the physical machine on which EFLOW is running was shut down. That happened ~3 minutes after iotedge stopped reporting. I'm sorry for not exploring this before creating the original error report - I really should have caught this.
The shutdown probably happened a bit earlier, as can be seen here, where Hyper-V reports a shutdown request:
The last log message is from ~15 minutes before the shutdown event, from "RestartManager" not being able to restart an unrelated application.
Following the reboot, I can see that the machine had a hard time re-establishing network connectivity. I see log messages about shared drives not being mapped and timeouts from a few applications. I am having a really hard time pinning down exactly when the machine regained network connectivity because my logs are incredibly noisy.
I apologize for not catching this earlier. Next time around, I will see if I can restart the machine and try to re-establish network connectivity on the physical machine without interfering with EFLOW. I will report back with the results. I will close this issue in the meantime.
We have a few edge gateways deployed as part of our solution, and periodically they would fail their reauthentication step. The gateway would be running fine, and all of a sudden the last log in our workspace would be
Entering periodic task to reauthenticate connected clients
and then nothing else. If I log into the machine and look at the logs, these error messages keep getting written:
Logs
```
<6> 2021-06-08 07:56:04.932 +00:00 [INF] - Attempting to connect to IoT Hub for client 4011-p1-PC0226645/opcpublisher via AMQP...
<4> 2021-06-08 07:56:25.681 +00:00 [WRN] - Error creating cloud connection for client 4011-p1-PC0226645/opcpublisher
Microsoft.Azure.Devices.Client.Exceptions.IotHubCommunicationException: Transient network error occurred, please retry.
 ---> System.Net.Sockets.SocketException (11): Resource temporarily unavailable
   at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTTransport.InitializeAsync(TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpIoTConnector.OpenConnectionAsync(TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpConnectionHolder.EnsureConnectionAsync(TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpConnectionHolder.OpenSessionAsync(DeviceIdentity deviceIdentity, TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.EnsureSessionAsync(TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.OpenAsync(TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpTransportHandler.OpenAsync(CancellationToken cancellationToken)
   at Microsoft.Azure.Devices.Client.Transport.ProtocolRoutingDelegatingHandler.OpenAsync(CancellationToken cancellationToken)
   at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.<>c__DisplayClass23_0.<
```
On and on and on. I cannot find anything more specific than this. If you could help me find more precise info, that would also be appreciated.
I do not know how to reproduce this; it happens periodically. If you could give me some things to try in order to reproduce it, that would be greatly appreciated. This also happens with different machines at different times.
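If it would help to get more detail out of the system modules, one thing I can try is raising the log verbosity via the RuntimeLogLevel environment variable in the deployment manifest (a sketch based on the documented setting; the image tag here is illustrative):

```
"edgeHub": {
  "type": "docker",
  "env": {
    "RuntimeLogLevel": { "value": "debug" }
  },
  "settings": {
    "image": "mcr.microsoft.com/azureiotedge-hub:1.1"
  }
}
```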
If I restart the system modules, everything starts working again, no problem. So, running this solves the problem:
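iotedge restart edgeAgent && iotedge restart edgeHub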
Output of iotedge check
Device Information
Windows 10
Runtime Versions