Azure / iotedge

The IoT Edge OSS project
MIT License

Unable to communicate with Edge modules (in EFLOW) from the host OS #7220

Closed bhjertaas closed 4 months ago

bhjertaas commented 7 months ago

Expected Behavior

We should be able to communicate with Edge modules from the host OS

Current Behavior

Sometimes, after a reboot of the PC, communication from the host OS to the Edge modules does not work. Connect-EflowVm, Get-EflowVmAddr, and similar commands still work, but HTTP calls to the Edge modules within EFLOW simply time out. Unfortunately we don't have many exceptions or logs to provide, but here is the output from a PowerShell session on Windows:

PS C:\Windows\System32> get-EflowVmAddr
[01/17/2024 11:07:07] Querying IP and MAC addresses from virtual machine (DESKTOP-E4VKT5P-EFLOW)
 - Virtual machine MAC: 00:15:5d:43:7d:b6
 - Virtual machine IP : 172.18.92.180 retrieved directly from virtual machine
00:15:5d:43:7d:b6
172.18.92.180

PS C:\Windows\System32> curl http://172.18.92.180:9602/metrics
curl: (28) Failed to connect to 172.18.92.180 port 9602 after 21008 ms: Couldn't connect to server
PS C:\Windows\System32> curl http://172.18.92.180:1337/metadata
curl: (28) Failed to connect to 172.18.92.180 port 1337 after 21031 ms: Couldn't connect to server

The following image shows the problem over time. Whenever the PC is unable to make simple HTTP requests, the graph drops to zero. The test PCs are set up to restart every 3 hours.

[image]

Steps to Reproduce


  1. When this happens, run Get-EflowVmAddr and copy the IP address it returns.
  2. Make a request to a module, for example curl http://172.18.92.180:9602/metrics, which in our case gets metric data from edgeHub because we have mapped the internal port 9600 to 9602. It doesn't matter what you request; no request works.
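
The two steps above can be sketched as a small Python check run from the Windows host. This is a minimal sketch and not part of the original report: the function name probe_tcp is hypothetical, and the address and ports are the ones quoted in this issue. It attempts a plain TCP connect, so it separates a timeout (nothing answers at all) from an actively refused connection.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Try a TCP connect and classify the outcome.

    Returns "open", "timeout", "refused", or "error:<reason>".
    probe_tcp is a hypothetical helper name used for illustration only.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer within the timeout, similar to curl error 28.
        return "timeout"
    except ConnectionRefusedError:
        # The address is reachable but nothing listens on that port.
        return "refused"
    except OSError as exc:
        # Any other network failure, for example "no route to host".
        return f"error:{exc}"


# Example call (address and port taken from the report; replace the address
# with the current output of Get-EflowVmAddr):
#   probe_tcp("172.18.92.180", 9602)
```

A "timeout" result for every mapped port, while the same request succeeds from inside the EFLOW VM, would match the behaviour described in this issue.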

Context (Environment)

Output of iotedge check

```
iotedge-user@DESKTOP-5SO2V9I-EFLOW [ ~ ]$ sudo iotedge check

Configuration checks (aziot-identity-service)
---------------------------------------------
√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
‼ aziot-identity-service package is up-to-date - Warning
    Installed aziot-identity-service package has version 1.4.6 but 1.4.7 is the latest stable version available.
    Please see https://aka.ms/aziot-update-runtime for update instructions.
√ host time is close to reference time - OK
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK

Connectivity checks (aziot-identity-service)
--------------------------------------------
‼ host can connect to and perform TLS handshake with iothub AMQP port - Warning
    Could not retrieve iothub_hostname from provisioning file.
    Please specify the backing IoT Hub name using --iothub-hostname switch if you have that information.
    Since no hostname is provided, all hub connectivity tests will be skipped.
‼ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - Warning
    Could not retrieve iothub_hostname from provisioning file.
    Please specify the backing IoT Hub name using --iothub-hostname switch if you have that information.
    Since no hostname is provided, all hub connectivity tests will be skipped.
‼ host can connect to and perform TLS handshake with iothub MQTT port - Warning
    Could not retrieve iothub_hostname from provisioning file.
    Please specify the backing IoT Hub name using --iothub-hostname switch if you have that information.
    Since no hostname is provided, all hub connectivity tests will be skipped.
√ host can connect to and perform TLS handshake with DPS endpoint - OK

Configuration checks
--------------------
√ aziot-edged configuration is well-formed - OK
√ configuration up-to-date with config.toml - OK
√ container engine is installed and functional - OK
√ configuration has correct URIs for daemon mgmt endpoint - OK
‼ aziot-edge package is up-to-date - Warning
    Installed IoT Edge daemon has version 1.4.20 but 1.4.27 is the latest stable version available.
    Please see https://aka.ms/iotedge-update-runtime for update instructions.
√ container time is close to host time - OK
√ DNS server - OK
√ production readiness: logs policy - OK
√ production readiness: Edge Agent's storage directory is persisted on the host filesystem - OK
√ production readiness: Edge Hub's storage directory is persisted on the host filesystem - OK
√ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK

Connectivity checks
-------------------

26 check(s) succeeded.
5 check(s) raised warnings. Re-run with --verbose for more details.
7 check(s) were skipped due to errors from other checks. Re-run with --verbose for more details.
```

Device Information

Runtime Versions

Note: when using Windows containers on Windows, run docker -H npipe:////./pipe/iotedge_moby_engine version instead

Additional Information

When we do encounter this issue, it remains a problem until the PC is restarted.

We have been in contact with the Iotedge-eflow team, and one member of that team has had a session with us on one of the test PCs that did not work. He believes the problem has to do with the IoT Edge part, which is why I am posting this issue here.

nyanzebra commented 7 months ago

@bhjertaas would you mind providing a support bundle? The command is iotedge support-bundle.

From what I understand, there is no issue retrieving the metrics via curl from EFLOW vm side, but when trying from host machine side it does not work. Is this correct?

Additionally, is this a recent issue, or something that has been happening for a while?

bhjertaas commented 7 months ago

Sorry for the delay, here is a support bundle.

support_bundle_2024_02_25_14_47_17_UTC.zip

You asked: "From what I understand, there is no issue retrieving the metrics via curl from EFLOW vm side, but when trying from host machine side it does not work. Is this correct?"

Yes, correct.

This has been happening for a while. As mentioned, we have 7 test PCs that are set to reboot, and we've been seeing this problem quite regularly. If you set up a PC with EFLOW that reboots every 2 hours, say, I estimate you should be able to reproduce this behaviour within two days. (Of course, you need something on Windows that makes HTTP calls to a module endpoint to detect when it happens.)
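
The "something on Windows that makes HTTP calls" mentioned above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the health-reporting system used in this report: the names http_ok and watch are hypothetical, and the polling interval is arbitrary. It records a timestamp each time reachability changes, so the start of an outage can be lined up with a reboot.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Callable, Iterator, Tuple


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with any HTTP status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status, so the network path works.
        return True
    except (urllib.error.URLError, OSError):
        # Timeouts, refused connections, and unreachable hosts end up here.
        return False


def watch(check: Callable[[], bool], samples: int,
          interval_s: float = 60.0) -> Iterator[Tuple[str, str]]:
    """Yield (UTC timestamp, "up" or "down") whenever reachability changes."""
    previous = None
    for i in range(samples):
        state = "up" if check() else "down"
        if state != previous:
            yield datetime.now(timezone.utc).isoformat(), state
            previous = state
        if i < samples - 1:
            time.sleep(interval_s)


# Example (URL taken from the report; replace the address with the current
# output of Get-EflowVmAddr):
#   for stamp, state in watch(lambda: http_ok("http://172.18.92.180:9602/metrics"), samples=120):
#       print(stamp, state)
```

Passing the check as a callable keeps the loop independent of the endpoint, so the same loop can watch several module ports.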

According to our health-reporting system, the communication problem on this particular test PC (test-PC-6) started on Feb 25th at 12:15 CET.

The IP at the time was 172.18.164.255 as provided by get-eflowVmAddr.

curl http://172.18.164.255:9602/metrics
curl: (28) Failed to connect to 172.18.164.255 port 9602 after 21047 ms: Couldn't connect to server

I've had a look at the bundle and can't find much. It is unfortunate that logs for the modules themselves are "too large". See my other issue on that. But I doubt this networking problem is caused by module wrongdoing anyway.

nyanzebra commented 7 months ago

@jagadishmurugan while the edge team is investigating the support bundle, is there anything you would recommend to @bhjertaas to validate EFLOW-to-host networking? Just want to make sure that all looks correct.

konichi3 commented 7 months ago

@jagadishmurugan @nyanzebra Can you follow up and share your finding?

konichi3 commented 6 months ago

@jagadishmurugan @nyanzebra Any updates on this?

nyanzebra commented 6 months ago

@bhjertaas sorry for the delay, this got lost amongst other issues I have been looking at. The support bundle provided seems to have failed to get logs: "Error grabbing logs: log message is too large (973735539 > 1000000)". Would it be possible to create a support bundle for a smaller window of time? For example, support-bundle --since 6h?
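
To make the numbers in that error easier to read, here is a small Python sketch that pulls them out of the message. It is only an illustration: parse_too_large is a hypothetical helper, and reading the two numbers as "reported message size" and "limit" is an assumption based on the wording of the error, not on the Moby source.

```python
import re
from typing import Optional, Tuple


def parse_too_large(message: str) -> Optional[Tuple[int, int, float]]:
    """Extract (size, limit, size / limit) from a "log message is too large" error.

    Returns None if the message does not have the expected shape.
    parse_too_large is a hypothetical helper used for illustration only.
    """
    match = re.search(r"log message is too large \((\d+) > (\d+)\)", message)
    if match is None:
        return None
    size, limit = int(match.group(1)), int(match.group(2))
    return size, limit, size / limit


error = "Error grabbing logs: log message is too large (973735539 > 1000000)"
size, limit, ratio = parse_too_large(error)
print(f"reported size {size} is about {ratio:.0f} times the limit {limit}")
```

Under that reading, the reported size is roughly 974 times the limit and the complaint is about a single log message rather than the total amount of log data.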

bhjertaas commented 6 months ago

I have tried to make a support bundle with a smaller time frame, but the "log message is too large" problem remains.

Until the underlying Moby bug is fixed, the recommended Local logging driver is not a viable option. Which log driver do you recommend instead on Edge devices? I'm just thinking that perhaps we will have to sort that out first, before we can get clarity on this issue.

nyanzebra commented 5 months ago

@bhjertaas sorry for the delay here, from the linked Moby bug it seems like there is no known solution for the logger at this time. By default I believe the local log driver is used: https://learn.microsoft.com/en-us/azure/iot-edge/production-checklist?view=iotedge-1.4#set-up-default-logging-driver

Beyond that @veyalla, do we have any other official stance on log drivers? I believe we leave it open to customer needs as in above doc and that is it. Just want to double confirm.

veyalla commented 4 months ago

If there is no line of sight from the Moby side to a fix for the local log driver, I'd recommend we change our docs to suggest the json-file driver with size limits as the default log driver. Adding @bishal41 @PatAltimore for following up on the docs change.
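
For readers unfamiliar with the suggestion, the following Python sketch generates the kind of container engine daemon.json content it refers to. The keys log-driver, log-opts, max-size, and max-file are standard Docker/Moby daemon options; the function name and the values 10m and 3 are illustrative assumptions, not a recommendation made in this thread. The sketch only shows what the configuration looks like and makes no claim that it avoids the log-reading error discussed here.

```python
import json


def json_file_logging_config(max_size: str = "10m", max_file: int = 3) -> str:
    """Build daemon.json content selecting the json-file driver with size limits.

    The default values are illustrative assumptions, not taken from this thread.
    """
    config = {
        "log-driver": "json-file",
        "log-opts": {
            "max-size": max_size,       # rotate a log file once it reaches this size
            "max-file": str(max_file),  # log-opts values are written as strings
        },
    }
    return json.dumps(config, indent=4)


print(json_file_logging_config())
```

The output would typically be saved as the container engine's daemon.json (for example /etc/docker/daemon.json on Linux), followed by a restart of the container engine.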

bishal41 commented 4 months ago

@veyalla - I was chatting with @cpuguy83 and changing our docs to use json-file logging driver as a default won't fix the issue as it's the same problem with both drivers. @bhjertaas would you be able to try the suggestions provided in this issue and see if that helps? https://github.com/Azure/iotedge/issues/7074#issuecomment-2062913691