Azure / iotedge

The IoT Edge OSS project
MIT License
1.45k stars 458 forks source link

Error in IoTEdgeMetricsCollector module: Error scraping endpoint http://edgeAgent:9600/metrics and http://edgeHub:9600/metrics #6552

Closed aleducq closed 2 years ago

aleducq commented 2 years ago

Expected Behavior

Following the documentation https://docs.microsoft.com/en-us/azure/iot-edge/how-to-collect-and-transport-metrics?view=iotedge-2020-11&tabs=iothub I should be able to see my modules in workbook. Trying the Option 1 (with log analytics workspace)

Current Behavior

[2022-08-01 10:01:23 : Starting Module
[2022-08-01 10:01:24.000 INF] Initializing metrics collector
[2022-08-01 10:01:24.129 INF] Version - 1.0.9
[2022-08-01 10:01:24.134 DBG] Waiting on shutdown handler to trigger
[2022-08-01 10:01:24.242 INF] Metrics collector initialized with the following settings:
Settings:
LogAnalyticsWorkspaceId=44f471c6-dd0a-4e9d-9e14-7f2d2485b0c3
Endpoints=[
  "http://edgeHub:9600/metrics",
  "http://edgeAgent:9600/metrics"
]
ScrapeFrequencySecs=300
UploadTarget=AzureMonitor
CompressForUpload=True
TransformForIoTCentral=False
ExperimentalFeatureAddIdentifyingTags=False
AllowedMetrics=
BlockedMetrics=
ResourceId=/subscriptions/b6c68d5e-168e-454c-93e9-8c19598352da/resourcegroups/gsebdsintshaweeursg02/providers/microsoft.operationalinsights/workspaces/gsebdsintshaweeulaw01
IotHubConnectFrequency=1.00:00:00
AzureDomain=azure.com
[2022-08-01 10:01:25.613 INF] Started operation Reconnect to IoT Hub
[2022-08-01 10:01:25.614 INF] Started operation Scrape and Upload Metrics
[2022-08-01 10:02:25.613 INF] Starting periodic operation Reconnect to IoT Hub...
[2022-08-01 10:02:25.647 INF] Starting periodic operation Scrape and Upload Metrics...
[2022-08-01 10:02:25.652 INF] Scraping endpoint http://edgeHub:9600/metrics
[2022-08-01 10:02:25.726 INF] Scraping endpoint http://edgeAgent:9600/metrics
[2022-08-01 10:02:28.604 INF] Closed and re-established connection to IoT Hub
[2022-08-01 10:02:28.604 INF] Successfully completed periodic operation Reconnect to IoT Hub
[2022-08-01 10:02:31.595 ERR] Error scraping endpoint http://edgeAgent:9600/metrics - System.Net.Http.HttpRequestException: Resource temporarily unavailable
 ---> System.Net.Sockets.SocketException (11): Resource temporarily unavailable
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean allowHttp2, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.GetHttpConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Microsoft.Azure.Devices.Edge.Azure.Monitor.MetricsScraper.ScrapeEndpoint(String endpoint, CancellationToken cancellationToken) in /mnt/vss/_work/1/s/edge-modules/metrics-collector/src/MetricsScraper.cs:line 72
[2022-08-01 10:02:31.603 INF] Scraping finished, received 0 metrics from endpoint http://edgeAgent:9600/metrics
[2022-08-01 10:02:32.581 ERR] Error scraping endpoint http://edgeHub:9600/metrics - System.Net.Http.HttpRequestException: Resource temporarily unavailable
 ---> System.Net.Sockets.SocketException (11): Resource temporarily unavailable
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean allowHttp2, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.GetHttpConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Microsoft.Azure.Devices.Edge.Azure.Monitor.MetricsScraper.ScrapeEndpoint(String endpoint, CancellationToken cancellationToken) in /mnt/vss/_work/1/s/edge-modules/metrics-collector/src/MetricsScraper.cs:line 72
[2022-08-01 10:02:32.581 INF] Scraping finished, received 0 metrics from endpoint http://edgeHub:9600/metrics
Successfully created self-signed certificate  for agentGuid : {6db1ee30-e318-4a75-b3f0-24652e4c4032} and workspace: 44f471c6-dd0a-4e9d-9e14-7f2d2485b0c3
OMS endpoint Url : https://44f471c6-dd0a-4e9d-9e14-7f2d2485b0c3.oms.opinsights.azure.com/AgentService.svc/AgentTopologyRequest
sent registration request
waiting response for registration request : OK
registration request processed
Response result status code : OK
Return Result: <?xml version="1.0" encoding="utf-8"?><AgentTopologyResponse queryInterval="PT1H" xmlns="http://schemas.microsoft.com/WorkloadMonitoring/HealthServiceProtocol/2014/09/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><ManagementGroups><ManagementGroup><Name>AOI-44f471c6-dd0a-4e9d-9e14-7f2d2485b0c3</Name><ManagementGroupId>9b689c9d-e619-e158-db9d-c22cff3a77be</ManagementGroupId><HealthServiceId>6db1ee30-e318-4a75-b3f0-24652e4c4032</HealthServiceId><StateSynchronizationEndpoint isAuthenticated="true" queryInterval="PT5M"><Endpoint>https://44f471c6-dd0a-4e9d-9e14-7f2d2485b0c3.oms.opinsights.azure.com/ConfigurationService.Svc/GetCurrentConfigurationCookie</Endpoint><ResponseSelector>/GetConfigurationCookieResponse/ConfigurationCookie</ResponseSelector><RequestHeaders><Header>DataServiceVersion: 1.0;SCHsvc</Header><Header>MaxDataServiceVersion: 2.0;SCHsvc</Header></RequestHeaders></StateSynchronizationEndpoint><ConfigurationRequestEndpoint>https://44f471c6-dd0a-4e9d-9e14-7f2d2485b0c3.oms.opinsights.azure.com/ConfigurationService.Svc/GetConfigurationFile</ConfigurationRequestEndpoint><ManagementPackRequestEndpoint>https://44f471c6-dd0a-4e9d-9e14-7f2d2485b0c3.oms.opinsights.azure.com/ConfigurationService.Svc/GetManagementPack</ManagementPackRequestEndpoint></ManagementGroup></ManagementGroups><CertificateUpdateEndpoint updateCertificate="false">https://44f471c6-dd0a-4e9d-9e14-7f2d2485b0c3.oms.opinsights.azure.com/ConfigurationService.Svc/RenewCertificate</CertificateUpdateEndpoint></AgentTopologyResponse>
StatusCode: 200, ReasonPhrase: 'OK', Version: 1.1, Content: System.Net.Http.HttpConnectionResponseContent, Headers:
{
  Cache-Control: private
  Server: Microsoft-IIS/10.0
  X-AspNet-Version: 4.0.30319
  X-Powered-By: ASP.NET
  Strict-Transport-Security: max-age=31536000; includeSubDomains
  X-Content-Type-Options: nosniff
  Date: Mon, 01 Aug 2022 10:02:34 GMT
  Content-Type: application/xml; charset=utf-8
  Content-Length: 1574
}
[2022-08-01 10:02:34.402 DBG] 200 OK
[2022-08-01 10:02:34.405 INF] Successfully sent 0 metrics to fixed set table
[2022-08-01 10:02:34.405 INF] Successfully completed periodic operation Scrape and Upload Metrics

Steps to Reproduce

Provide a detailed set of steps to reproduce the bug.

  1. Create a layered deployment with IoTEdgeMetricsCollector module
  2. Set Environment variables (ResourceId, UploadTarget, LogAnalyticsWorkspaceId, LogAnalyticsSharedKey, https_proxy, http_proxy, NO_PROXY)

Context (Environment)

Device Information

Runtime Versions

Additional Information

If I add modules http://twin:9701/metrics,http://publisher:9702/metrics in MetricsEndpointsCSV, there's no error scraping these endpoints but nothing is coming up to the LAW. From the module I've checked that the ports are open:

# docker exec -it IoTEdgeMetricsCollector /bin/sh
/app $ nc -vz edgeAgent 9600
edgeAgent (172.18.0.2:9600) open
/app $ nc -vz edgeHub 9600
edgeHub (172.18.0.6:9600) open

I also tried to add the module in deployment (not layered deployment), but the result is the same.

and-rewsmith commented 2 years ago

Hi @aleducq. Are EdgeHub and EdgeAgent modules running as expected? Can I see iotedge check output?

and-rewsmith commented 2 years ago

@aleducq I just synced with a colleague who informed me this issue (along with #6551 too), are related issues and also support tickets opened last week.

Both github issues have the same root cause. Once you correct the missing routes field in the deployment manifest everything should begin operating as expected.

aleducq commented 2 years ago

ok @and-rewsmith, maybe we will wait to solve the issue #6551 to be sure but for your information I have this kind of message in edgeAgent:

<6> 2022-08-02 08:17:42.965 +00:00 [INF] - Starting periodic operation refresh twin config...
<6> 2022-08-02 08:17:43.079 +00:00 [INF] - Obtained Edge agent twin from IoTHub with desired properties version 37 and reported properties version 86.
<6> 2022-08-02 08:17:43.080 +00:00 [INF] - Successfully completed periodic operation refresh twin config
<6> 2022-08-02 08:17:43.427 +00:00 [INF] - Starting periodic operation Metrics Scrape...
<6> 2022-08-02 08:17:43.430 +00:00 [INF] - Scraping Metrics
<6> 2022-08-02 08:17:43.434 +00:00 [INF] - Scraping endpoint http://edgeHub:9600/metrics
<6> 2022-08-02 08:17:43.456 +00:00 [INF] - Scraping endpoint http://edgeAgent:9600/metrics
<6> 2022-08-02 08:18:27.985 +00:00 [INF] - Error connecting to http://172.18.0.2:9600/metrics with result error code GatewayTimeout
<6> 2022-08-02 08:18:27.986 +00:00 [INF] - Error connecting to http://172.18.0.3:9600/metrics with result error code GatewayTimeout
<6> 2022-08-02 08:18:27.991 +00:00 [INF] - Storing Metrics
<6> 2022-08-02 08:18:28.009 +00:00 [INF] - Scraped and Stored Metrics
<6> 2022-08-02 08:18:28.009 +00:00 [INF] - Successfully completed periodic operation Metrics Scrape
and-rewsmith commented 2 years ago

Yes let's focus on #6551 and come back to this one.

aleducq commented 2 years ago

@and-rewsmith, I still have the issue even if #6551 seems to be solved

and-rewsmith commented 2 years ago

Is iotedge check output all green? Any errors from EdgeHub / EdgeAgent logs?

If you don't care about data loss and want to try a further diagnostic step, you can force remove all containers on device and wait for them to be restarted. If the problem persists it will tell us something.

aleducq commented 2 years ago

@and-rewsmith, the iotedge check command failed to execute (invalid uri character), I don't know why...

[root@sl24q682 dsmsaiotedgenoprod]# iotedge check
Error while fetching latest versions of edge components: could not create HTTP client
        caused by: invalid uri character

For the logs, I've noticed some errors in EdgeAgent:

<6> 2022-08-05 13:21:50.035 +00:00 [INF] - Starting periodic operation Metrics Upload...
<6> 2022-08-05 13:21:50.035 +00:00 [INF] - Uploading Metrics
<4> 2022-08-05 13:21:50.036 +00:00 [WRN] - Error in periodic operation Metrics Upload
System.InvalidOperationException: Sequence contains no elements
   at System.Linq.ThrowHelper.ThrowNoElementsException()
   at Microsoft.Azure.Devices.Edge.Agent.Diagnostics.Storage.MetricsStorage.GetAllMetricsInternal(CancellationToken cancellationToken)+MoveNext() in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Diagnostics/storage/MetricsStorage.cs:line 51
   at Microsoft.Azure.Devices.Edge.Agent.Diagnostics.Storage.MetricsStorage.GetAllMetricsInternal(CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
   at System.Linq.AsyncEnumerable.<ToListAsync>g__Core|620_0[TSource](IAsyncEnumerable`1 source, CancellationToken cancellationToken) in /_/Ix.NET/Source/System.Linq.Async/System/Linq/Operators/ToList.cs:line 36
   at System.Linq.AsyncEnumerable.<ToListAsync>g__Core|620_0[TSource](IAsyncEnumerable`1 source, CancellationToken cancellationToken) in /_/Ix.NET/Source/System.Linq.Async/System/Linq/Operators/ToList.cs:line 36
   at Microsoft.Azure.Devices.Edge.Agent.Diagnostics.Storage.MetricsStorage.GetAllMetricsAsync() in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Diagnostics/storage/MetricsStorage.cs:line 38
   at Microsoft.Azure.Devices.Edge.Agent.Diagnostics.MetricsWorker.TryUploadAndClear(CancellationToken cancellationToken) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Diagnostics/MetricsWorker.cs:line 149
   at Microsoft.Azure.Devices.Edge.Agent.Diagnostics.MetricsWorker.Upload(CancellationToken cancellationToken) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Diagnostics/MetricsWorker.cs:line 138
   at Microsoft.Azure.Devices.Edge.Util.PeriodicTask.DoWork() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/PeriodicTask.cs:line 106
<6> 2022-08-05 13:32:57.741 +00:00 [INF] - Starting periodic operation Metrics Scrape...
<6> 2022-08-05 13:32:57.741 +00:00 [INF] - Scraping Metrics
<6> 2022-08-05 13:32:57.741 +00:00 [INF] - Scraping endpoint http://edgeHub:9600/metrics
<6> 2022-08-05 13:32:57.742 +00:00 [INF] - Scraping endpoint http://edgeAgent:9600/metrics
<6> 2022-08-05 13:33:42.260 +00:00 [INF] - Error connecting to http://172.18.0.6:9600/metrics with result error code GatewayTimeout
<6> 2022-08-05 13:33:42.260 +00:00 [INF] - Error connecting to http://172.18.0.2:9600/metrics with result error code GatewayTimeout
<6> 2022-08-05 13:33:42.260 +00:00 [INF] - Storing Metrics
<6> 2022-08-05 13:33:42.260 +00:00 [INF] - Scraped and Stored Metrics
<6> 2022-08-05 13:33:42.260 +00:00 [INF] - Successfully completed periodic operation Metrics Scrape
and-rewsmith commented 2 years ago

Questions:

  1. What version of iotedge are you running?
  2. What platform are you running on?
  3. Does EdgeHub / EdgeAgent operation look like it is operating as expected? Any errors / warnings in the logs?
aleducq commented 2 years ago

Host OS: CentOS Linux release 7.9.2009 (Core) Architecture: Linux 3.10.0-1160.42.2.el7.x86_64 Container OS: Linux containers aziot-edged : iotedge 1.1.5 Edge Agent: azureiotedge-agent:1.1 Edge Hub: azureiotedge-hub:1.1 Docker/Moby: 3.0.6

No errors or warnings except the ones I shared in my previous comment.

and-rewsmith commented 2 years ago

I would need to take a closer look at what is going on. Specifically the system module logs for EdgeHub / EdgeAgent. If you send the support bundle to me as an attachment I can take a look (andsmi@microsoft.com).

Please collect 7 days worth: iotedge support-bundle --since "7d"

konichi3 commented 2 years ago

@aleducq It would be better if you can open support ticket so our support team can help triage the issue.

and-rewsmith commented 2 years ago

@aleducq Did you open a support ticket?

aleducq commented 2 years ago

Hello @and-rewsmith, I'm back from vacation, I've just opened a ticket. I will let you know