Azure / Industrial-IoT

Azure Industrial IoT Platform

CosmosDB Endpoint Document loses its dataSetWriter object. #1959

Closed JaivyDaam closed 1 year ago

JaivyDaam commented 1 year ago

Describe the bug The CosmosDB loses its endpoint configuration for the JobConfiguration.Job.writerGroup.dataSetWriters object. This seems to happen at random intervals; we did not see it in our PoC. It might be specific to our case, and I don't have a clear way of reproducing this bug.

We noticed because we no longer received any messages in our Event Hub. Upon investigating, I noticed that the job config was empty.

Expected behavior I expect the object not to be deleted from CosmosDB.

Additional context I've added logs covering a 5-minute span; in the publisher log you can see that it cancels the job: [2023-03-10 11:19:31.461 INF Microsoft.Azure.IIoT.Agent.Framework.Agent.Worker+JobProcess] Job {endpointID} cancelled.

I restarted it soon after. Support bundle below:

support_bundle_2023_03_10_11_31_28_UTC.zip

marcschier commented 1 year ago

Hi @JaivyDaam, I cannot think of anything other than that the API to unpublish the nodes must have been called. There is no code path that "automatically" updates the CosmosDB documents in the code base itself. Since the publisher is just observing changes in the database, it makes sense that it stopped sending data when the configuration was deleted. The only approach I can see to track this down is to enable diagnostics on the data plane for Cosmos DB (https://learn.microsoft.com/en-us/azure/cosmos-db/monitor-resource-logs?tabs=azure-portal) and see who is changing the content of the document.
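
A minimal sketch of such a query, assuming the data-plane diagnostics are routed to a Log Analytics workspace in resource-specific mode (CDBDataSPlaneRequests is assumed to be named CDBDataPlaneRequests), the azure-identity and azure-monitor-query Python packages, and a placeholder workspace ID; exact column and operation names may vary in your workspace:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient, LogsQueryStatus

# Sketch: list recent write operations against the Cosmos DB data plane,
# to see which client IP / user agent modified the endpoint document.
client = LogsQueryClient(DefaultAzureCredential())

query = """
CDBDataPlaneRequests
| where OperationName in ("Create", "Upsert", "Replace", "Delete", "Patch")
| project TimeGenerated, OperationName, RequestResourceType, ClientIpAddress, UserAgent, ActivityId
| order by TimeGenerated desc
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query=query,
    timespan=timedelta(days=1),
)

if response.status == LogsQueryStatus.SUCCESS:
    for table in response.tables:
        for row in table.rows:
            print(row)
```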

JaivyDaam commented 1 year ago

Hi @marcschier, I had luckily already enabled the CosmosDB diagnostics and was able to find the external IP that called the CosmosDB. This was indeed the App Service running the all-in-one container. Looking at the transactions, it seems that something (or someone) did POST to the publisher bulk endpoint. The log says device: PC and ip: 0.0.0.0, which leads me to believe someone actually used the swagger API to edit these endpoints; after asking internally, nobody said they did, nor can I imagine they would have. Just in case, I turned off the Engineering Tool and re-added the job configuration to the CosmosDB. Since then, no issues have been found.

I'll continue to digest the logs for whenever it happens again. If you have a recommendation on how to enable logging of the request bodies that the "all-in-one" container receives/sends and/or that the CosmosDB receives, I'd be glad to hear it, but I have been unable to find one.

I'll keep monitoring for now and thank you for your swift reply!

JaivyDaam commented 1 year ago

Hi @marcschier, I've come back to you with more information.

After disabling the Engineering Tool and rechecking our code (no service has a route to the "all-in-one" container), the dataSetWriter object was still deleted from CosmosDB.

I've made some screenshots of the end-to-end transactions, hoping to provide you with some additional information. They have obviously been anonymised, but I can provide some text that goes with the screenshots.

[Screenshot 2023-03-14 at 16 35 33] [Screenshot 2023-03-14 at 16 35 44]

I've used the /publisher/swagger API to republish the nodes, as the previous diagnosticOutput issue was resolved. In my end-to-end transaction list I can see that this triggered an additional GET call to the /registry. Maybe this has something to do with it?
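
For reference, such a republish can also be done programmatically instead of through the swagger UI. A minimal sketch using Python's requests, assuming a 2.8-style bulk publish route (POST /publisher/v2/publish/{endpointId}/bulk), a hypothetical node ID, and placeholder host/token values; the exact route and payload shape are assumptions and should be confirmed against the swagger UI of your deployment:

```python
import requests

service_url = "https://<your-platform-host>"   # placeholder: base URL of the deployed platform
endpoint_id = "<endpointID>"                    # placeholder: the registered OPC UA endpoint id
token = "<aad-bearer-token>"                    # placeholder: AAD token for the service

# Assumed payload shape for a bulk publish request: nodes to (re)publish and nodes to remove.
body = {
    "nodesToAdd": [
        {
            "nodeId": "ns=2;s=ExampleNode",      # hypothetical node id
            "publishingInterval": "00:00:01",
            "samplingInterval": "00:00:01",
        }
    ],
    "nodesToRemove": [],
}

resp = requests.post(
    f"{service_url}/publisher/v2/publish/{endpoint_id}/bulk",  # assumed route, verify in swagger
    json=body,
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.status_code, resp.text)
```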

I hope this helps, if not please do let me know!

JaivyDaam commented 1 year ago

Hi @marcschier, it happened again, this time the edgeHub is showing an error:

Error message:

```
<6> 2023-03-15 15:47:49.010 +00:00 [INF] [Microsoft.Azure.Devices.Edge.Hub.CloudProxy.Authenticators.DeviceScopeTokenAuthenticator] - Error authenticating token for DataSetWriterV2_{endpointID} because the token is expired or could not be parsed
<7> 2023-03-15 15:47:49.010 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.CloudProxy.Authenticators.DeviceScopeTokenAuthenticator] - Error parsing token for DataSetWriterV2_{endpointID}
System.UnauthorizedAccessException: The specified SAS token is expired
   at Microsoft.Azure.Devices.Edge.Util.SharedAccessSignature..ctor(String iotHubName, DateTime expiresOn, String expiry, String keyName, String signature, String encodedAudience) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/SharedAccessSignature.cs:line 36
   at Microsoft.Azure.Devices.Edge.Util.SharedAccessSignature.Parse(String iotHubName, String rawToken) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/SharedAccessSignature.cs:line 80
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.Authenticators.DeviceScopeTokenAuthenticator.TryGetSharedAccessSignature(String token, IIdentity identity, SharedAccessSignature& sharedAccessSignature) in /mnt/vss/_work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/authenticators/DeviceScopeTokenAuthenticator.cs:line 160
<6> 2023-03-15 15:47:49.010 +00:00 [INF] [Microsoft.Azure.Devices.Edge.Hub.Core.DeviceScopeAuthenticator] - Credentials for client DataSetWriterV2_{endpointID} are not valid.
<7> 2023-03-15 15:47:49.010 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Core.DeviceScopeAuthenticator] - Client DataSetWriterV2_{endpointID} in device scope not reauthenticated locally.
<7> 2023-03-15 15:47:49.010 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Core.Authenticator] - DataSetWriterV2_{endpointID} re-authentication failure
<4> 2023-03-15 15:47:49.010 +00:00 [WRN] [Microsoft.Azure.Devices.Edge.Hub.Core.ConnectionReauthenticator] - Reauthenticating client DataSetWriterV2_{endpointID} failed, removing client connection
<6> 2023-03-15 15:47:49.010 +00:00 [INF] [Microsoft.Azure.Devices.Edge.Hub.Core.ConnectionReauthenticator] - Unable to re-authenticate DataSetWriterV2_{endpointID}, dropping client connection.
<6> 2023-03-15 15:47:49.010 +00:00 [INF] [Microsoft.Azure.Devices.Edge.Hub.Core.ConnectionManager] - Removing device connection for device DataSetWriterV2_{endpointID} with removeCloudConnection flag 'True'.
<4> 2023-03-15 15:47:49.010 +00:00 [WRN] [EdgeHub] - "Closing connection for device: DataSetWriterV2_{endpointID}, Microsoft.Azure.Devices.Edge.Hub.Core.EdgeHubConnectionException: Connection closed for device DataSetWriterV2_{endpointID}., "
<6> 2023-03-15 15:47:49.010 +00:00 [INF] [Microsoft.Azure.Devices.Edge.Hub.Mqtt.MessagingServiceClient] - Disposing MessagingServiceClient for device Id DataSetWriterV2_{endpointID} because of exception - Microsoft.Azure.Devices.Edge.Hub.Core.EdgeHubConnectionException: Connection closed for device DataSetWriterV2_{endpointID}.
<6> 2023-03-15 15:47:49.010 +00:00 [INF] [Microsoft.Azure.Devices.Edge.Hub.Mqtt.DeviceProxy] - Closing device proxy for device Id DataSetWriterV2_{endpointID}
<7> 2023-03-15 15:47:49.031 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ConnectivityAwareClient] - Received connection status changed callback with connection status Disabled and reason Client_Close for DataSetWriterV2_{endpointID}
<6> 2023-03-15 15:47:49.031 +00:00 [INF] [Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudProxy] - Closing receiver in cloud proxy 453c4f61-a5da-427c-9263-3c351dbd024d for DataSetWriterV2_{endpointID}
<6> 2023-03-15 15:47:49.104 +00:00 [INF] [Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudProxy] - Error receiving C2D messages for DataSetWriterV2_{endpointID} in cloud proxy 453c4f61-a5da-427c-9263-3c351dbd024d. Closing receive loop.
System.ObjectDisposedException: The semaphore has been disposed.
   at System.Threading.SemaphoreSlim.Release(Int32 releaseCount)
   at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.EnsureSessionIsOpenAsync(TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.EnsureMessageReceivingLinkIsOpenAsync(TimeSpan timeout, Boolean enableCallback)
   at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.ReceiveMessageAsync(TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpTransportHandler.ReceiveAsync(TimeoutHelper timeoutHelper)
   at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.ExecuteWithErrorHandlingAsync[T](Func`1 asyncOperation)
   at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.<>c__DisplayClass19_0.<b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.ReceiveAsync(TimeoutHelper timeoutHelper)
   at Microsoft.Azure.Devices.Client.InternalClient.ReceiveAsync(TimeSpan timeout)
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudProxy.CloudReceiver.C2DMessagesLoop(IClient deviceClient) in /mnt/vss/_work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudProxy.cs:line 527
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudProxy.CloudReceiver.C2DMessagesLoop(IClient deviceClient) in /mnt/vss/_work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudProxy.cs:line 541
<7> 2023-03-15 15:47:49.105 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudProxy] - Got a non-recoverable error from client for DataSetWriterV2_{endpointID}. Closing the cloud proxy since it may be in a bad state.
System.ObjectDisposedException: The semaphore has been disposed.
   at System.Threading.SemaphoreSlim.Release(Int32 releaseCount)
   ... (same stack trace as above) ...
<6> 2023-03-15 15:47:49.107 +00:00 [INF] [Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudProxy] - Closing receiver in cloud proxy 453c4f61-a5da-427c-9263-3c351dbd024d for DataSetWriterV2_{endpointID}
```

I've made a 30-minute support bundle, which should cover the incident. Please find it here:

support_bundle_2023_03_15_16_14_17_UTC.zip

JaivyDaam commented 1 year ago

Hi,

It happened 3 times today, and I really needed some answers, so I dug into the problem. I found this relevant issue: #1625; apparently it indicates a connection problem to the cloud.

I have looked at the settings for what the cause might be and saw that I had only added the Cloudflare DNS server 1.1.1.1 to /etc/docker/daemon.json. I've now added 8.8.8.8 and 8.8.4.4 to that list; the resulting configuration is sketched below.
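
The resulting /etc/docker/daemon.json would then look roughly like this (assuming no other daemon settings are present in the file; the Docker daemon needs to be restarted for the change to take effect):

```json
{
  "dns": ["1.1.1.1", "8.8.8.8", "8.8.4.4"]
}
```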

As the real cause is hard to find, I am monitoring this to see if the issue will persist.

I do wonder, though, how this error ends up clearing out the dataSetWriter object in the CosmosDB endpoint document. But that would be another question for another day :-)

JaivyDaam commented 1 year ago

So, the situation kept degrading, with the failures occurring at ever shorter intervals.

I rebooted the edge completely and it seems to be running stably now, but I am unable to pinpoint the exact problem. I will let you know when I have more information about this.