Closed JaivyDaam closed 1 year ago
Hi @JaivyDaam, I cannot think of anything other than the API to unpublish the nodes must have been called. There is no path that "automatically" updates the CosmosDB documents in the code base itself. Since the publisher is just observing the changes in the database it makes sense that it stopped sending data when the configuration is deleted. The only approach I can see to lock this down is to enable diagnostics on the data plane for Cosmos DB: https://learn.microsoft.com/en-us/azure/cosmos-db/monitor-resource-logs?tabs=azure-portal and see who is changing the content of the document.
Hi @marcschier, luckily I had already enabled the CosmosDB diagnostics and was able to search for the external IP that called CosmosDB. It was indeed the app service running the all-in-one container. Looking at the transactions, it seems that something or someone did indeed POST to the publisher bulk endpoint. The logs say:
device: PC
ip: 0.0.0.0
This leads me to believe someone actually used the Swagger API to edit these endpoints. After asking internally, nobody has said they did, nor can I imagine they did.
Just in case, I turned off the Engineering Tool and re-added the jobconfiguration to the CosmosDB.
Since then, no issues have been found.
I'll continue to dig through the logs in case it happens again. If you have a recommendation on how to log the request bodies that the "all-in-one" container receives/sends and/or that CosmosDB receives, I'd be glad to hear it, as I have been unable to find a way.
I'll keep monitoring for now and thank you for your swift reply!
Hi @marcschier, I've come back to you with more information.
After disabling the Engineering Tool and rechecking our code (no service has a route to the "all-in-one" container), the datasetwriter was still deleted from CosmosDB.
I've made some screenshots regarding the End-to-end transactions hoping to provide you with some additional information. Obviously they have been anonymised, but I can provide you with some text that goes with the screenshots.
I've used the /publisher/swagger API to republish the nodes, as the previous diagnosticOutput issue was resolved. I can see in my End-to-end transaction list that it triggered an additional GET call to /registry. Maybe this has something to do with it?
I hope this helps, if not please do let me know!
Hi @marcschier, it happened again, this time the edgeHub is showing an error:
I've made a 30 minute support bundle, this should cover the process. Please find it here:
Hi,
It happened 3 times today and I really needed some answers, so I dug into the problem. I found this relevant issue: #1625. Apparently it indicates a connection problem to the cloud.
I looked at the settings to see what the cause might be and saw that I had only added the Cloudflare DNS 1.1.1.1 in /etc/docker/daemon.json. I've added 8.8.8.8 and 8.8.4.4 to that list.
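For reference, the change above can be sketched in Python. This is only an illustration of the edit, assuming the servers live under the standard "dns" key of daemon.json; the helper name add_dns_servers is made up for this example, and the sketch works on an in-memory dict rather than writing the real file.

```python
import json


def add_dns_servers(daemon_config: dict, servers: list[str]) -> dict:
    """Return a copy of the config with `servers` appended to "dns", skipping duplicates."""
    updated = dict(daemon_config)
    dns = list(updated.get("dns", []))
    for server in servers:
        if server not in dns:
            dns.append(server)
    updated["dns"] = dns
    return updated


# What the file contained before the change (only the Cloudflare resolver).
current = {"dns": ["1.1.1.1"]}

# Add the two Google resolvers mentioned above.
updated = add_dns_servers(current, ["8.8.8.8", "8.8.4.4"])
print(json.dumps(updated, indent=2))
```

The helper leaves the original dict untouched, so the old and new configuration can be compared before anything is written back.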
As the real cause is hard to find, I am monitoring this to see if the issue will persist.
I do wonder though, how this error ends up clearing out the datasetWriter object in the CosmosDB Endpoint Document. But that would be another question for another day :-)
So, it kept degrading, with the interval between failures getting shorter and shorter.
I rebooted the edge completely and it seems to be running stable, but I am unable to pinpoint the exact problem. I will let you know when I have more information about this.
Describe the bug The CosmosDB loses its endpoint configuration for the JobConfiguration.Job.writerGroup.dataSetWriters object. This seems to happen at random intervals, and we don't see it in our PoC. It might be specific to our case and I don't have a clear way of reproducing this bug. We noticed because we no longer received any messages in our eventhub. Upon investigating, I noticed that the job config was empty.
Expected behavior I expect that the object is not deleted in the CosmosDB.
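To catch this earlier than by missing eventhub messages, a check along these lines could be run against the endpoint document. It is a minimal sketch: only the path JobConfiguration.Job.writerGroup.dataSetWriters comes from this issue, while the rest of the document shape, the function name has_dataset_writers, and the sample writer entry are assumptions made for illustration. Fetching the document from CosmosDB is left out.

```python
from typing import Any

# Path to the object that goes missing, as named in this issue.
WRITERS_PATH = ("JobConfiguration", "Job", "writerGroup", "dataSetWriters")


def has_dataset_writers(document: dict[str, Any]) -> bool:
    """Return True only if the full path exists and holds at least one writer."""
    node: Any = document
    for key in WRITERS_PATH:
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return bool(node)


# Hypothetical documents: one with a writer configured, one that has been emptied.
healthy = {
    "JobConfiguration": {
        "Job": {"writerGroup": {"dataSetWriters": [{"dataSetWriterId": "hypothetical-writer"}]}}
    }
}
emptied = {"JobConfiguration": {"Job": {"writerGroup": {"dataSetWriters": []}}}}

print(has_dataset_writers(healthy))
print(has_dataset_writers(emptied))
```

A missing key anywhere along the path is treated the same as an empty list, since both mean the publisher has nothing to publish.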
Additional context I've added logs covering a 5-minute span. In the publisher log you can see that it cancels the job:
[2023-03-10 11:19:31.461 INF Microsoft.Azure.IIoT.Agent.Framework.Agent.Worker+JobProcess] Job {endpointID} cancelled.
I restarted it soon after. Support bundle below:
support_bundle_2023_03_10_11_31_28_UTC.zip
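When going through a support bundle, the cancellation entries can be pulled out of the publisher log with a small script. This is a rough sketch that assumes every cancellation line has the same shape as the one quoted above; the regular expression, the function name find_cancellations, and the second sample line are made up for this example and are not part of the publisher itself.

```python
import re
from datetime import datetime

# Matches lines shaped like the "Job ... cancelled." entry quoted in this issue.
CANCELLED = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[A-Z]{3}) (?P<source>[^\]]+)\] "
    r"Job (?P<job>.+) cancelled\.$"
)


def find_cancellations(lines):
    """Return (timestamp, job id) pairs for every cancellation line found."""
    events = []
    for line in lines:
        match = CANCELLED.match(line.strip())
        if match:
            ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, match["job"]))
    return events


sample = [
    "[2023-03-10 11:19:31.461 INF Microsoft.Azure.IIoT.Agent.Framework.Agent.Worker+JobProcess] Job {endpointID} cancelled.",
    "some unrelated line that should be ignored",  # hypothetical filler line
]
events = find_cancellations(sample)
for ts, job in events:
    print(ts.isoformat(), job)
```

Comparing these timestamps with the CosmosDB diagnostics should show whether the document was changed before or after the job was cancelled.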