Azure / Industrial-IoT

Azure Industrial IoT Platform
MIT License

IoT Edge Publisher trying to connect to invalid endpoint #2178

Closed greg-mcnamara closed 3 months ago

greg-mcnamara commented 8 months ago

Describe the bug

I'm currently running the sample installation on Azure. I've manually created a new OPC simulator VM which has a different URL from the original simulator. I deleted the existing application and registered the new endpoint. This all works OK, but I notice in the Edge Publisher module logs that it's still trying to connect to the old URL/IP. It seems to be holding on to a discovery or endpoint connection on the old URL, but when I query the registry I can't find any reference to the old endpoint. Any ideas on how to troubleshoot this?

To Reproduce

Steps to reproduce the behavior:

  1. Install the sample setup on Azure.
  2. Remove the installed OPC simulator VM and manually set up a new OPC simulator VM.
  3. Delete the registered application for the old OPC simulator.
  4. Register the endpoint for the new OPC simulator.
  5. Query the registry for endpoints; only the new one is listed.
  6. Check the IoT Edge Publisher module logs; they show connection attempts to the old OPC simulator URL.

Expected behavior

Publisher should no longer try to connect to the old endpoint after the application/endpoint has been removed from the registry.

marcschier commented 8 months ago

Hi @greg-mcnamara. Could you tell me what operations (if any) you performed on the previously registered OPC Simulator? Did you subscribe or call any service API? Or nothing at all?

If nothing at all, that points to the "deactivation" not having worked as expected (or not having happened at all). The activate (connect) call made upon registration takes a reference on the client, which keeps it alive until deactivate is called, and that call might have failed. This design is very bad. Granted, it tried to mimic the previous 2.8 behavior, but the reference only lives until the publisher reboots. So we might as well not do any of it, which is pretty much the fix I will make here: removing all of this "statefulness".

However, if you used the simulator before, could you tell me what you did so I can repro?

If you look at the logs, how many "ref:" references on the "bad" client do you see in the logs (should be on the lines showing the errors)?

greg-mcnamara commented 8 months ago

Thanks @marcschier, I forgot to mention that I replaced the original OPC sim VM with my own on IP 10.1.8.6, and then replaced that with another VM on 10.1.8.5. The OPC server demo I'm working with has a time-limited dev license, which is why I keep on switching VMs... Both have the same Windows device name "vm-packwiseiiot". Maybe that's causing some confusion?

Here are some recurring lines in the publisher edge module logs:

`#1: Failed to connect opc.tcp://10.1.8.6:48010/_0D3343CE_uata7756c8294d5ad809851ff63b34ab65e7d014508 [state:NotReachable|refs:1] to opc.tcp://10.1.8.6:48010/: Error establishing a connection: BadNotConnected...

1: Failed to connect opc.tcp://10.1.8.6:48010/_952F5DFC_uatb5c172c56f77f8afa02d12d1d080c1b0882e81a6 [state:NotReachable|refs:1] to opc.tcp://10.1.8.6:48010/: Error establishing a connection: BadNotConnected...`

What's interesting is that it's trying to connect to 2 different endpoints on the wrong IP address, but the endpoint ID of the first entry above is the same as the endpoint I'm connecting to on the new IP address.

Below is the response from GET /registry/v2/endpoints on the cloud publisher. This configuration works: I can browse nodes on the endpoint and I'm receiving published node update messages. I think that by swapping OPC servers with the same device name but different IP addresses I've created some confusion for the module. Publisher module restarts do not resolve the problem. Is there a way to reset the config and start from scratch?

{
  "items": [
    {
      "registration": {
        "id": "uata7756c8294d5ad809851ff63b34ab65e7d014508",
        "endpointUrl": "opc.tcp://vm-packwiseiiot:48010",
        "siteId": "iothub-zkt2bc_device_linuxgateway0-54zwr5s_module_publisher",
        "discovererId": "iothub-zkt2bc_device_linuxgateway0-54zwr5s_module_publisher",
        "endpoint": {
          "url": "opc.tcp://10.1.8.5:48010/",
          "alternativeUrls": [
            "opc.tcp://10.1.8.5:48010/",
            "opc.tcp://vm-packwiseiiot:48010"
          ],
          "securityMode": "None",
          "securityPolicy": "http://opcfoundation.org/UA/SecurityPolicy#None",
          "certificate": "357D3570A0FC8AA9DF00788EB37FFD90F9EA08C8"
        },
        "securityLevel": 0,
        "authenticationMethods": [
          {
            "id": "Anonymous-[0]-None-None",
            "credentialType": "None"
          }
        ]
      },
      "applicationId": "uas315baf1f453bc61b53697fbfc885641383822046"
    }
  ]
}
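For anyone comparing publisher logs against the registry in a similar situation, here is a minimal Python sketch. It assumes the log line format quoted above (`<url>_<tag>_<endpoint id> [state:...|refs:N]`) and the shape of the registry response shown in this comment; the function names `parse_failures` and `classify` are made up for illustration and are not part of the platform.

```python
import re
from urllib.parse import urlsplit

# Assumption: matches only the "Failed to connect" line format quoted in this
# thread; other publisher versions may log differently.
LOG_PATTERN = re.compile(
    r"Failed to connect (?P<url>opc\.tcp://[^/\s]+/)"
    r"_(?P<tag>[0-9A-Fa-f]+)_(?P<id>\w+) "
    r"\[state:(?P<state>\w+)\|refs:(?P<refs>\d+)\]"
)


def parse_failures(log_text):
    """Extract endpoint id, host, state and ref count from each failure line."""
    return [
        {
            "id": m["id"],
            "host": urlsplit(m["url"]).hostname,
            "state": m["state"],
            "refs": int(m["refs"]),
        }
        for m in LOG_PATTERN.finditer(log_text)
    ]


def classify(failures, registry_response):
    """Label each failing connection by how it relates to the registry."""
    registered = {}
    for item in registry_response.get("items", []):
        reg = item["registration"]
        endpoint = reg["endpoint"]
        urls = [endpoint["url"], *endpoint.get("alternativeUrls", [])]
        registered[reg["id"]] = {urlsplit(u).hostname for u in urls}

    result = []
    for failure in failures:
        hosts = registered.get(failure["id"])
        if hosts is None:
            status = "not registered"
        elif failure["host"] not in hosts:
            status = "registered under a different host"
        else:
            status = "registered"
        result.append((failure["id"], failure["host"], status))
    return result
```

Fed the two log lines and the registry response from this comment, the first connection would be labelled as registered under a different host and the second as not registered, which matches the observation above.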

marcschier commented 8 months ago

There is an inherent issue in that the publisher maintains its own view of endpoints and "published nodes" in the underlying json file, and that can get out of sync with the state of "endpoint" entries in IoT Hub. E.g., you can subscribe to a node on an endpoint, which causes the local pn.json file to get updated, then delete the endpoint, and then the contents of the json file are not updated automatically.

As a mitigation, you can list the content of the file and then manually remove the entries for the deleted endpoints. This is not done automatically, as one would expect. It is indeed the same (bad) behavior as in 2.8, although there the pn.json was stored in Cosmos DB.

Doing this automatically and reliably will be a bit of work. For example, we could trigger the pn.json update when the endpoint is removed, but that could fail (e.g., when not connected to the publisher), so we would likely want some form of daemon process that keeps all items (apps, endpoints, publishednodes entries) in sync. I will keep this in 2.9.5, but I am not sure whether we will find time to do it then or later.
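A minimal Python sketch of that manual cleanup follows. It assumes the file is a JSON array of entries that each carry an `EndpointUrl` key (check this against your own file first); `prune_entries` and `prune_file` are hypothetical helper names, and the path you pass in depends on your deployment.

```python
import json
from urllib.parse import urlsplit


def prune_entries(entries, stale_hosts):
    """Return the entries whose EndpointUrl host is not in stale_hosts.

    Assumption: each entry is a dict with an "EndpointUrl" key. Entries
    without a parseable host are kept rather than silently dropped.
    """
    stale = {host.lower() for host in stale_hosts}
    kept = []
    for entry in entries:
        host = urlsplit(entry.get("EndpointUrl", "")).hostname
        if host is None or host not in stale:
            kept.append(entry)
    return kept


def prune_file(path, stale_hosts):
    """Rewrite the file at `path` without stale entries; return how many were removed."""
    with open(path, encoding="utf-8") as handle:
        entries = json.load(handle)
    kept = prune_entries(entries, stale_hosts)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(kept, handle, indent=2)
    return len(entries) - len(kept)
```

For the case in this thread the call would look something like `prune_file(path_to_pn_json, ["10.1.8.6"])`. Keeping a backup copy of the file before rewriting it is a sensible precaution.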

greg-mcnamara commented 8 months ago

Hi @marcschier, on a similar note, I was wondering how the publisher API service (cloud) assigns new endpoint registrations when there might be multiple endpoints with the same IP/url connected to different edge gateways? For example, if I have 2 edge gateways in 2 different locations (and networks) but the IP addressing for OPC servers is the same at both locations (say both are on 10.0.0.1), and I go to register an endpoint for that IP, how does it know which endpoint to register? Would I also need to specify the gateway/site when registering?

greg-mcnamara commented 7 months ago

Hi @marcschier, just wondering if you had any feedback on my comment above? Is there a way to specify which edge gateway should be used to register an endpoint (where an endpoint with the same IP address is connected to another edge gateway)?

marcschier commented 7 months ago

Sorry about this, I had not seen it.

There is some code in the repo that generates unique SHA identifiers for endpoints and applications, and these are used in the web API. Endpoint hashes include the endpoint URL and are hashed together with the id of the endpoint's application, so the same endpoint will resolve differently if the application description is different. The application id uses the product URI and application URI (the latter is supposed to differ from server to server).

So, in your case, if the same server reports the same information to the 2 edges, there will be a single endpoint in the cloud. If there are 2 servers with the same IP, they should report different application descriptions, and then 2 applications with separate endpoints will be registered. Of course, if the server is a clone that reports the same information and sits on the same IP, then there will be conflicts.
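To illustrate the idea (not the repo's actual algorithm), here is a small Python sketch. The choice of SHA-1, the way the inputs are joined, and the `uas`/`uat` prefixes (borrowed from the look of the ids earlier in this thread) are all assumptions made for the example, as are the function names.

```python
import hashlib


def _sha1_hex(*parts):
    # Assumption: inputs are normalized and joined with newlines; the real
    # implementation may normalize and combine them differently.
    joined = "\n".join(part.strip().lower() for part in parts)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def application_id(application_uri, product_uri):
    """Hypothetical application id derived from application URI and product URI."""
    return "uas" + _sha1_hex(application_uri, product_uri)


def endpoint_id(endpoint_url, app_id):
    """Hypothetical endpoint id derived from the endpoint URL and its application id."""
    return "uat" + _sha1_hex(endpoint_url, app_id)


# Two servers behind the same URL but with different application URIs
# (for example, because the computer name is part of the URI).
url = "opc.tcp://10.0.0.1:48010/"
app_a = application_id("urn:plant-a:scada", "urn:vendor:scada")
app_b = application_id("urn:plant-b:scada", "urn:vendor:scada")
print(endpoint_id(url, app_a) != endpoint_id(url, app_b))
```

Because the application id feeds into the endpoint hash, two servers on the same URL only collide when they also report the same application URI and product URI, which is the clone case described above.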

greg-mcnamara commented 7 months ago

Thanks @marcschier that makes total sense. I'm working with a SCADA software product which includes an OPC server. From what I can see, the SCADA application isn't reporting unique instance information, but I will work with the vendor to see if that's possible in case we have 2 application instances at the same org/site/plant.

greg-mcnamara commented 7 months ago

Update: the SCADA application will include the computer name in the application URI, so it is distinguishable. Just to confirm, if I register an endpoint in the cloud service with url say opc.tcp://10.0.0.1:48010/, and I have 2 edge gateways that can connect to that same url (2 different OPC servers), will it automatically register both endpoints each with their own application?

marcschier commented 3 months ago

That should be the case if the application URI is unique for both servers.

marcschier commented 3 months ago

Closing unless I am mistaken and there is still an issue here, in which case let me know.