OPCFoundation / UA-.NETStandard

OPC Unified Architecture .NET Standard

Client cannot reconnect to an OPC UA Server hosted in Kubernetes #2391

Open htbmw opened 7 months ago

htbmw commented 7 months ago

Type of issue

Current Behavior

I containerized the reference opcua server (from this repo) and deployed it to kubernetes. I also developed and deployed a connector (an opcua client using nuget packages from this repo) to kubernetes, which connects to this opcua server. Unfortunately, the reconnect behaviour doesn't work with the reference opcua server running in kubernetes.

When I stop the opcua server, I can see in my connector (client) logs that the connection goes down and that it is waiting to reconnect. When I start the opcua server again, the connector never reconnects. This reconnect feature works perfectly in a non-container environment (local) as well as in a docker environment.

When I connect this same connector (from kubernetes) to a physical PLC, the reconnect works perfectly.

Expected Behavior

When I deploy the reference opcua server to kubernetes, I expect an opcua client to be able to successfully reconnect to it after stopping and starting the reference opcua server.

Steps To Reproduce

1) Containerize the reference opcua server (from this repo) and deploy it to kubernetes
2) Deploy a service for the opcua server so that a client can connect to it
3) Deploy an opcua client (using the reconnect workflow from this repo's stack) to kubernetes
4) Connect the client and server
5) Restart the server and observe the reconnect behaviour of the client

Environment

- OS:
- Environment:
- Runtime:
- Nuget Version:
- Component:
- Server:
- Client:

Anything else?

No response

mregen commented 7 months ago

Hi @htbmw , two questions:

htbmw commented 7 months ago

Good day @mregen ,

When I deploy the opc ua server, the service maps to the same ports. Here is an example of the service:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: opcua-server
  name: opcua-server
spec:
  ports:
    - name: "62541"
      port: 62541
      targetPort: 62541
  selector:
    app: opcua-server

I haven't considered the certificate. I guess it gets recreated because in my deployment for the server, I do not mount a volume and I do not do anything extra to manage certificates. Is this perhaps the reason it cannot reconnect?

I will check if I can persist and reuse the certificate in k8s.

mregen commented 7 months ago

Hi @htbmw , the ref server supports a special option to map the config file and the cert stores to a persistent location when using docker, as described in https://github.com/OPCFoundation/UA-.NETStandard/blob/master/Docs/ContainerReferenceServer.md But I haven't tried it with Kubernetes.

htbmw commented 7 months ago

Thanks @mregen , I will take a look and report back.

htbmw commented 7 months ago

Hi @mregen,

Thanks for the tip regarding the certificate.

I managed to mount a persistent volume for the opcua server in k8s and also confirmed that the volume was set up correctly.

But, on a pod restart, I noticed that the new pod still generated a new certificate. I could see this by opening an interactive shell on the pod and listing the files in the folder ../OPC Foundation/pki/own/certs/. On each restart a new certificate file was added in addition to older certificate files.

By default, k8s assigns a dynamic hostname that changes on each restart, and this is what caused the certificate to be recreated on each pod restart. To solve this, I explicitly set the hostname of the container to a static value in the deployment manifest, and upon restart the server now reuses the existing certificate that was created on the first startup.
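For illustration, a minimal sketch of the relevant parts of such a deployment manifest (the image name, mount path and PVC name below are placeholders for illustration, not the exact values from my cluster):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: opcua-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: opcua-server
  template:
    metadata:
      labels:
        app: opcua-server
    spec:
      # static hostname so the generated certificate stays valid across pod restarts
      hostname: opcua-server
      containers:
        - name: opcua-server
          image: my-registry/opcua-refserver:latest   # placeholder image
          ports:
            - containerPort: 62541
          volumeMounts:
            # persist the pki store so the certificate survives restarts
            # (mount path is an assumption; adjust to wherever the server writes its pki folder)
            - name: pki
              mountPath: "/root/.local/share/OPC Foundation"
      volumes:
        - name: pki
          persistentVolumeClaim:
            claimName: opcua-server-pki   # placeholder PVC name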

Finally I managed to get the client to reconnect to the server after I restart the server's pod.

But, there is still one thing that does not look right: The reconnect only happens after the session's OperationTimeout expires. The default value I had for this setting was 120 seconds (2 min).

The reconnect does not seem to honor the reconnectPeriod value I specify to the reconnectHandler.BeginReconnect method:

var reconnectPeriod = 5000; // 5 seconds
var reconnectHandler = new SessionReconnectHandler();
reconnectHandler.BeginReconnect(session, reconnectPeriod, ServerReconnectComplete!);

I added a setting to my client wherein I could override the OperationTimeout value and could confirm that the reconnect happens much quicker if I set it to 5 seconds instead of 120 seconds.

What am I still missing? Why is it not reconnecting within the period specified by the reconnectPeriod? If the OperationTimeout is set too short, I guess long-running requests like browse requests may unnecessarily time out, correct?

mregen commented 7 months ago

Hi @htbmw, please check the sample code here: https://github.com/OPCFoundation/UA-.NETStandard/blob/8a2cf92710e85f04fbd12bc5bb58fbefaea42d7c/Applications/ConsoleReferenceClient/UAClient.cs#L268

The reconnect is triggered by a keep alive error, so it depends on the period you set for the keep alive timer. 5000 ms is a good setting.
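For illustration, a rough sketch of that pattern (field names are placeholders and the exact keep alive event signature may differ between package versions; the linked UAClient.cs is the authoritative version):

using System;
using Opc.Ua;
using Opc.Ua.Client;

public class ReconnectingClient
{
    private ISession m_session;
    private readonly SessionReconnectHandler m_reconnectHandler = new SessionReconnectHandler();
    private const int ReconnectPeriod = 5000; // ms, how quickly a reconnect is attempted

    public void Attach(ISession session)
    {
        m_session = session;
        // the keep alive interval determines how quickly a dead connection is detected
        m_session.KeepAliveInterval = 5000;
        m_session.KeepAlive += Session_KeepAlive;
    }

    private void Session_KeepAlive(ISession session, KeepAliveEventArgs e)
    {
        // a bad keep alive result means the server stopped responding
        if (ServiceResult.IsBad(e.Status))
        {
            // hand the session to the reconnect handler; it retries in the background
            m_reconnectHandler.BeginReconnect(m_session, ReconnectPeriod, Client_ReconnectComplete);

            // suppress further keep alive checks while the reconnect handler is active
            e.CancelKeepAlive = true;
        }
    }

    private void Client_ReconnectComplete(object sender, EventArgs e)
    {
        // the handover to the reconnected session goes here
        // (see the later comment about disposing old sessions)
    }
}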

The SessionReconnectHandler has been reworked to support a mode where it can run as a single instance, but it remains backward compatible. The single-instance mode is the recommended way, as shown in the updated sample.

It also supports exponential backoff.

Changing the OperationTimeout is not the recommended way.

htbmw commented 7 months ago

Thanks @mregen,

How recent was this rework of the SessionReconnectHandler? We've been using this SessionReconnectHandler for a long time now, but it appears to be very much the same as the sample. There is no definite 1 to 1 mapping between our code and the sample because we have another wrapper around the reconnect handler for testing purposes.

I am not sure exactly what is different from the way it worked before the rework, so could you please point me to a commit so I can check myself.

mregen commented 7 months ago

Hi @htbmw , the improved 'SessionReconnectHandler' is designed to be backward compatible, so it still works in the mode where every reconnect requires a new instance. In the new mode, only a single instance is used in the client, and it can be retriggered after a reconnect was successful. There is also a state variable which can be inspected by the client, and the reconnect timing can follow an exponential backoff scheme.

The rationale for supporting this new mode was to simplify the use pattern: I have debugged user code which created multiple SessionReconnectHandler instances trying to reconnect at the same time, whereas a single instance that can be retriggered avoids this.

I would also like to point you to the handover after a successful reconnection: the client is responsible for disposing the old session, but only if a new session instance was created. Hope this helps.
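To make that handover concrete, here is a hedged sketch of what the reconnect-complete callback might look like, continuing the client sketch from the earlier comment (the reference client's exact checks may differ):

private void Client_ReconnectComplete(object sender, EventArgs e)
{
    // ignore callbacks from a reconnect handler instance we no longer own
    if (!ReferenceEquals(sender, m_reconnectHandler))
    {
        return;
    }

    // the handler exposes a session once the reconnect succeeded
    if (m_reconnectHandler.Session != null &&
        !ReferenceEquals(m_session, m_reconnectHandler.Session))
    {
        // a brand new session was created: take it over and dispose the old one;
        // if the original session was simply reactivated, there is nothing to dispose
        var oldSession = m_session;
        m_session = m_reconnectHandler.Session;
        Utils.SilentDispose(oldSession);
    }
}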

htbmw commented 7 months ago

Thanks @mregen for explaining.

I appreciate the rationale behind simplifying the use pattern. I suspect my implementation might also be too complex.

I will do some testing and check if this works in kubernetes.