apache / plc4x

PLC4X The Industrial IoT adapter
https://plc4x.apache.org/
Apache License 2.0
1.26k stars 401 forks source link

[Bug]: OPC UA connection is not funktioning anymore after 1 hour and 15 minutes #1364

Closed RobinVDP closed 3 months ago

RobinVDP commented 8 months ago

What happened?

Isseu

When executing a read request every second in an ethernal loop After 1 hour and 15 minutes the connection is not funktioning anymore when executing a read request every second in an ethernal loop.

readRequest.execute().get()

When adding a timeout off 1000 ms to the get statement a exception is thrown. After this the connection has to be terminated and restarted in order to contineu reading data from the OPC UA Server.

2024-01-18 16:35:30,658 ERROR [org.apa.plc.jav.opc.pro.OpcuaSubscriptionHandle] (pool-10-thread-1) Timeout while waiting for subscription response: java.util.concurrent.TimeoutException at org.apache.plc4x.java.spi.netty.NettyHashTimerTimeoutManager.lambda$0(NettyHashTimerTimeoutManager.java:54) at io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715) at io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34) at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703) at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790) at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503) at java.base/java.lang.Thread.run(Thread.java:840)

This also happens after 1 hour and 15 minutes when trying to retrieve data with a subscription.

Tested OPC UA Servers

Siemens S7-1500 Lokale OPC UA Server running on the same PC

Version

v0.11.0

Programming Languages

Protocols

splatch commented 8 months ago

I believe it might be related to resource leak. There were some issues which been addressed by #1108 and #1139. We do have some bugs related to internal thread management of driver.

Can you please try running locally built PR #1007? I believe it might greatly improve stability of opcua driver due to changes to thread management. Two days ago I've got a thumbs up about this PR working fine with secured S7-1200, making a two hour test with S7-1500 is something we can't do (I don't owe such hardware). The PR itself is about to be merged, but knowing how it works with your case will let me prioritize work on eventual bug fixes (if they still needed).

Cheers, Łukasz

RobinVDP commented 8 months ago

We build the branch locally (ConnectorIO:pg/security-policy) and try to run our application but we are receiving following error. Unsupported transport tcp We saw there where some security policy changes and tried adding securityPolicy=NONE and messageSecurity=NONE. But without succes, are we overlooking some changes or aditional configuration parameters?

chrisdutz commented 8 months ago

Just had a look and it seems that you need to add a dependency on the TCP transport ... this error is shown, if the service loader can't find a Transport named "tcp".

RobinVDP commented 8 months ago

Indeed the transportLoader seems empty while debugging and the for loop is skipped. The connectionStringwe are passing to DefaultPlcDriverManager().getConnection is: opcua:tcp://localhost:48020?discovery=false It's the same connectionString we used in release 0,11,0 which is working.

chrisdutz commented 8 months ago

You probably only need a dependency to this:

<dependency>
  <groupId>org.apache.plc4x</groupId>
  <artifactId>plc4j-transport-tcp</artifactId>
  <version>0.12.0-SNAPSHOT</version>
</dependency>
RobinVDP commented 8 months ago

Yes adding the dependency solved it, your branch now running with a local OPC UA Server. I'll report back in a few hours if it's still running.

If that functions I will also test it with the Seimens S7 1500 on Tuesday when I'm in the office and report back.

RobinVDP commented 8 months ago

Saddly the isseu is not solved and the connection still stops working

splatch commented 8 months ago

That's pity. Can you share which server you used for testing? I tested PR in question with prosys simulation server, .net reference server and s7-1200. I would like to hunt for this bug and get it solved, but first need to confirm reference environment where I could hit reproduction criteria.

RobinVDP commented 8 months ago

Besides testing it with a S7-1500 we tested it using OPC UA ANSI C Demo Server (Windows) from unified-automation. https://www.unified-automation.com/downloads/opc-ua-servers.html

We tried reading out following tags cyclically reading them every second or with a subscription

ns=4;s=Demo.Dynamic.Scalar.Boolean ns=4;s=Demo.Dynamic.Scalar.Int16 ns=4;s=Demo.Dynamic.Scalar.UInt16 ns=4;s=Demo.Dynamic.Scalar.String

While testing with your branch the OPC UA server started throwing following messages shortly after starting:

07:40:20.697Z|E|588C* UaServer_Publish: List of stored NotificationMessages is full for SubscriptionId 548127632, discarding message 1 07:40:21.695Z|E|588C* UaServer_Publish: List of stored NotificationMessages is full for SubscriptionId 548127632, discarding message 2 07:40:22.698Z|E|588C* UaServer_Publish: List of stored NotificationMessages is full for SubscriptionId 548127632, discarding message 3

splatch commented 8 months ago

Hey Robin, Thanks for sharing this details. I have not known this animal yet, will give it a spin. I've touched subscription handling logic (actually the event poll trigger), so I'm likely guilty of the behavior change in this area.

RobinVDP commented 8 months ago

I also tested it with our S7-1500 today to see if there was any difference or additional error messages. But unfortunately, the connection also stops working with a S7-1500.

iamaccosta commented 8 months ago

Hi, I am an external User and I am having this problem too. I've also been struggling with another issue that I am not sure if it is related to this one. When any of the requirements of the Tasks fails, the tasks stop (e.g. when I turn off the VPN connection), but when I recover all the requirements, I would like the Tasks to return to the normal, but currently I have to restart it manually. Is there any configuration that allows that recovery or is it a problem?

splatch commented 8 months ago

The mentioned PR got merged. I've found there one issue with resource cleanup which got fixed in #1401. I'll have a look on demo server this week and see why we are failing to satisfy its requirements.

RobinVDP commented 8 months ago

Unfortunately, the issue is not resolved in release 0.12.0

splatch commented 7 months ago

Hello @RobinVDP, can you give a try to PR which I just linked? I found place where outstanding acknowledges are lost. I believe this is main reason why subscriptions stopped working.

EDIT: Made a test run with simulator, connection/subscription gets again stale after 90 minutes.

splatch commented 7 months ago

Ok, I've found reason - refreshing of OPC UA token turns subscription into "stale" mode. I've made second run with trace logging and found this:

2024-03-15 12:47:49,038 DEBUG | plc4x-tm-thread-2 | org.apache.plc4x.java.opcua.context.Conversation | Submitting Transaction to TransactionManager 2705, security channel 20, token 1  (subscription publish request)
2024-03-15 12:47:49,502 DEBUG | nioEventLoopGroup-3-1 | org.apache.plc4x.java.opcua.context.Conversation | Received response made of 136 bytes for message id: 2705, channel id:20, token:1 (subscription publish response)
2024-03-15 12:47:49,504 DEBUG | plc4x-tm-thread-3 | org.apache.plc4x.java.opcua.context.SecureChannel | Sending open secure channel message to opc.tcp://arch:4840/OPCUA/SimulationServer (token refreshing)

The last message marks a moment when subscription stopped to work. This part was touched with UA security updates, however scope of changes in this area was fairly limited. Bug is confirmed and identified - issue happens when client attempts to renew security token. It is unrelated to subscriptions, it is related to session lifetime.

splatch commented 6 months ago

Besides testing it with a S7-1500 we tested it using OPC UA ANSI C Demo Server (Windows) from unified-automation. https://www.unified-automation.com/downloads/opc-ua-servers.html

We tried reading out following tags cyclically reading them every second or with a subscription

ns=4;s=Demo.Dynamic.Scalar.Boolean ns=4;s=Demo.Dynamic.Scalar.Int16 ns=4;s=Demo.Dynamic.Scalar.UInt16 ns=4;s=Demo.Dynamic.Scalar.String

While testing with your branch the OPC UA server started throwing following messages shortly after starting:

07:40:20.697Z|E|588C* UaServer_Publish: List of stored NotificationMessages is full for SubscriptionId 548127632, discarding message 1 07:40:21.695Z|E|588C* UaServer_Publish: List of stored NotificationMessages is full for SubscriptionId 548127632, discarding message 2 07:40:22.698Z|E|588C* UaServer_Publish: List of stored NotificationMessages is full for SubscriptionId 548127632, discarding message 3

FYI I've pushed fix for this issue. I haven't had time yet to fix token refresh issue, so leaving issue open for now.

chrisdutz commented 5 months ago

What's the status of this?

splatch commented 5 months ago

Its pending, I closed it by accident by pushing related stability fixes, but these did not solve issue fully. Latest merge @ottlukas did on python side pulled my commit message and accidentally closed issue again. Sorry, it was my fault first.