[Bug]: plc4j-driver-opcua seems to leak memory - Githubissues

apache / plc4x

PLC4X The Industrial IoT adapter

https://plc4x.apache.org/

Apache License 2.0

1.26k stars 401 forks source link

[Bug]: plc4j-driver-opcua seems to leak memory #1101

Closed takraj closed 6 months ago

takraj commented 1 year ago

What happened?

The OPC-UA driver seem to leak memory when closing a connection object. This issue sooner or later lead to application crash with OOM error. Workaround would be to reuse connection objects, but then we face the issue #1100, which I believe renders a fix for this pretty urgent.

Screenshot from 2023-09-23 14-14-10

Example code to reproduce the issue:

PlcDriverManager driverManager = new PlcDriverManager();

AtomicBoolean running = new AtomicBoolean(true);
Runtime.getRuntime().addShutdownHook(new Thread(() -> running.set(true)));

while (running.get()) {
    PlcConnection connection = driverManager.getConnection("opcua:tcp://127.0.0.1:40185/milo?discovery=false");
    try {
        Thread.sleep(100);
    } finally {
        connection.close();
    }
}

Some additional charts

Screenshot from 2023-09-23 14-23-14

Screenshot from 2023-09-23 14-08-05

Screenshot from 2023-09-23 14-10-16

Screenshot from 2023-09-23 14-10-25

Logs

logs-before-oom.txt

Heap dumps

Version

v0.10.0

Programming Languages

[X] plc4j
[ ] plc4go
[ ] plc4c
[ ] plc4net

Protocols

[ ] AB-Ethernet
[ ] ADS /AMS
[ ] BACnet/IP
[ ] CANopen
[ ] DeltaV
[ ] DF1
[ ] EtherNet/IP
[ ] Firmata
[ ] KNXnet/IP
[ ] Modbus
[X] OPC-UA
[ ] S7

chrisdutz commented 1 year ago

I'll have a look at this as it seems none else is currently planning on doing so ... hopefully this doesn't require too much knowledge of the OPC-UA protocol.

chrisdutz commented 1 year ago

Having seen now that you were using the pre 0.10.0 syntax for creating the connection, I think you have stumbled over a problem that we have already addresses in 0.10.0-SNAPSHOT ... we're planning on releasing that in the next few days. Would be cool, if you could have a look if this is fixed for your case too.

I am currently running an updated version of your program in JProfiler and am not seeing any increase in memory usage.

chrisdutz commented 1 year ago

chrisdutz commented 1 year ago

Ok ... after running it for almost two hours ... there's definitely something going on ... not sure if it's because I'm also running the server in the same VM ... It's consuming quite a bit more Memory and a lot more CPU time.

chrisdutz commented 1 year ago

Ok ... just noticed I was running it in the IntelliJ profiler, not JProfiler ... I split up the application into a server (that just starts milo) and a client (with your code) ... when profilling only the client I can see a constant increase in sleeping threads ... I should probably figgure out why this is happening and fix it before the 0.11.0 release ...

chrisdutz commented 1 year ago

And they all seem to be related to the nio event group:

Hmpf ... I thought we had addressed that issue and I guess it probably also has an effect on all other drivers too ... Usually opening and closing connections shouldn't be the default case, as the connection-cache should be used, but I think having somethign stuck will also be the reason why some applications simply don't stop gracefully.

chrisdutz commented 1 year ago

Should have been this image

chrisdutz commented 1 year ago

YAY ... I found it :-) It wasn#t the event loops .. they are correctly closed. It was the NettyHashTimerTimeoutManager that we start for every connection but never explicitly close it ... by overriding the close() method of ChannelDuplexHandler in Plc4xNettyWrapper and explicitly closing it, the leak of open threads is gone ... you can see a commons-pool growing at the start, but as soon as it's reached it's 11 threads the thread-count stays constant :-)

chrisdutz commented 1 year ago

chrisdutz commented 1 year ago

Please give the current 0.11.0-SNAPSHOT version a try and check if the problem is gone.

takraj commented 1 year ago

@chrisdutz I have tried the revision tagged with v0.11.0, but the problem does not seem to be resolved yet. :(

~/plc4x$ git log -1
commit d22042ef079ae93120770a95704571118d9c871e (HEAD, tag: v0.11.0)
Author: Christofer Dutz <cdutz@apache.org>
Date:   Mon Oct 2 09:51:59 2023 +0200

    [maven-release-plugin] prepare release v0.11.0

As you can see on the screenshot below, the heap usage is still constantly increasing:

Tested with the following code:

public class MemoryLeakDemo {

    public static void main(String[] args) throws Exception {
        PlcDriverManager driverManager = new DefaultPlcDriverManager();
        PlcConnectionManager connectionManager = driverManager.getConnectionManager();

        AtomicBoolean running = new AtomicBoolean(true);
        Runtime.getRuntime().addShutdownHook(new Thread(() -> running.set(true)));

        while (running.get()) {
            PlcConnection connection = connectionManager.getConnection("opcua:tcp://127.0.0.1:12686/milo?discovery=false");
            try {
                Thread.sleep(100);
            } finally {
                connection.close();
            }
        }
    }
}

qtvbwfn commented 10 months ago

I think I have already handled the issue of thread pool leakage, but I'm not sure how to release the related resources.

V0.11.0

please help me @chrisdutz @takraj @hutcheb

dump file:

qtvbwfn commented 10 months ago

What happened?

The OPC-UA driver seem to leak memory when closing a connection object. This issue sooner or later lead to application crash with OOM error. Workaround would be to reuse connection objects, but then we face the issue #1100, which I believe renders a fix for this pretty urgent.

Example code to reproduce the issue:
PlcDriverManager driverManager = new PlcDriverManager();

AtomicBoolean running = new AtomicBoolean(true);
Runtime.getRuntime().addShutdownHook(new Thread(() -> running.set(true)));

while (running.get()) {
    PlcConnection connection = driverManager.getConnection("opcua:tcp://127.0.0.1:40185/milo?discovery=false");
    try {
        Thread.sleep(100);
    } finally {
        connection.close();
    }
}
Some additional charts

Logs

logs-before-oom.txt

Heap dumps

MemoryLeakDemo_1045304_23_09_2023_12_41_27.zip

MemoryLeakDemo_1045304_23_09_2023_13_02_21.zip

MemoryLeakDemo_1045304_23_09_2023_14_05_15.zip

Version

v0.10.0

Programming Languages

[x] plc4j

[ ] plc4go

[ ] plc4c

[ ] plc4net

Protocols

[ ] AB-Ethernet

[ ] ADS /AMS

[ ] BACnet/IP

[ ] CANopen

[ ] DeltaV

[ ] DF1

[ ] EtherNet/IP

[ ] Firmata

[ ] KNXnet/IP

[ ] Modbus

[x] OPC-UA

[ ] S7 hi: It took me about three days, and I have fixed the thread and memory overflow issues in the V0.11.0 version. Can you create a new branch 0.11.1-SNAPSHOT for me to submit my code? The main issue is that the keepAlive in SecureChannel causes the channel to be occupied for a long time without being released. @chrisdutz @takraj @hutcheb

chrisdutz commented 10 months ago

I the world of plcs generally connections should be reused, as for most connections establishing a connection is pretty cost intensive. I know that opc-ua needs to resolve some things when connecting.

So I agree, that both issues should be addressed.

Just not sure when. If nobody else takes on the issue, generally opc-ua related issues are pretty low on my personal priority list 😉

christofe-lintermans-actemium commented 9 months ago

Hi All,

I have the same issue, but when using Apache NiFi integration.

Should I create a new issue, or is this issue 1101 sufficient?

Could a priority be assigned to address this issue?

Thank you in advance!

chrisdutz commented 9 months ago

I would strongly assume it's the same issue, as from my knowledge interenally all integration modules use the same components. So this issue should probably be observalble in all integration modules.

splatch commented 9 months ago

@christofe-lintermans-actemium I believe that #1007 might address some of these issues as it cleans up some of driver internals which could loose consistency over last two releases (0.10-0.11) and fall behind with stability. I am awaiting for feedback on field testing with real devices which should be completed this or next week after which I would like to merge these changes to develop. Then the Apache Nifi team will be able to produce new development build which should be free of (major) memory leaks.

chrisdutz commented 9 months ago

Aaaah ... cool ... well in that case ... I guess as soon as these changes make it back to develop, sounds like a good time for releasing ;-) (I guess we've got enough stuff in there already)

christofe-lintermans-actemium commented 6 months ago

Hi everyone,

I have updated PLC4X to version 0.12 which contains the commits for #1007. Unfortunately, the memory leak issue persists.

Does anyone have insights into the root cause of this problem? I am willing to investigate further.

I aim to use PLC4X exclusively as the source/sink library for all my data projects. For S7 and Modbus protocols, the performance and stability are satisfactory, if not for occasional connection interruptions but are handled by NiFi's timeout settings. However, the Out of Memory (OOM) issue with OPC UA is particularly problematic.

Kind regards,

Christofe

splatch commented 6 months ago

However, the Out of Memory (OOM) issue with OPC UA is particularly problematic.

I have hoped that changes we did before 0.11 and later 0.12 would help, but they didn't. Looking at the graph and also brief verification you did with S7 and Modbus it is clear that we have a memory (or other resource) leak within opc ua protocol logic itself. One bug which I know is present is connection hanging when authentication token expires, the refresh of token results in no further traffic on the connection. It could be that we keep generating packets, but OPC server do not answer them leading to growth of heap and crash. The default lifetime I saw with some PLCs is somewhere between 1h15m-1h30m (75-90 minutes) which seem to match, but not exactly, graph you have. The first period in your last snapshot, from 13:30 to 14:00 seem stable, leak gets visible from 14:15. Second period is stable just from 16:15 to 16:30. It gets into troubles instantly after that, as the available reclaimed memory is smaller and smaller.

I did collect some memory dumps with OPC UA simulator, however it did not reveal anything suspicious with few tags. @christofe-lintermans-actemium can you share how many tags you poll and how subscriptions are divided? I will try to build a reproducer too.

qtvbwfn commented 6 months ago

I previously attempted to address this issue on version 0.10.0, where the main problem resided in OpcuaProtocolLogic. It occurred when calling the close method on the PlcConnection, resulting in resources not being properly released. Subsequently, due to significant API changes, I did not pursue further investigation at that time. Since then, for OPCUA handling, I have consistently used the Eclipse Milo client.

@splatch

christofe-lintermans-actemium commented 6 months ago

I previously attempted to address this issue on version 0.10.0, where the main problem resided in OpcuaProtocolLogic. It occurred when calling the close method on the PlcConnection, resulting in resources not being properly released. Subsequently, due to significant API changes, I did not pursue further investigation at that time. Since then, for OPCUA handling, I have consistently used the Eclipse Milo client.

@splatch

Thank you for the info. PLC4X uses eclipse milo underlying for the OPCUA driver.

So I want to avoid a proprietary implementation for OPC UA.

splatch commented 6 months ago

PLC4X uses eclipse milo underlying for the OPCUA driver.

@christofe-lintermans-actemium It used to in 0.8 release. Since 0.10 there is in-house implementation of PLC4X driver which rely on same infrastructure as S7 and Modbus drivers.

@qtvbwfn thank you for the hint, I'll check what's going on there!

christofe-lintermans-actemium commented 6 months ago

@christofe-lintermans-actemium It used to in 0.8 release. Since 0.10 there is in-house implementation of PLC4X driver which rely on same infrastructure as S7 and Modbus drivers.

I didn't look the code in depth yet :)

Is it ok that the unused dependencies are removed to avoid confusion in the future?

splatch commented 6 months ago

I see where the gap between s7 and opcua is, working on test right now.

// EDIT @christofe-lintermans-actemium Milo is still being used in test scope, so we have it there for unit tests. We might change that to isolated test container, but this change is not there yet.

qtvbwfn commented 6 months ago

By testing the latest version and tracking the logs of the OPC UA SERVER, it was found that two connections were created each time, but only one was closed upon termination.

@splatch @chrisdutz @takraj @hutcheb

splatch commented 6 months ago

Looks like pre-flight discovery connection might be part of the trouble.

splatch commented 6 months ago

Please have a look on PR I just created, it makes sure that discovery connection, transaction manager and a lot of threads gets shut.

Chris fixes from few months ago worked fine, however while working on UA security I fixed discovery connection which made its resources hang again. With above fix thread count stays flat, even if new connection is created every second:

christofe-lintermans-actemium commented 6 months ago

@splatch

Can you guide me on how to build the nar file for NiFi including this fix? I already checked out the branch locally. I can quickly test it on the system.

Thank you in advance.

Christofe

splatch commented 6 months ago

I need to tweak tests, cause they caught this change (which is good!) and got stuck while checking lifecycle. Beyond that it works.

You can checkout issue/1101 branch and build & deploy nifi adapter. If nifi has its own component, just swap its version to 0.13-SNAPSHOT after you built a branch to pull artifacts through your local repo.

christofe-lintermans-actemium commented 6 months ago

could it be that the integrations is moved to another repo https://github.com/apache/plc4x-extras?

chrisdutz commented 6 months ago

Yes ... we split up the repo to separate the integration modules, additional tools and examples from the main repository.

christofe-lintermans-actemium commented 6 months ago

Hi all,

I have checked out the specific branch:

I have built the nifi integrations and deployed it:

I didn't had any OOM problems anymore

(red line is time when I replaced the plc4x nar)

Thank you very much for your support, much appreciated!