takraj closed this issue 6 months ago
I'll have a look at this as it seems no one else is currently planning on doing so ... hopefully this doesn't require too much knowledge of the OPC-UA protocol.
Having seen now that you were using the pre-0.10.0 syntax for creating the connection, I think you have stumbled over a problem that we have already addressed in 0.10.0-SNAPSHOT ... we're planning on releasing that in the next few days. Would be cool if you could check whether this is fixed for your case too.
I am currently running an updated version of your program in JProfiler and am not seeing any increase in memory usage.
Ok ... after running it for almost two hours ... there's definitely something going on ... not sure if it's because I'm also running the server in the same VM ... It's consuming quite a bit more memory and a lot more CPU time.
Ok ... just noticed I was running it in the IntelliJ profiler, not JProfiler ... I split up the application into a server (that just starts milo) and a client (with your code) ... when profiling only the client I can see a constant increase in sleeping threads ... I should probably figure out why this is happening and fix it before the 0.11.0 release ...
And they all seem to be related to the nio event group:
Hmpf ... I thought we had addressed that issue, and I guess it probably also has an effect on all other drivers too ... Usually opening and closing connections shouldn't be the default case, as the connection-cache should be used, but I think having something stuck will also be the reason why some applications simply don't stop gracefully.
Should have been this image
YAY ... I found it :-) It wasn't the event loops ... they are correctly closed. It was the NettyHashTimerTimeoutManager that we start for every connection but never explicitly close ... by overriding the close() method of ChannelDuplexHandler in Plc4xNettyWrapper and explicitly closing it, the leak of open threads is gone ... you can see a commons-pool growing at the start, but as soon as it's reached its 11 threads the thread-count stays constant :-)
Please give the current 0.11.0-SNAPSHOT version a try and check if the problem is gone.
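To illustrate the leak pattern described above with a self-contained sketch: the example below uses a plain java.util.Timer to stand in for Netty's HashedWheelTimer-backed NettyHashTimerTimeoutManager, and the Connection class is hypothetical, not PLC4X code. The point is the same: a per-connection timer thread leaks unless close() explicitly cancels it.

```java
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical sketch: each "connection" starts its own timeout timer,
// mirroring how a timeout manager was started per connection. If close()
// forgets to cancel the timer, its background thread leaks.
public class TimeoutLeakDemo {

    static class Connection implements AutoCloseable {
        private final Timer timeoutTimer = new Timer("timeout-manager", true);

        Connection() {
            // Periodic task standing in for request-timeout checks
            timeoutTimer.schedule(new TimerTask() {
                @Override public void run() { /* check for timed-out requests */ }
            }, 1000, 1000);
        }

        @Override
        public void close() {
            // The fix: explicitly cancel the per-connection timer here,
            // analogous to closing the timeout manager when the channel closes.
            timeoutTimer.cancel();
        }
    }

    public static void main(String[] args) throws Exception {
        int before = Thread.activeCount();
        for (int i = 0; i < 50; i++) {
            try (Connection c = new Connection()) {
                Thread.sleep(1);
            }
        }
        Thread.sleep(200); // give cancelled timer threads time to exit
        int after = Thread.activeCount();
        System.out.println("leaked threads: " + Math.max(0, after - before));
    }
}
```

Without the timeoutTimer.cancel() call in close(), the thread count grows by one for every connection opened; with it, the count stays flat.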
@chrisdutz I have tried the revision tagged with v0.11.0, but the problem does not seem to be resolved yet. :(
~/plc4x$ git log -1
commit d22042ef079ae93120770a95704571118d9c871e (HEAD, tag: v0.11.0)
Author: Christofer Dutz <cdutz@apache.org>
Date: Mon Oct 2 09:51:59 2023 +0200
[maven-release-plugin] prepare release v0.11.0
As you can see on the screenshot below, the heap usage is still constantly increasing:
Tested with the following code:
public class MemoryLeakDemo {
    public static void main(String[] args) throws Exception {
        PlcDriverManager driverManager = new DefaultPlcDriverManager();
        PlcConnectionManager connectionManager = driverManager.getConnectionManager();
        AtomicBoolean running = new AtomicBoolean(true);
        // The shutdown hook must set the flag to false, otherwise the loop never stops.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> running.set(false)));
        while (running.get()) {
            PlcConnection connection = connectionManager.getConnection("opcua:tcp://127.0.0.1:12686/milo?discovery=false");
            try {
                Thread.sleep(100);
            } finally {
                connection.close();
            }
        }
    }
}
I think I have already handled the issue of thread pool leakage, but I'm not sure how to release the related resources.
V0.11.0
please help me @chrisdutz @takraj @hutcheb
dump file:
What happened?
The OPC-UA driver seems to leak memory when closing a connection object. This issue sooner or later leads to an application crash with an OOM error. A workaround would be to reuse connection objects, but then we face issue #1100, which I believe renders a fix for this pretty urgent.
Example code to reproduce the issue:
PlcDriverManager driverManager = new PlcDriverManager();
AtomicBoolean running = new AtomicBoolean(true);
// The shutdown hook must set the flag to false, otherwise the loop never stops.
Runtime.getRuntime().addShutdownHook(new Thread(() -> running.set(false)));
while (running.get()) {
    PlcConnection connection = driverManager.getConnection("opcua:tcp://127.0.0.1:40185/milo?discovery=false");
    try {
        Thread.sleep(100);
    } finally {
        connection.close();
    }
}
Some additional charts
Logs
logs-before-oom.txt
Heap dumps
- MemoryLeakDemo_1045304_23_09_2023_12_41_27.zip
- MemoryLeakDemo_1045304_23_09_2023_13_02_21.zip
- MemoryLeakDemo_1045304_23_09_2023_14_05_15.zip
Version
v0.10.0
Programming Languages
- [x] plc4j
- [ ] plc4go
- [ ] plc4c
- [ ] plc4net
Protocols
- [ ] AB-Ethernet
- [ ] ADS /AMS
- [ ] BACnet/IP
- [ ] CANopen
- [ ] DeltaV
- [ ] DF1
- [ ] EtherNet/IP
- [ ] Firmata
- [ ] KNXnet/IP
- [ ] Modbus
- [x] OPC-UA
- [ ] S7

Hi: It took me about three days, and I have fixed the thread and memory overflow issues in the v0.11.0 version. Can you create a new branch 0.11.1-SNAPSHOT for me to submit my code? The main issue is that the keepAlive in SecureChannel causes the channel to be occupied for a long time without being released. @chrisdutz @takraj @hutcheb
In the world of PLCs, connections should generally be reused, as for most protocols establishing a connection is pretty cost-intensive. I know that OPC UA needs to resolve some things when connecting.
So I agree, that both issues should be addressed.
Just not sure when. If nobody else takes on the issue, generally opc-ua related issues are pretty low on my personal priority list 😉
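To make the connection-reuse idea above concrete: recent PLC4X versions ship a caching connection manager (CachedPlcConnectionManager) for exactly this purpose, but the pattern can be sketched generically. Everything below is a hypothetical illustration, not the PLC4X API: a cache keyed by connection string so that the expensive connect happens once, and callers share the cached instance instead of opening and closing a fresh connection per request.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a connection cache keyed by connection string.
public class ConnectionCacheDemo {

    static int connectCount = 0; // tracks how many real connects happened

    static class Connection {
        final String url;
        Connection(String url) {
            this.url = url;
            connectCount++; // stands in for the costly connection setup
        }
    }

    static final Map<String, Connection> CACHE = new ConcurrentHashMap<>();

    static Connection getConnection(String url) {
        // computeIfAbsent connects only on the first request for a given URL;
        // every later request reuses the cached connection.
        return CACHE.computeIfAbsent(url, Connection::new);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            getConnection("opcua:tcp://127.0.0.1:12686/milo?discovery=false");
        }
        System.out.println("connects: " + connectCount); // prints "connects: 1"
    }
}
```

A real cache also needs lease/expiry handling so idle connections are eventually closed; this sketch only shows why reuse sidesteps the open/close-per-request leak discussed in this thread.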
Hi All,
I have the same issue, but when using Apache NiFi integration.
Should I create a new issue, or is this issue 1101 sufficient?
Could a priority be assigned to address this issue?
Thank you in advance!
I would strongly assume it's the same issue, as to my knowledge all integration modules internally use the same components. So this issue should probably be observable in all integration modules.
@christofe-lintermans-actemium I believe that #1007 might address some of these issues, as it cleans up some of the driver internals which could lose consistency over the last two releases (0.10-0.11) and fall behind on stability. I am awaiting feedback on field testing with real devices, which should be completed this or next week, after which I would like to merge these changes to develop. Then the Apache NiFi team will be able to produce a new development build which should be free of (major) memory leaks.
Aaaah ... cool ... well in that case ... I guess as soon as these changes make it back to develop, sounds like a good time for releasing ;-) (I guess we've got enough stuff in there already)
Hi everyone,
I have updated PLC4X to version 0.12 which contains the commits for #1007. Unfortunately, the memory leak issue persists.
Does anyone have insights into the root cause of this problem? I am willing to investigate further.
I aim to use PLC4X exclusively as the source/sink library for all my data projects. For the S7 and Modbus protocols, the performance and stability are satisfactory, aside from occasional connection interruptions, which are handled by NiFi's timeout settings. However, the Out of Memory (OOM) issue with OPC UA is particularly problematic.
Kind regards,
Christofe
However, the Out of Memory (OOM) issue with OPC UA is particularly problematic.
I had hoped that the changes we did before 0.11 and later 0.12 would help, but they didn't. Looking at the graph, and also the brief verification you did with S7 and Modbus, it is clear that we have a memory (or other resource) leak within the OPC UA protocol logic itself. One bug which I know is present is the connection hanging when the authentication token expires; the refresh of the token results in no further traffic on the connection. It could be that we keep generating packets, but the OPC server does not answer them, leading to heap growth and a crash. The default lifetime I saw with some PLCs is somewhere between 1h15m and 1h30m (75-90 minutes), which seems to match, though not exactly, the graph you have. The first period in your last snapshot, from 13:30 to 14:00, seems stable; the leak gets visible from 14:15. The second period is stable just from 16:15 to 16:30. It gets into trouble instantly after that, as the reclaimable memory gets smaller and smaller.
I did collect some memory dumps with an OPC UA simulator, however it did not reveal anything suspicious with a few tags. @christofe-lintermans-actemium can you share how many tags you poll and how the subscriptions are divided? I will try to build a reproducer too.
I previously attempted to address this issue on version 0.10.0, where the main problem resided in OpcuaProtocolLogic. It occurred when calling the close method on the PlcConnection, resulting in resources not being properly released. Subsequently, due to significant API changes, I did not pursue further investigation at that time. Since then, for OPCUA handling, I have consistently used the Eclipse Milo client.
@splatch
Thank you for the info. PLC4X uses Eclipse Milo under the hood for the OPC UA driver.
So I want to avoid a proprietary implementation for OPC UA.
PLC4X uses Eclipse Milo under the hood for the OPC UA driver.
@christofe-lintermans-actemium It used to, in the 0.8 release. Since 0.10 there is an in-house implementation of the PLC4X driver which relies on the same infrastructure as the S7 and Modbus drivers.
@qtvbwfn thank you for the hint, I'll check what's going on there!
@christofe-lintermans-actemium It used to, in the 0.8 release. Since 0.10 there is an in-house implementation of the PLC4X driver which relies on the same infrastructure as the S7 and Modbus drivers.
I didn't look at the code in depth yet :)
Is it OK if the unused dependencies are removed, to avoid confusion in the future?
I see where the gap between S7 and OPC UA is; working on a test right now.
// EDIT @christofe-lintermans-actemium Milo is still being used in test scope, so we have it there for unit tests. We might change that to an isolated test container, but this change is not there yet.
By testing the latest version and tracking the logs of the OPC UA server, I found that two connections were created each time, but only one was closed upon termination.
@splatch @chrisdutz @takraj @hutcheb
Looks like the pre-flight discovery connection might be part of the trouble.
Please have a look at the PR I just created; it makes sure that the discovery connection, the transaction manager and a lot of threads get shut down.
Chris's fixes from a few months ago worked fine, however while working on UA security I fixed the discovery connection, which made its resources hang again. With the above fix the thread count stays flat, even if a new connection is created every second:
@splatch
Can you guide me on how to build the nar file for NiFi including this fix? I already checked out the branch locally. I can quickly test it on the system.
Thank you in advance.
Christofe
I need to tweak the tests, because they caught this change (which is good!) and got stuck while checking the lifecycle. Beyond that it works.
You can check out the issue/1101 branch and build & deploy the NiFi adapter. If NiFi has its own component, just swap its version to 0.13-SNAPSHOT after you build the branch, to pull the artifacts through your local repo.
Could it be that the integrations were moved to another repo: https://github.com/apache/plc4x-extras?
Yes ... we split up the repo to separate the integration modules, additional tools and examples from the main repository.
Hi all,
I have checked out the specific branch:
I have built the nifi integrations and deployed it:
I haven't had any OOM problems anymore.
(red line is time when I replaced the plc4x nar)
Thank you very much for your support, much appreciated!