bytedeco / javacpp

The missing bridge between Java and native C++

JVM stuck forever at `Pointer.physicalBytesInaccurate()` #767

Closed 0x6675636b796f75676974687562 closed 3 weeks ago

0x6675636b796f75676974687562 commented 1 month ago

I'm parsing multiple C++ files with llvm from javacpp-presets.

At some point, after parsing ~100 files (the exact threshold varies), the JVM process gets permanently stuck at Pointer.physicalBytesInaccurate(), not returning even after spending an hour inside this method. The JVM stack trace is:

   java.lang.Thread.State: RUNNABLE
    at app//org.bytedeco.javacpp.Pointer.physicalBytesInaccurate(Native Method)
    at app//org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:705)
    - locked <3542162a> (a java.lang.Class)
    at app//org.bytedeco.javacpp.Pointer.init(Pointer.java:127)
    at app//org.bytedeco.llvm.global.clang.clang_getTypeSpelling(Native Method)
    at app//<private code>
    at app//kotlin.sequences.TransformingSequence$iterator$1.next(Sequences.kt:210)
    at app//kotlin.sequences.SequencesKt___SequencesKt.toCollection(_Sequences.kt:787)
    at app//kotlin.sequences.SequencesKt___SequencesKt.toMutableList(_Sequences.kt:817)
    at app//kotlin.sequences.SequencesKt___SequencesKt.toList(_Sequences.kt:808)
    at app//<private code>
    at app//org.bytedeco.llvm.global.clang.clang_visitChildren(Native Method)
    at app//<private code>
    at platform/jdk.httpserver@17.0.3.1/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:95)
    at platform/jdk.httpserver@17.0.3.1/sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:82)
    at platform/jdk.httpserver@17.0.3.1/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:98)
    at platform/jdk.httpserver@17.0.3.1/sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:733)
    at platform/jdk.httpserver@17.0.3.1/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:95)
    at platform/jdk.httpserver@17.0.3.1/sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:700)
    at java.base@17.0.3.1/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base@17.0.3.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at app//<private code>
    at app//kotlin.concurrent.ThreadsKt$thread$thread$1.run(Thread.kt:30)

As you can see, my code calls clang.clang_getTypeSpelling(), and the call hangs in the Pointer.init() -> Pointer.deallocator() -> Pointer.physicalBytesInaccurate() chain.

The JVM arguments are as follows:

-server
--enable-preview
-Xss2m
-XX:InitialRAMPercentage=80.0
-XX:MaxRAMPercentage=80.0
-XX:MaxRAM=1073741824
-XX:+UseParallelGC
-XX:+CompactStrings
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-XX:+CreateCoredumpOnCrash
-Dorg.bytedeco.javacpp.maxPhysicalBytes=1073741824
-Dfile.encoding=UTF-8
-Djansi.force=true
-Djdk.attach.allowAttachSelf=
-Dsun.stderr.encoding=UTF-8
-Dsun.stdout.encoding=UTF-8

The maximum process memory available to the JVM and Libclang combined is set to 1 GiB (1073741824 bytes), and the maximum heap size is set to 80% of that value (i.e. about 819 MiB).
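
As a sanity check on these numbers, the heap ceiling implied by `-XX:MaxRAM` and `-XX:MaxRAMPercentage` can be computed directly. This is a minimal sketch of the arithmetic only: the names `HeapBudget` and `heapBytes` are made up for illustration, and the JVM may round the real value to its own alignment.

```java
// Hypothetical helper for illustration; not part of JavaCPP or the JDK.
final class HeapBudget {
    /** Heap ceiling implied by -XX:MaxRAM=<maxRamBytes> and -XX:MaxRAMPercentage=<percent>. */
    static long heapBytes(long maxRamBytes, double percent) {
        return (long) (maxRamBytes * percent / 100.0);
    }

    public static void main(String[] args) {
        long maxRam = 1_073_741_824L; // -XX:MaxRAM from the argument list above
        long heap = heapBytes(maxRam, 80.0);
        System.out.println("Implied heap ceiling: " + heap / (1024 * 1024) + " MiB");
        // What the running JVM actually settled on (depends on how it was launched):
        System.out.println("Runtime.maxMemory():  "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024) + " MiB");
    }
}
```

For the settings above this gives roughly 819 MiB of heap, which leaves about 205 MiB of the 1 GiB budget for everything outside the Java heap, including native allocations.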

The reverse call tree obtained from the profiler:

image

saudet commented 1 month ago

Please try to set the "org.bytedeco.javacpp.nopointergc" system property to "true".

0x6675636b796f75676974687562 commented 1 month ago

> Please try to set the "org.bytedeco.javacpp.nopointergc" system property to "true".

Thank you Samuel @saudet, I'll try that and get back to you with my feedback shortly.

For what it's worth, the JVM thread is stuck at NtQueryVirtualMemory@ntdll.dll, probably spinning waiting for some condition to be met:

image

Stack:

image

So I think I'll also play with memory settings (-Xmx and org.bytedeco.javacpp.maxPhysicalBytes) and heap/non-heap/native ratio and see whether there's any change.

0x6675636b796f75676974687562 commented 1 month ago

Samuel @saudet, after some time spent searching, it looks like my issue is similar to tensorflow/java#208, and I have two questions:

  1. What's the difference between setting org.bytedeco.javacpp.noPointerGC to true and org.bytedeco.javacpp.maxPhysicalBytes to zero? According to the code, both effectively disable the JavaCPP-triggered garbage collection.
  2. If the garbage collection is disabled, for the native memory to get freed, is it sufficient to treat Pointer descendants as regular AutoCloseables (i.e. invoke close() in a finally block as soon as I'm done using the object)? The reason I'm asking is that, once I've set noPointerGC, the Working Set size of my Java process continues to grow as the process is running, and the peak Working Set value is now considerably (~1.25x) larger than it used to be with GC enabled (6.2+ GB vs 5.1 GB).

    In my own scenario, I observe the following numbers:

    | org.bytedeco.javacpp.noPointerGC | org.bytedeco.javacpp.maxPhysicalBytes | Peak Working Set |
    |---|---|---|
    | false | 0 | 5040 MB |
    | false | 4096 MB | Process hung at Pointer.physicalBytesInaccurate() |
    | false | 8192 MB | 5128 MB |
    | true | 0 | 4700 MB |
    | true | 6144 MB | 6750 MB |
    | true | 8192 MB | 6250 MB |

saudet commented 1 month ago

When maxPhysicalBytes is 0 it just doesn't try as hard to release memory, that's all.

Yes, Pointer.close() is for that purpose, but it's easier to use PointerScope: http://bytedeco.org/news/2018/07/17/bytedeco-as-distribution/
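
To illustrate the difference between closing each object by hand and relying on a scope, here is a plain-Java sketch of the scoping idea. It does not use JavaCPP: `Scope` and `Handle` are hypothetical stand-ins for `PointerScope` and `Pointer`, and the real classes track and release native memory in their own way.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Plain-Java illustration only; Scope and Handle are hypothetical stand-ins,
// not JavaCPP's PointerScope and Pointer.
final class ScopeSketch {
    /** Records the order in which handles are released, for demonstration. */
    static final List<String> released = new ArrayList<>();

    /** Open scopes; the innermost one is at the head of the deque. */
    private static final Deque<Scope> open = new ArrayDeque<>();

    static final class Handle implements AutoCloseable {
        private final String name;
        private boolean closed;

        Handle(String name) {
            this.name = name;
            Scope current = open.peek();
            if (current != null) {
                current.attached.add(this); // register with the innermost scope
            }
        }

        @Override
        public void close() {
            if (!closed) { // closing twice is harmless
                closed = true;
                released.add(name);
            }
        }
    }

    static final class Scope implements AutoCloseable {
        private final List<Handle> attached = new ArrayList<>();

        Scope() {
            open.push(this);
        }

        @Override
        public void close() {
            open.remove(this);
            for (Handle h : attached) {
                h.close();
            }
        }
    }

    public static void main(String[] args) {
        // Manual style: every handle must be closed explicitly.
        try (Handle a = new Handle("a")) {
            // use a
        }
        // Scoped style: handles created inside the block are released on exit.
        try (Scope ignored = new Scope()) {
            new Handle("b");
            new Handle("c");
        }
        System.out.println(released);
    }
}
```

The point of the scoped style is that objects created indirectly, for example as return values of wrapped native functions, are also collected without the caller having to name each one.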

0x6675636b796f75676974687562 commented 1 month ago

Samuel @saudet, thank you for your response, I really appreciate your feedback.

Unfortunately, adding PointerScope to the mix didn't change much, probably because we were already properly closing all the pointers we controlled. Yet, quite contrary to the experience of your other users, native memory (Working Set) usage doesn't settle at 1 GB, nor at 4 GB. Instead, it keeps growing:

image

even though our used JVM heap is small (< 1 GB):

image

There are indeed minor "drops" in Working Set (and, consequently, Private Bytes) values whenever a Pointer is manually released and/or a PointerScope is exited, but overall the memory keeps growing:

image

Can this issue be caused by the recursive nature of Libclang, with its clang_visitChildren() and CXCursorVisitor API? In that case, JVM and native stack frames are heavily interleaved. Is it possible that PointerScope is not "visible" across a native stack frame?

Can you suggest how we can further diagnose the problem?

We've tried what looks like all possible combinations of property values:

— but made very little progress so far.

0x6675636b796f75676974687562 commented 1 month ago

Samuel @saudet, a few more observations.

Here's the expected growth of the Working Set in the presence of unidentified memory leaks:

image

If I sprinkle the code with more PointerScope instances here and there, memory leaks don't go away -- instead, this merely slows everything down (as you can see, the graph gets scaled horizontally):

image

Finally, even though memory limits and garbage collection are essentially disabled (maxBytes=0, maxPhysicalBytes=0, maxRetries=0, noPointerGC=true), JavaCPP may sometimes still think it has run out of memory; in that case all useful (application) I/O stops and CPU usage hits 80% (in native code):

image

saudet commented 1 month ago

In the case of C APIs we need to release memory manually, so please refer to libclang's documentation.

0x6675636b796f75676974687562 commented 1 month ago

By trial and error, I figured out how to prevent memory leaks when using Libclang, even though its documentation is brief and insufficient.

  1. Basically, when parsing C++ code, there are two phases: first we initialize an instance of CXTranslationUnit via clang_parseTranslationUnit() or clang_parseTranslationUnit2(), then we traverse the AST via clang_visitChildren(). In the first phase, a call to clang_parseTranslationUnit2() should be surrounded with a PointerScope:
        try (final var ignored = new PointerScope()) {
            clang_parseTranslationUnit2(...);
        }
  2. When implementing your CXCursorVisitor, although the documentation says nothing about it, all three call() arguments should be closed when exiting this method, otherwise memory will leak:

    class AstVisitor extends CXCursorVisitor {
        @Override
        public int call(final CXCursor cursor, final CXCursor parent, final CXClientData clientData) {
            try (cursor; parent; clientData) {
                // ...
            }
    
            return CXChildVisit_Recurse;
        }
    }
  3. Additionally, entering a new PointerScope at the beginning of the call() method is also 100% necessary, probably because the outer ("lower") stack frame is a native one (i.e. call() is invoked directly by the native code). Although previously registered ("outer") pointer scopes are still visible, having only a single scope per translation unit (i.e., per AST) rather than per cursor eventually results in 100% usage of all CPU cores in native code.
  4. Finally, a buffer of CXToken instances created with clang_tokenize() needs to be properly disposed of via clang_disposeTokens(). If this is not done, memory will leak no matter what. What most Java engineers, myself included, will forget is to reset the pointer first so that it points to the beginning of the buffer (via CXToken.position(long)). This is very similar in nature to the Buffer.flip() invocation in the Java NIO API. If the pointer is not reset, the call to clang_disposeTokens() will result in a segmentation fault. So the correct usage example would be:
    void forEachToken(
            final CXCursor cursor,
            final Consumer<? super CXToken> action
    ) {
        try (final var extent = clang_getCursorExtent(cursor)) {
            try (final var translationUnit = clang_Cursor_getTranslationUnit(cursor)) {
                try (final var tokens = new CXToken()) {
                    final var tokenCountRef = new int[1];
                    clang_tokenize(translationUnit, extent, tokens, tokenCountRef);
                    final var tokenCount = tokenCountRef[0];
                    try {
                        IntStream.range(0, tokenCount)
                                .mapToObj(tokens::position)
                                .forEach(action);
                    } finally {
                        tokens.position(0L);
                        clang_disposeTokens(translationUnit, tokens, tokenCount);
                    }
                }
            }
        }
    }
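
The `Buffer.flip()` comparison in item 4 can be seen with plain NIO, without Libclang. The following is a minimal sketch that only demonstrates the analogy: a buffer whose position has been advanced no longer "starts" at its first element, much like a `CXToken` pointer that has been moved with `position(long)` no longer refers to the start of the token array. The class name `PositionResetSketch` is made up for illustration.

```java
import java.nio.IntBuffer;

// Illustration of the NIO analogy only; no Libclang or JavaCPP involved.
final class PositionResetSketch {
    /** Sums whatever lies between the buffer's current position and its limit. */
    static int sumRemaining(IntBuffer buffer) {
        int sum = 0;
        while (buffer.hasRemaining()) {
            sum += buffer.get();
        }
        return sum;
    }

    public static void main(String[] args) {
        IntBuffer buffer = IntBuffer.allocate(4);
        // After four puts the position is 4, like a pointer left at the end of the tokens.
        buffer.put(1).put(2).put(3).put(4);

        // Without resetting, a consumer sees nothing: position == limit.
        System.out.println("before reset: " + sumRemaining(buffer));

        // Analogous to tokens.position(0L) before clang_disposeTokens().
        buffer.position(0);
        System.out.println("after reset:  " + sumRemaining(buffer));
    }
}
```

With NIO the mistake only yields an empty read; with a native pointer, handing a shifted address to a dispose function is what leads to the crash described above.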

Once the above is done, the Java application can be safely launched with pointer garbage collection disabled, and in my scenario memory usage stabilizes at around 600 to 700 MB (as opposed to 5 GB with memory leaks):

-Dorg.bytedeco.javacpp.maxBytes=0
-Dorg.bytedeco.javacpp.maxPhysicalBytes=0
-Dorg.bytedeco.javacpp.maxRetries=0
-Dorg.bytedeco.javacpp.noPointerGC=true
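
Since these are ordinary system properties, a start-up check can confirm that they actually reached the JVM (for instance, that a launcher script did not drop them). The sketch below only reads the four properties listed above with `System.getProperty`; the class name `JavaCppFlags` and its methods are hypothetical, and it does not query JavaCPP itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical start-up check; reads system properties only, does not touch JavaCPP.
final class JavaCppFlags {
    private static final String PREFIX = "org.bytedeco.javacpp.";
    private static final String[] NAMES = {"maxBytes", "maxPhysicalBytes", "maxRetries", "noPointerGC"};

    /** Returns each property's value, or "<unset>" if it was not passed to the JVM. */
    static Map<String, String> snapshot() {
        Map<String, String> values = new LinkedHashMap<>();
        for (String name : NAMES) {
            values.put(name, System.getProperty(PREFIX + name, "<unset>"));
        }
        return values;
    }

    /** True only if all four properties carry the values listed above. */
    static boolean matchesNoGcConfig() {
        Map<String, String> v = snapshot();
        return "0".equals(v.get("maxBytes"))
                && "0".equals(v.get("maxPhysicalBytes"))
                && "0".equals(v.get("maxRetries"))
                && "true".equalsIgnoreCase(v.get("noPointerGC"));
    }

    public static void main(String[] args) {
        snapshot().forEach((name, value) -> System.out.println(PREFIX + name + " = " + value));
        System.out.println("matches the configuration above: " + matchesNoGcConfig());
    }
}
```

Logging this once at start-up makes it easier to tell a configuration problem apart from a genuine leak when memory usage looks wrong.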

I've set up a sample repo with my findings available as runnable code.

This issue can be closed. Thank you for your support.

saudet commented 1 month ago

Thanks for the detailed explanations! It would be great if you could contribute sample code that demonstrates all this.

0x6675636b796f75676974687562 commented 1 month ago

> It would be great if you could contribute sample code that demonstrates all this

Definitely. I could merge my source code into a single self-contained Java file and add it to llvm/samples/clang.

Yet, the existing LLVM samples are currently intended to also run on Java 7.

I backported my own samples from Java 17 to Java 7 and 8.

The question is: which version (7, 8, or 17) do you want added to LLVM samples?

saudet commented 1 month ago

It doesn't really matter what version of Java the samples are in, whichever is fine :+1: Although Java 8 is probably the most widely used version at the moment, so that would be best I guess?

0x6675636b796f75676974687562 commented 3 weeks ago

Samuel @saudet, here you go:

saudet commented 3 weeks ago

Thanks for the contribution!