FasterXML / jackson-dataformats-binary

Uber-project for standard Jackson binary format backends: avro, cbor, ion, protobuf, smile
Apache License 2.0

Fix jackson ion memory leak issue #306

Closed — neoremind closed this 2 years ago

neoremind commented 2 years ago

I want to report a memory leak in the Ion backend when decompressing gzip payloads.

How to reproduce the problem

Write a simple program that serializes data into gzip-compressed content and reads it back. IonReader is able to detect the compression format and create a corresponding GZIPInputStream, but unfortunately IonParser's close() method leaves that GZIPInputStream unclosed, so it leaks.
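
For illustration, a minimal reproducer might look like the sketch below (the IonObjectMapper usage, the int[] payload, and the class name are my own assumptions, not code taken from this PR); the Java heap stays flat while the process's resident native memory keeps growing:

import com.fasterxml.jackson.dataformat.ion.IonObjectMapper;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;

public class IonGzipLeakRepro {
    public static void main(String[] args) throws Exception {
        IonObjectMapper mapper = new IonObjectMapper();

        // Serialize a small payload and gzip it; on read, ion-java recognizes
        // the gzip magic bytes and wraps the data in a GZIPInputStream internally.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(mapper.writeValueAsBytes(new int[] { 1, 2, 3 }));
        }
        byte[] gzipped = bos.toByteArray();

        // Deserialize in a loop; before the fix, each iteration leaves the
        // internally created GZIPInputStream (and the native buffers of its
        // Inflater) unclosed, so process RSS grows while the heap stays flat.
        for (long i = 0; i < 100_000_000L; i++) {
            mapper.readValue(gzipped, int[].class);
        }
    }
}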

If you enable Native Memory Tracking (NMT) by adding the tuning flag -XX:NativeMemoryTracking=summary to the JVM options, the native memory summary might look like the output below.

# jcmd <pid> VM.native_memory
Total: reserved=23462331KB, committed=22132799KB
-                 Java Heap (reserved=20971520KB, committed=20971520KB)
                            (mmap: reserved=20971520KB, committed=20971520KB)

-                     Class (reserved=1067371KB, committed=19819KB)
                            (classes #3034)
                            (malloc=363KB #2769)
                            (mmap: reserved=1067008KB, committed=19456KB)

-                    Thread (reserved=185682KB, committed=185682KB)
                            (thread #181)
                            (stack: reserved=184856KB, committed=184856KB)
                            (malloc=627KB #1065)
                            (arena=199KB #348)

-                      Code (reserved=250772KB, committed=9752KB)
                            (malloc=1172KB #2385)
                            (mmap: reserved=249600KB, committed=8580KB)

-                        GC (reserved=864015KB, committed=864015KB)
                            (malloc=53007KB #16552)
                            (mmap: reserved=811008KB, committed=811008KB)

-                  Compiler (reserved=283KB, committed=283KB)
                            (malloc=141KB #272)
                            (arena=142KB #15)

-                  Internal (reserved=75194KB, committed=75194KB)
                            (malloc=75162KB #11649)
                            (mmap: reserved=32KB, committed=32KB)

-                    Symbol (reserved=5339KB, committed=5339KB)
                            (malloc=3765KB #25797)
                            (arena=1575KB #1)

-    Native Memory Tracking (reserved=975KB, committed=975KB)
                            (malloc=20KB #215)
                            (tracking overhead=955KB)

-               Arena Chunk (reserved=220KB, committed=220KB)
                            (malloc=220KB)

-                   Unknown (reserved=40960KB, committed=0KB)
                            (mmap: reserved=40960KB, committed=0KB)

From this output, the JVM accounts for only about 23 GB in total: a ~20 GB heap plus ~3 GB of metadata space, thread stacks, code cache, symbols, GC structures, and direct memory.

But then I run pmap -x <pid> | sort -nr -k2 to check native memory usage. There are many anonymous memory chunks of 64 MB that are almost completely filled (or pairs of chunks X and Y where sizeof(X+Y) = 64 MB).

00000002c0000000 20973824 10624288 10624288 rw---   [ anon ]
00000007c0240000 1046272       0       0 -----   [ anon ]
00007ffa280fd000  744460   15372   15372 rw---   [ anon ]
00007ffacd840000  237312       0       0 -----   [ anon ]
00007ff780000000  131056  131040  131040 rw---   [ anon ]
00007ffa01430000  110400     376       0 r---- locale-archive
00007ff748000000   72300   70376   70376 rw---   [ anon ]
00007ff958000000   65536   65488   65488 rw---   [ anon ]
00007ff94c000000   65536   65488   65488 rw---   [ anon ]
00007ff948000000   65536   65488   65488 rw---   [ anon ]
00007ff944000000   65536   65484   65484 rw---   [ anon ]
00007ff940000000   65536   65516   65516 rw---   [ anon ]
00007ff93c000000   65536   65516   65516 rw---   [ anon ]
00007ff938000000   65536   65488   65488 rw---   [ anon ]
00007ff934000000   65536   65492   65492 rw---   [ anon ]
00007ff928000000   65536   65472   65472 rw---   [ anon ]
00007ff91c000000   65536   65492   65492 rw---   [ anon ]
00007ff874000000   65536   65528   65528 rw---   [ anon ]
00007ff790000000   65536   65528   65528 rw---   [ anon ]
00007ff754000000   65536    1248    1248 rw---   [ anon ]
00007ff950000000   65532   65480   65480 rw---   [ anon ]
00007ff7e8000000   65532   65524   65524 rw---   [ anon ]
00007ff7dc000000   65532   65504   65504 rw---   [ anon ]
00007ff7d4000000   65532   65516   65516 rw---   [ anon ]
00007ff7c0000000   65532   65524   65524 rw---   [ anon ]
00007ff778000000   65532   65524   65524 rw---   [ anon ]
00007ff72c000000   65532   50588   50588 rw---   [ anon ]
00007ff880000000   65528   65520   65520 rw---   [ anon ]
00007ff87c000000   65528   65520   65520 rw---   [ anon ]
00007ff878000000   65528   65520   65520 rw---   [ anon ]
00007ff870000000   65528   65508   65508 rw---   [ anon ]
00007ff86c000000   65528   65520   65520 rw---   [ anon ]
00007ff868000000   65528   65488   65488 rw---   [ anon ]
00007ff864000000   65528   65520   65520 rw---   [ anon ]
00007ff860000000   65528   65516   65516 rw---   [ anon ]
00007ff858000000   65528   65520   65520 rw---   [ anon ]
00007ff850000000   65528   65520   65520 rw---   [ anon ]
00007ff818000000   65528   65520   65520 rw---   [ anon ]
00007ff7ec000000   65528   65492   65492 rw---   [ anon ]

The machine running the program has a CPU with 8 physical cores (16 hyper-threaded cores) and 64 GB of physical memory; the JVM is

openjdk version "1.8.0_302"
OpenJDK Runtime Environment (build 1.8.0_302-b08)
OpenJDK 64-Bit Server VM (build 25.302-b08, mixed mode)

Take a snapshot of any of these memory chunks by running the command below to check what is stored inside.

gdb --batch --pid 51689 -ex "dump memory test.dum 0x7ff874000000  0x7ff87400ffff"

Open the dump (e.g. with view) to see human-readable content; it is exactly the data we serialized/deserialized.

The default native memory allocator is glibc's malloc; the JVM calls glibc's malloc to allocate native memory. To optimize for SMP architectures with many threads, glibc scales up the number of arenas available for allocation in order to avoid lock contention. An arena is 64 MB large, and the upper limit is 8 arenas per core. An arena is created on demand when a thread tries to allocate from an arena that is already locked, so the count grows over time; here, with 16 cores, the upper limit is 8 GB.

Let's do the math: the JVM manages 23 GB, and since we run 128 threads, glibc could create up to 8 × 16 = 128 arenas taking up to 8 GB, so the maximum size should be ~31 GB. But in my test environment the process occupies up to 36 GB and keeps growing over time. This is because when a glibc arena fills up due to the memory leak, it extends itself with a new heap chained to the old one (the old heap holds a pointer to the new one), so the number of heaps can exceed the limit of 8 times the number of cores (refer to the "Thread Arena with multiple heaps" part of the linked write-up).
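
Spelling out the arithmetic (all figures are the ones quoted above):

max arenas        = 8 × cores = 8 × 16          = 128
max arena memory  = 128 × 64 MB                 = 8 GB
expected ceiling  = 23 GB (JVM-managed) + 8 GB  ≈ 31 GB
observed          = up to 36 GB and growing, because full arenas chain on extra heaps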

How to verify the fix?

After the fix, rerunning the program shows that the excessive memory consumption disappears, and pmap shows that the RSS of those 64 MB memory chunks is mostly 0.

00007fdd68021000   65404       0       0 -----   [ anon ]
00007fdd64021000   65404       0       0 -----   [ anon ]
00007fdd60021000   65404       0       0 -----   [ anon ]
00007fdd5c021000   65404       0       0 -----   [ anon ]
00007fdd58021000   65404       0       0 -----   [ anon ]

Why is the problem hard to find?

If I run jmap -histo:live <pid>, there are many Finalizer instances. This is because java.util.zip.Inflater relies on its finalize() method to free the underlying natively allocated (malloc/mmap) memory (see the JNI code below).

public void end() {
        synchronized (zsRef) {
            long addr = zsRef.address();
            zsRef.clear();
            if (addr != 0) {
                end(addr);
                buf = null;
            }
        }
    }

    /**
     * Closes the decompressor when garbage is collected.
     */
    protected void finalize() {
        end();
    }

Inflater only frees native memory when end() is called; see http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/native/java/util/zip/Inflater.c

JNIEXPORT void JNICALL
Java_java_util_zip_Inflater_end(JNIEnv *env, jclass cls, jlong addr)
{
    if (inflateEnd((z_stream *)jlong_to_ptr(addr)) == Z_STREAM_ERROR) {
        JNU_ThrowInternalError(env, 0);
    } else {
        free((z_stream *)jlong_to_ptr(addr));
    }
}

From the garbage collector's point of view, the JVM creates a "watchdog" for each finalizable instance: an instance of Finalizer. When an Inflater becomes eligible for GC, the GC adds its Finalizer to a special queue, java.lang.ref.Finalizer.ReferenceQueue, and the object is marked as finalized. A Finalizer daemon thread runs a loop that dequeues Finalizer instances from the queue and executes their finalize() methods. Only after that can the Inflater be reclaimed by GC; end() finally gets a chance to run, and the underlying malloc/mmap memory can be freed.

This turns out to be a non-deterministic problem: if the Finalizer daemon thread calls finalize() faster than we create new Inflater instances, things can work out. But that is not guaranteed, because:

When GC cannot keep up with the rate of Inflater creation, Inflaters are not cleaned up quickly. The balance is fragile: the allocated native memory can grow, and once it has grown it is retained by the memory allocator (glibc) for reuse and may never be returned to the OS, so the overall memory of the process grows over time.

So the leaked memory is not freed until the JVM's Finalizer thread runs, which explains why the memory leak grows slowly over time.

neoremind commented 2 years ago

@cowtowncoder could you help me review this PR?

cowtowncoder commented 2 years ago

@neoremind Yes, I will -- it may take a bit due to holidays but I will prioritize this as it looks like an easy but important patch. Thank you very much for submitting it!

One quick thing: unless we have a CLA from you, we would need one from:

https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf

so I can accept the contribution. This only ever needs to be done once and CLA works for all future contributions. The usual simple(st) way is to print the 1-page document, fill & sign, scan/photo, email to info at fasterxml dot com.

Looking forward to merging this patch and thank you once again!

neoremind commented 2 years ago

@cowtowncoder I have signed the contributor agreement. I understand this is the holiday period, so there is no hurry; take your time with the review. Thank you very much and happy holidays! 😄

cowtowncoder commented 2 years ago

@neoremind Thank you for sending the CLA, I got it. But now I realize I do need to get someone with more knowledge of IonReader to code review this.

@mcliedtke or @jobarr-amzn could you help here (or know someone who can)?

My main question concerns ownership of the passed-in IonReader: with other Jackson modules there is a distinction between input sources "owned" by Jackson (which Jackson then ought to close) and ones that are passed in and should only be closed if a feature is enabled. For JSON, for example, we have:

        if (_inputStream != null) {
            if (_ioContext.isResourceManaged() || isEnabled(Feature.AUTO_CLOSE_SOURCE)) {
                _inputStream.close();
            }

and it might be necessary to do something similar. Or maybe instances should always be closed; this depends on intent, and how readers are constructed and passed to IonParser.

A quick look suggests it is probably safe (and/or things are more complicated), as we have calls like:

    protected JsonParser _createParser(Reader r, IOContext ctxt)
        throws IOException
    {
        return new IonParser(_system.newReader(r), _system, ctxt, getCodec(), _ionParserFeatures);
    }

in which the actual underlying input source is further wrapped by an IonReader.
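
For illustration only, a hedged sketch of what "something similar" could look like in IonParser.close() — this is not the actual patch, and the field names (_reader, _ioContext, _closed) are assumptions about IonParser's internals:

@Override
public void close() throws IOException {
    if (!_closed) {
        _closed = true;
        // Close the IonReader only when Jackson owns the resource or
        // AUTO_CLOSE_SOURCE is enabled, mirroring the JSON backend's check above.
        if (_ioContext.isResourceManaged() || isEnabled(Feature.AUTO_CLOSE_SOURCE)) {
            // Also closes a wrapped GZIPInputStream, releasing the Inflater's native buffers.
            _reader.close();
        }
    }
}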

cowtowncoder commented 2 years ago

Will wait for a bit, but unless I hear anything against the patch I will probably merge it. However, it will likely only go into 2.13, as there are no plans for further 2.12 releases.

tgregg commented 2 years ago

@cowtowncoder I'll represent the ion-java development team on this one.

Your concerns about ownership of the IonReader are valid. It would not be correct for the IonParser to close() IonReader instances provided by the user. However, the IonParser should close IonReader instances that it creates and uses only internally (such as in the example you gave, where the user provided a Reader).

The tricky part is that some of the data sources the user can provide are themselves Closeable (Reader and InputStream). When an IonReader is created to wrap a Closeable data source, IonReader.close() will call close() on that data source. My understanding is this would violate Jackson's API contract if the user has disabled Feature.AUTO_CLOSE_SOURCE (which is enabled by default). It seems this could be avoided if ion-java provided an IonReaderBuilder option similar to Feature.AUTO_CLOSE_SOURCE, but I haven't yet thought through all the implications of this.

Absent such an option, it feels like we'd have to either 1) deprecate the IonFactory methods that accept such resources and require the user to provide their own IonReader over those resources (the user would retain responsibility for closing IonReader instances passed into Jackson factory methods), or 2) require Feature.AUTO_CLOSE_SOURCE to be true when using these methods, which could break existing users.

@neoremind While we're sorting this out, can you use the workaround of creating your JsonParser via IonFactory.createParser(IonReader), providing an IonReader that your application instantiates and closes?
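
For reference, a sketch of that workaround (the value type, variable names, and use of IonReaderBuilder are illustrative assumptions, not code from this thread):

import com.amazon.ion.IonReader;
import com.amazon.ion.system.IonReaderBuilder;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.dataformat.ion.IonFactory;
import com.fasterxml.jackson.dataformat.ion.IonObjectMapper;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class OwnReaderWorkaround {
    static int[] readPayload(byte[] gzipped) throws Exception {
        IonFactory factory = new IonFactory();
        IonObjectMapper mapper = new IonObjectMapper(factory);
        // The application owns and closes both the GZIPInputStream and the
        // IonReader, so the Inflater's native memory is released deterministically
        // regardless of what IonParser.close() does.
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped));
             IonReader reader = IonReaderBuilder.standard().build(in)) {
            JsonParser p = factory.createParser(reader);
            return mapper.readValue(p, int[].class);
        }
    }
}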

cowtowncoder commented 2 years ago

Thank you @tgregg. This also makes me think that changes may need to go in 2.14 -- deprecation markers for sure cannot be added in a patch.

neoremind commented 2 years ago

@tgregg @cowtowncoder Thanks for your deep dive into this patch. I will investigate the workaround of creating a JsonParser via IonFactory.createParser(IonReader), providing an IonReader that my application instantiates and closes.

I agree that whoever owns the IonReader should be responsible for closing the resource. I understand that the patch as currently implemented might misbehave with respect to the configuration semantics. In the meantime, the Java finalizer serves as a fallback mechanism that guards the release of the resource, so I suppose we can get by without too much churn. However, the risk to production stability is still there; I hope this issue can be solved in 2.13 and 2.14 so that the risk is mitigated for all Ion users.

cowtowncoder commented 2 years ago

I will defer to the Ion maintainers on this; I will be happy to help as needed but will not merge the patch without approval. So please let me know if I am needed for merging etc. based on the decisions.

cowtowncoder commented 2 years ago

@tgregg @neoremind There probably hasn't been any progress here, but thought I'd ping -- I am happy to help if I can. 2.14 is still a bit off so there is no hurry but obviously would be nice to proceed if there is something to improve here. Or if not, maybe close the PR.

cowtowncoder commented 2 years ago

Closing in favor of #325.