FasterXML / jackson-core

Core part of Jackson that defines Streaming API as well as basic shared abstractions
Apache License 2.0

InternCache - replace synchronized with `ReentrantLock` #1251

Closed pjfanning closed 7 months ago

pjfanning commented 7 months ago

If we are going all-in on ReentrantLocks - I would still like to do each change in its own PR so that each change gets looked at on its own merits.
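For context, the change under discussion has roughly this shape (a minimal sketch, not the actual jackson-core source; the class name and the MAX_ENTRIES value here are illustrative): the size-limit flush in `InternCache.intern()` moves from a `synchronized` block to an explicit `ReentrantLock`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch of the shape of the change: InternCache's size-limit
// enforcement guarded by an explicit ReentrantLock instead of synchronized.
public final class InternCacheSketch extends ConcurrentHashMap<String, String> {
    private static final int MAX_ENTRIES = 180; // illustrative limit
    private final ReentrantLock _lock = new ReentrantLock();

    public String intern(String input) {
        String result = get(input);
        if (result != null) {
            return result;
        }
        // Flush the whole cache when it fills up; the lock guards the flush
        // so that concurrent threads do not all clear() at once.
        if (size() >= MAX_ENTRIES) {
            _lock.lock();
            try {
                if (size() >= MAX_ENTRIES) {
                    clear();
                }
            } finally {
                _lock.unlock();
            }
        }
        result = input.intern();
        put(result, result);
        return result;
    }
}
```

The repeated `size()` check after acquiring the lock keeps threads that queued up behind the flushing thread from clearing the freshly-emptied map again.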

cowtowncoder commented 7 months ago

I am fine with this if there's something to suggest performance is better (even if just for newer JVMs).

pjfanning commented 7 months ago

> I am fine with this if there's something to suggest performance is better (even if just for newer JVMs).

@cowtowncoder so you need micro benchmarks to prove this?

cowtowncoder commented 7 months ago

@pjfanning I'd be fine with tests done for a similar use case somewhere else, an article etc. Just in case there are counter-examples. Or more like a sanity check, so we aren't just making things worse with no benefit.

So no, I don't absolutely require one specifically for this change.

Although... if easy enough, it'd of course be great and this case might be easy enough.

pjfanning commented 7 months ago

@cowtowncoder I can guarantee that you can construct use cases that show worse performance (older JDKs - and probably also newer JDKs if you don't use virtual threads) and use cases that show better performance (virtual threads on the latest JDKs). I think benchmarking this is fairly pointless. We need to decide whether we want to make virtual-thread users happy while imposing a small penalty on all other users - or whether we want to make this configurable so that everyone is happy.

cowtowncoder commented 7 months ago

@pjfanning Ok I was hoping for "tends to be no-slower and possibly faster on newer JVMs" kind of general results. Or even just "not meaningfully slower", given it's useful for virtual threads.

Making this particular thing configurable is difficult to wire up, so I am not really looking forward to doing that.

With that... yeah, jmh based test for 2 cases would be great, perhaps for

https://github.com/FasterXML/jackson-benchmarks/

with some number of threads (100?), interning a repeating set of Strings.

We don't have to decide on this right away either; we can keep this open to discuss etc. I only hesitate because this is a potential performance hot spot.
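The suggested jmh benchmark in jackson-benchmarks would be the right tool for a real measurement. As a rough, stdlib-only illustration of the workload shape (many threads repeatedly interning a small repeating set of Strings through a shared guard), something like the following hypothetical sketch could be a starting point:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, stdlib-only approximation of the suggested benchmark: N threads
// repeatedly "intern" a small repeating set of Strings through a shared guard.
// A real measurement should use JMH (e.g. in jackson-benchmarks).
public class InternContentionSketch {
    static final String[] KEYS = { "id", "name", "value", "ts" };

    // Runs `iterations` executions of `critical` on each of `threads` threads,
    // returning wall-clock nanos for the whole run.
    static long run(Runnable critical, int threads, int iterations) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(threads);
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < iterations; i++) {
                    critical.run();
                }
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
        Object monitor = new Object();
        ReentrantLock lock = new ReentrantLock();
        AtomicLong counter = new AtomicLong();
        Runnable withSync = () -> {
            String k = KEYS[(int) (counter.incrementAndGet() % KEYS.length)];
            synchronized (monitor) { cache.putIfAbsent(k, k.intern()); }
        };
        Runnable withLock = () -> {
            String k = KEYS[(int) (counter.incrementAndGet() % KEYS.length)];
            lock.lock();
            try { cache.putIfAbsent(k, k.intern()); } finally { lock.unlock(); }
        };
        System.out.printf("synchronized:  %d ms%n", run(withSync, 8, 200_000) / 1_000_000);
        System.out.printf("ReentrantLock: %d ms%n", run(withLock, 8, 200_000) / 1_000_000);
    }
}
```

Note that a bare wall-clock comparison like this is noisy (no warmup, no forking), which is exactly why JMH was suggested.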

cowtowncoder commented 7 months ago

Wow. This thing ("ReentrantLock vs synchronized") has been around for a while, so many hits from 10 years ago :)

cowtowncoder commented 7 months ago

Hmmh. Not much useful info in articles like:

https://medium.com/javarevisited/synchronized-vs-reentrantlock-how-to-choose-cfb5306080e7

which basically suggests synchronized is fine except for heavily-contended cases. Which may or may not be the case here -- in normal use InternCache is only called when a symbol table lookup fails, which is a sign of poor JsonFactory reuse. (It is also called for JsonDeserializers by databind -- but similarly, deserializers are cached, so that should be a light load.)

Then again, thinking out loud: since it only matters in the high-contention case, and rarely for "normal" (good) usage, there is little downside to doing this. So I am leaning towards merging this change: we do get reports of high lock contention from time to time.

pjfanning commented 7 months ago

I've done some testing in my own jmh project - https://github.com/pjfanning/jackson-nest-bench/blob/main/src/jmh/java/org/example/jackson/bench/InternCacheBench.java

With 5 threads, I'm finding that the InternCache with ReentrantLock calls, as in this PR, is slower than with synchronized.

I did make a change to use tryLock instead, so that if another thread has the lock, the current thread skips the clear section. This runs faster. It does mean that under heavy contention the cache can grow past the max size, but before too long some thread will get the lock and clear down the cache.

It would take a pretty extraordinary set of circumstances for the cache size to grow a large amount over the max size. If this were a real worry, we could do something like the following (I don't think this extra complexity is warranted):

        // GUIDELINE_MAX_ENTRIES = 180
        // ABSOLUTE_MAX_ENTRIES = 200
        // In new code we should never get more than ABSOLUTE_MAX_ENTRIES, but once we reach
        // GUIDELINE_MAX_ENTRIES we do try to clear() -- without blocking on the lock unless
        // we have reached ABSOLUTE_MAX_ENTRIES
        if (size() >= GUIDELINE_MAX_ENTRIES) {
            if (_lock.tryLock()) {
                try {
                    if (size() >= GUIDELINE_MAX_ENTRIES) {
                        clear();
                    }
                } finally {
                    _lock.unlock();
                }
            } else if (size() >= ABSOLUTE_MAX_ENTRIES) {
                _lock.lock();
                try {
                    if (size() >= GUIDELINE_MAX_ENTRIES) {
                        clear();
                    }
                } finally {
                    _lock.unlock();
                }
            }
        }

cowtowncoder commented 7 months ago

Yeah, no, let's keep things simple. Thank you for checking out the performance aspects with a jmh benchmark.

I think I'm ok with slower operation for low concurrency cases.

One ask: could you add an entry in release-notes/VERSION-2.x for this change (ok to refer to PR)?

pjfanning commented 7 months ago

@cowtowncoder this is the simpler change that I would like to make to this PR. It can let the cache grow a little bigger than the max, but it is faster than the change currently in this PR:

        if (size() >= MAX_ENTRIES) {
            if (lock.tryLock()) {
                try {
                    if (size() >= MAX_ENTRIES) {
                        clear();
                    }
                } finally {
                    lock.unlock();
                }
            }
        }

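The key property this relies on is that `ReentrantLock.tryLock()` returns false immediately when another thread holds the lock, so contended threads skip the clear() instead of queueing behind it. A tiny stdlib demonstration (class name is hypothetical):

```java
import java.util.concurrent.locks.ReentrantLock;

// Demonstrates the tryLock() semantics the patch relies on: a thread that
// fails to acquire the lock skips the critical section instead of blocking.
public class TryLockDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        lock.lock(); // main thread holds the lock
        final boolean[] acquired = new boolean[1];
        Thread other = new Thread(() -> acquired[0] = lock.tryLock());
        other.start();
        other.join();
        // The other thread returned immediately with false instead of waiting.
        System.out.println("other thread acquired: " + acquired[0]); // false
        lock.unlock();
    }
}
```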
cowtowncoder commented 7 months ago

@pjfanning yes, that's fine; max-size enforcement can be approximate, and if this is faster for the common case, good.

Just let me know when this is ready (I guess once converted to non-Draft), will merge.

pjfanning commented 7 months ago

@cowtowncoder ready for review

cowtowncoder commented 7 months ago

Will also increase default size from 100 to 200, see #1257. Seems rather low, for global cache. Possibly relevant for abnormal cases of no/low symbol-table reuse.

zshao9 commented 6 months ago

@cowtowncoder @pjfanning When will we make a release that includes this patch? This is a crucial perf improvement that we are looking forward to!

cowtowncoder commented 6 months ago

@zshao9 Original intent was to add this in 2.18.0; but since it appears safe, I think we could consider backporting into 2.17.1.

But I am curious as to whether this really is important: have you measured its impact with load testing? My experience has been that finding this code path as a hot spot tends to indicate a different problem altogether -- specifically, lack of reuse of JsonFactory and/or ObjectMappers.

cowtowncoder commented 6 months ago

@pjfanning WDYT? If we think this is a good patch (I think so) and safe, maybe it should be backported into 2.17.1, which I hope to release next week.

pjfanning commented 6 months ago

v2.17.1 already has a lot of changes. Without a description of how this is actually affecting anyone in a serious way, I think we should wait till v2.18.0.

cowtowncoder commented 6 months ago

@pjfanning Yes, valid point. Will wait for actual supporting evidence before considering backporting.

zshao9 commented 6 months ago

@cowtowncoder @pjfanning Our main use case is in an Apache Spark executor where 32 threads (on 32 vcores) can be busy parsing large JSON documents with Spark's UDF called GET_JSON_OBJECT, which internally uses jackson-core. We have seen that 31 out of 32 threads are waiting for the lock while 1 thread is holding it! I would say this is a huge performance issue for us.

Reference: https://issues.apache.org/jira/browse/SPARK-47959

cowtowncoder commented 6 months ago

@zshao9 Ok, but more importantly, is there a way to validate that the change would help? Maybe you could test with a 2.18.0-SNAPSHOT of jackson-core that has it (build locally, or use Sonatype OSS repo's auto-built snapshots)? Without some form of validation this is a speculative fix, and as such not worth backporting into a patch update (but good enough for a minor version, as we do not think there are negative effects).

But equally importantly: this could be a sign of lack of reuse of ObjectMapper (most common) -- or directly not reusing JsonFactory. In both cases the problem is that the underlying symbol table is not getting reused (reuse would prevent the need to re-intern() Strings once decoded from byte[] or char[]); but since InternCache is global, it can get contention. Reducing that contention, however, is only a small part of the significant performance gain from properly reusing the underlying JsonFactory (or, more specifically, the symbol table it uses), either directly or indirectly via ObjectMapper.

zshao9 commented 6 months ago

@vadim has helped us disable Jackson field-name interning, and the problem disappeared entirely.

> But equally importantly: this could be a sign of lack of reuse of ObjectMapper (most common) -- or directly not reusing JsonFactory.

I am not familiar with the logic that decides which keys to cache and intern. What if the keys in the JSON object are non-repeating UUID strings?

cowtowncoder commented 6 months ago

@zshao9 The setting is global to JsonFactory, so the only decision is whether to canonicalize and/or intern everything, or not.

Since you are having performance problems, I would recommend first turning off canonicalization and seeing how that affects performance, and then turning off interning and doing the same. Sounds like you have done the latter. But in addition to disabling interning, you might want to also try disabling canonicalization; that might be an even better option.
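For reference, both settings are `JsonFactory.Feature` flags. A minimal sketch of disabling them (the helper class name is hypothetical; requires jackson-core on the classpath):

```java
import com.fasterxml.jackson.core.JsonFactory;

// Sketch of the two settings under discussion (hypothetical helper class;
// needs jackson-core on the classpath). Both flags are JsonFactory.Feature
// names; disabling interning bypasses InternCache entirely.
public class InternSettings {
    public static JsonFactory makeFactory() {
        JsonFactory f = new JsonFactory();
        // stop String.intern()ing field names (avoids InternCache contention)
        f.disable(JsonFactory.Feature.INTERN_FIELD_NAMES);
        // optionally also skip symbol-table canonicalization of field names
        f.disable(JsonFactory.Feature.CANONICALIZE_FIELD_NAMES);
        return f;
    }
}
```

Interning only applies to canonicalized names, so disabling canonicalization alone also stops interning; disabling them independently lets you measure each effect.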

Basically there are 2 main possibilities as to why you are having issues with concurrency:

  1. Your symbol tables are not being reused -- either JsonFactory is not reused, or JsonParsers are not being close()d (they are automatically closed when reading reaches the end-of-input; otherwise you need to call parser.close()).
  2. The number of symbols in documents is huge -- the symbol table maximum size is about 6,000 -- and they do not repeat (at all, or not a lot: UUIDs are unique, for example).

In the first case, your real problem would be solved by reuse or by ensuring parsers are closed, and NOT by changing interning/canonicalization. In the second case, you would benefit from changing these settings.

I have yet to see a case where lock contention in InternCache was not due to one of (1) or (2): it really is usually a symptom.
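A sketch of what case (1) looks like when fixed (hypothetical helper class; requires jackson-core on the classpath): one shared JsonFactory for the whole application, with parsers always closed via try-with-resources so the symbol table is recycled:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

// Sketch of symbol-table-friendly usage (hypothetical class; needs
// jackson-core on the classpath): reuse a single JsonFactory and always
// close parsers, so decoded field names rarely reach InternCache at all.
public class ReuseSketch {
    // one shared factory (or one shared ObjectMapper) for the whole application
    private static final JsonFactory FACTORY = new JsonFactory();

    static int countFields(String json) throws Exception {
        // try-with-resources guarantees close(), which recycles the symbol table
        try (JsonParser p = FACTORY.createParser(json)) {
            int fields = 0;
            while (p.nextToken() != null) {
                if (p.currentToken() == JsonToken.FIELD_NAME) {
                    fields++;
                }
            }
            return fields;
        }
    }
}
```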

vadim commented 6 months ago

Hey @zshao9 and @cowtowncoder -- As @JoshRosen mentioned here, it does not seem like the option to disable canonicalization is present in jackson-core==2.15.2. I believe his experiments showed no effect when toggling this on 2.15.2, while disabling interning removed the bottleneck faced by @zshao9. Any thoughts here on versioning?

cowtowncoder commented 6 months ago

@vadim Both options have been around for a very long time, without many changes. So I don't think the Jackson version should matter a lot here.

The difference between disabling canonicalization (no help?) and disabling interning (helps) is interesting -- not sure when/how that would happen. Either way, symbol table use is likely not working optimally for this use case.

I would then go with disabling interning, since that helps. The improvement discussed here will go into 2.18.0, but in general it is unlikely to be super important compared to the other options (IMO).