Closed: pjfanning closed this 7 months ago
I am fine with this if there's something to suggest performance is better (even if just for newer JVMs).
> I am fine with this if there's something to suggest performance is better (even if just for newer JVMs).
@cowtowncoder so you need micro benchmarks to prove this?
@pjfanning I'd be fine with tests done for similar use case somewhere else; article etc. Just in case there are counter-examples. Or more like sanity check so we aren't just making things worse with no benefit.
So no, I don't absolutely require one specifically for this change.
Although... if easy enough, it'd of course be great and this case might be easy enough.
@cowtowncoder I can guarantee that you can write use cases where you can get worse performance (older JDKs - and probably also in newer JDKs if you don't use virtual threads) and better performance (virtual threads in latest JDKs). I think it is fairly pointless. We need to make a decision whether we want to make VirtualThread users happy while imposing a small penalty on all other users - or whether we want to code so that everyone is happy by making this configurable.
@pjfanning Ok I was hoping for "tends to be no-slower and possibly faster on newer JVMs" kind of general results. Or even just "not meaningfully slower", given it's useful for virtual threads.
This particular thing is difficult to wire so I am not really looking forward to doing that.
With that... yeah, a jmh-based test for 2 cases would be great, perhaps for https://github.com/FasterXML/jackson-benchmarks/ with whatever number of threads (100?), interning a repeating set of Strings.
We don't have to decide on this right today either, can keep this open to discuss etc. I only hesitate because this is potential hot spot for performance.
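A minimal contention harness along the lines of the benchmark suggested above (plain threads rather than JMH; the class, constants, and sizes below are illustrative stand-ins, not Jackson's actual `InternCache` code) could look like:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for InternCache: a ConcurrentHashMap-backed intern
// cache, cleared via tryLock() when it exceeds MAX_ENTRIES.
public class InternCacheSketch extends ConcurrentHashMap<String, String> {
    public static final int MAX_ENTRIES = 200;
    private final ReentrantLock lock = new ReentrantLock();

    public String intern(String input) {
        String result = get(input);
        if (result != null) {
            return result;
        }
        if (size() >= MAX_ENTRIES && lock.tryLock()) {
            // only the thread that wins the lock clears; others skip
            try {
                if (size() >= MAX_ENTRIES) {
                    clear();
                }
            } finally {
                lock.unlock();
            }
        }
        result = input.intern();
        put(result, result);
        return result;
    }

    public static void main(String[] args) throws Exception {
        InternCacheSketch cache = new InternCacheSketch();
        ExecutorService pool = Executors.newFixedThreadPool(5);
        long start = System.nanoTime();
        for (int t = 0; t < 5; t++) {
            pool.submit(() -> {
                for (int i = 0; i < 100_000; i++) {
                    cache.intern("key" + (i % 500)); // 500 distinct keys force periodic clears
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("5 threads x 100k interns: "
                + (System.nanoTime() - start) / 1_000_000 + " ms, final size=" + cache.size());
    }
}
```

Swapping the `tryLock()` branch for a `synchronized` block (or a blocking `lock()`) would give the comparison case; a real measurement should still go through JMH to avoid warm-up and dead-code pitfalls.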
Wow. This thing ("ReentrantLock vs synchronized") has been around for a while, so many hits from 10 years ago :)
Hmmh. Not much useful info in articles like https://medium.com/javarevisited/synchronized-vs-reentrantlock-how-to-choose-cfb5306080e7 which basically suggests synchronized is fine except for heavily-contested cases.
Which may or may not be the case here -- in the normal use case InternCache is only called when symbol table lookup fails, which is a sign of poor JsonFactory reuse.
(It is also called for JsonDeserializers by databind -- but similarly, deserializers are cached so it should be light load.)
Then again, thinking out aloud: since it only matters in high-contention case, and rarely for "normal" (good) usage, there is little downside to doing this. So I am leaning towards merging this change: we do get reports of high lock contention cases from time to time.
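For context on the trade-off being discussed: functionally the guarded clear is the same either way, but `synchronized` pins the carrier thread when entered from a virtual thread (before JDK 24 / JEP 491), while `ReentrantLock` does not, and additionally offers `tryLock()`. A sketch with hypothetical names:

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative comparison of the two locking styles; not Jackson's actual code.
public class LockStyles {
    private final ReentrantLock lock = new ReentrantLock();
    private int entries = 250;

    // synchronized: simple, but pins the carrier thread when called from a
    // virtual thread on JDKs before 24 (JEP 491 removed the pinning)
    synchronized void clearSynchronized() {
        entries = 0;
    }

    // ReentrantLock: virtual-thread friendly, and tryLock() lets a thread
    // skip the clear entirely if another thread is already doing it
    boolean tryClear() {
        if (lock.tryLock()) {
            try {
                entries = 0;
                return true;
            } finally {
                lock.unlock();
            }
        }
        return false; // someone else holds the lock; skip
    }

    int entries() {
        return entries;
    }
}
```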
I've done some testing in my own jmh project - https://github.com/pjfanning/jackson-nest-bench/blob/main/src/jmh/java/org/example/jackson/bench/InternCacheBench.java
With 5 threads, I'm finding that InternCache with ReentrantLock calls, as in this PR, is slower than synchronized.
I did make a change to use tryLock instead, so that if another thread has the lock, the current thread skips the clear section. This runs faster. It does mean that under heavy contention the cache can grow past the max size, but before too long some thread will get the lock and clear the cache back down.
It would take a pretty extraordinary set of circumstances for the cache size to grow far past the max size. If this were a real worry, we could do something like this (I don't think the extra complexity is warranted):
```java
// GUIDELINE_MAX_ENTRIES = 180
// ABSOLUTE_MAX_ENTRIES = 200
// In new code, we should never get more than ABSOLUTE_MAX_ENTRIES; once we reach
// GUIDELINE_MAX_ENTRIES we try to clear(), but won't block waiting for the lock
// unless we've reached ABSOLUTE_MAX_ENTRIES
if (size() >= GUIDELINE_MAX_ENTRIES) {
    if (_lock.tryLock()) {
        try {
            if (size() >= GUIDELINE_MAX_ENTRIES) {
                clear();
            }
        } finally {
            _lock.unlock();
        }
    } else if (size() >= ABSOLUTE_MAX_ENTRIES) {
        _lock.lock();
        try {
            if (size() >= GUIDELINE_MAX_ENTRIES) {
                clear();
            }
        } finally {
            _lock.unlock();
        }
    }
}
```
Yeah no, let's keep things simple. Thank you for checking out performance aspects with a jmh benchmark.
I think I'm ok with slower operation for low-concurrency cases.
One ask: could you add an entry in release-notes/VERSION-2.x for this change (ok to refer to the PR)?
@cowtowncoder this is the simpler change that I would like to make to this PR. It can let the cache grow a little bigger than the max, but it is faster than the existing change in this PR:
```java
if (size() >= MAX_ENTRIES) {
    if (lock.tryLock()) {
        try {
            if (size() >= MAX_ENTRIES) {
                clear();
            }
        } finally {
            lock.unlock();
        }
    }
}
```
@pjfanning yes that's fine; max size enforcement can be approximate, and if this is faster for common case, good.
Just let me know when this is ready (I guess once converted to non-Draft), will merge.
@cowtowncoder ready for review
Will also increase default size from 100 to 200, see #1257. Seems rather low for a global cache. Possibly relevant for abnormal cases of no/low symbol-table reuse.
@cowtowncoder @pjfanning When will we make a release that includes this patch? This is a crucial perf improvement that we are looking forward to!
@zshao9 Original intent was to add this in 2.18.0; but since it appears safe, I think we could consider backporting into 2.17.1.
But I am curious as to whether this really is important: have you measured its impact with load testing? My experience has been that finding this path in code as a hot spot tends to indicate a different problem altogether -- specifically, lack of reuse of JsonFactory and/or ObjectMappers.
@pjfanning WDYT? If we think this is good patch (I think so) and safe, maybe it should be backported in 2.17.1, which I hope to release next week.
v2.17.1 already has a lot of changes. Without an actual description of why this is actually affecting anyone in a serious way, I think we should wait till v2.18.0.
@pjfanning Yes, valid point. Will wait for actual supporting evidence before considering backporting.
@cowtowncoder @pjfanning Our main use case is in an Apache Spark executor where 32 threads (on 32 vcores) can be busy parsing large JSON documents with Spark's UDF GET_JSON_OBJECT, which internally uses jackson-core. We have seen that 31 out of 32 threads are waiting for the lock while 1 thread is holding it! I would say this is a huge performance issue for us.
Reference: https://issues.apache.org/jira/browse/SPARK-47959
@zshao9 Ok but more importantly, is there a way to validate that the change would help? Maybe could test with 2.18.0-SNAPSHOT of jackson-core that has it? (build locally or use Sonatype OSS repo's auto-built). Without some form of validation this is speculative fix and as such not worth backporting into patch update (but good enough for minor version as we do not think there are negative effects).
But equally importantly: this could be a sign of lack of reuse of ObjectMapper (most common) -- or directly not reusing JsonFactory. In both cases the problem is that the underlying symbol table is not getting reused (reuse would prevent the need to re-intern() Strings once decoded from byte[] or char[]); but since InternCache is global, that can get contention.
But reducing that contention is only a small part of the significant performance gain of properly reusing the underlying JsonFactory (or more specifically, the symbol table it uses); either directly or indirectly via ObjectMapper.
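To illustrate the reuse point (a sketch, not taken from any particular codebase): a configured ObjectMapper is thread-safe for read operations, so a single shared instance lets its JsonFactory's symbol table be reused across parses.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class MapperReuse {
    // Good: one ObjectMapper shared by all threads; its JsonFactory's symbol
    // table gets reused, so repeated field names rarely reach the global InternCache
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static JsonNode parseGood(String json) throws Exception {
        return MAPPER.readTree(json);
    }

    // Bad: a fresh ObjectMapper (and JsonFactory) per call throws away the
    // symbol table, so every field name misses and hits InternCache
    public static JsonNode parseBad(String json) throws Exception {
        return new ObjectMapper().readTree(json);
    }
}
```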
@vadim has helped us to disable Jackson Field Name Interning, and the problem disappeared entirely.
> But equally importantly: this could be a sign of lack of reuse of ObjectMapper (most common) -- or directly not reusing JsonFactory.
I am not familiar with the logic that decides which keys to cache and intern. What if the keys in the JSON object are non-repeated UUID strings?
@zshao9 The setting is global to JsonFactory, so your only decision is whether to canonicalize and/or intern everything or not.
Since you are having performance problems, I would recommend first turning off canonicalization and seeing how that affects performance; and then turning off interning and doing the same. Sounds like you have done that. But in addition to disabling interning, you might want to also try disabling canonicalization; that might be an even better option.
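Both switches are JsonFactory.Feature constants; a sketch of toggling them (builder style, available since jackson-core 2.10):

```java
import com.fasterxml.jackson.core.JsonFactory;

public class FactoryConfig {
    // Disable interning only: names are still canonicalized in the symbol
    // table, but String.intern() (and InternCache) is no longer used
    static JsonFactory noInterning() {
        return JsonFactory.builder()
                .disable(JsonFactory.Feature.INTERN_FIELD_NAMES)
                .build();
    }

    // Disable canonicalization: no symbol table lookup at all, which also
    // makes the interning setting irrelevant
    static JsonFactory noCanonicalization() {
        return JsonFactory.builder()
                .disable(JsonFactory.Feature.CANONICALIZE_FIELD_NAMES)
                .build();
    }
}
```

The factory can then be passed to an ObjectMapper constructor so all parsing through that mapper uses the chosen settings.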
Basically there are 2 main possibilities as to why you are having issues with concurrency:
1. JsonFactory is not reused, or
2. JsonParsers are not being close()d (they are closed automatically if you read to the end-of-input; otherwise you need to call parser.close()).
In the first case, your real problem would be solved by reuse, or by ensuring parsers are closed, and NOT by changing interning/canonicalization. In the second case, you would benefit from changing these settings.
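The closing point above can be sketched with try-with-resources (JsonParser is AutoCloseable); the factory is deliberately a single shared instance so its symbol table carries over between parses:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;

public class ParserClose {
    // Reuse one factory across all parses; symbol table state is recycled
    // when each parser is closed
    private static final JsonFactory FACTORY = new JsonFactory();

    // try-with-resources guarantees close() even if parsing throws
    static int countTokens(String json) throws Exception {
        int count = 0;
        try (JsonParser p = FACTORY.createParser(json)) {
            while (p.nextToken() != null) {
                count++;
            }
        }
        return count;
    }
}
```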
I have yet to see a case where locking in InternCache is not based on one of (1) or (2): it really is just a symptom, usually.
Hey @zshao9 and @cowtowncoder -- As @JoshRosen mentioned here, it does not seem like the option to disable canonicalization is present in jackson-core==2.15.2. I believe his experiments showed no effect when toggling this on 2.15.2, while disabling interning removed the bottleneck faced by @zshao9. Any thoughts here on versioning?
@vadim Both options have been around for a very long time, without many changes. So I don't think the Jackson version should matter a lot here.
The difference between disabling canonicalization (no help?) and disabling interning (helps) is interesting -- not sure when/how that would happen. Either way, symbol table use is likely not working optimally for this use case.
I would then go with disabling interning since that helps. The improvement listed here will go in 2.18.0, but in general it is unlikely to be super important compared to the other options (IMO).
If we are going all-in on ReentrantLocks -- I would still like each change to be in its own PR, so that each change gets looked at on its own merits.