FasterXML / jackson-core

Core part of Jackson that defines Streaming API as well as basic shared abstractions
Apache License 2.0

Be ready for Virtual Thread: avoid ThreadLocal pooling #919

Closed · franz1981 closed this 11 months ago

franz1981 commented 1 year ago

Hi team.

This has been raised because, as Jackson users (i.e. https://github.com/quarkusio/quarkus, with existing (experimental) support for https://quarkus.io/guides/virtual-threads), we would like to help fix any integration issues with this technology.

Based on https://openjdk.org/jeps/425 and https://openjdk.org/jeps/429, regarding thread-local pooling:

There are a few scenarios that favor thread-local variables. An example is caching objects that are expensive to create and use, such as instances of java.text.DateFormat. Notoriously, a DateFormat object is mutable, so it cannot be shared between threads without synchronization. Giving each thread its own DateFormat object, via a thread-local variable that persists for the lifetime of the thread, is often a practical approach.

In short, https://openjdk.org/jeps/429 warns against using ThreadLocals as a pooling pattern, and the Scoped Values mechanism won't save the day, nor will it be able to replace such a mechanism.

That said, where does Jackson exhibit this problem? Let me attach some data:

[flamegraph image]

In the above flamegraph, a lot of cycles are spent in https://github.com/FasterXML/jackson-core/blob/390472bbdf4722fe058f48bb0eff5865c8d20f73/src/main/java/com/fasterxml/jackson/core/util/BufferRecyclers.java#L67 when called from a virtual thread, which forces both the ThreadLocal allocation and the subsequent pooled buffer allocations.

The common usage scenario for virtual threads is NOT to cache them (the Loom team advises against it) and to have "a lot" of short-lived ones; this means Jackson will end up creating tons of useless garbage that won't be reused, defeating the purpose and benefit of pooling.

What solution exists here?

  1. in the case of virtual-thread encoding, use a proper pool that is decent enough to deal with high contention, instead of the usual thread-local one
  2. the JDK makes use of a "hidden" feature for cases like these, i.e. a per-carrier-thread cache

For 1 we can make use of the JCTools MPMC queues, designed to behave very well under contention (see https://github.com/JCTools/JCTools/wiki/Getting-Started-With-JCTools#example-a-simple-object-pool for more info); still, if we care about being friendly under contention, we could use a striped approach a-la Striped64, or something simpler, e.g. an array of atomic references on which to perform getAndSet or compareAndSet (distributing the first slot choice with Knuth multiplicative hashing + xorshifts); a rough sketch of the latter follows.
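
A minimal sketch of that striped, array-of-atomic-references idea; the class, method names and sizing below are purely illustrative, not anything from jackson-core or JCTools:

    import java.util.concurrent.atomic.AtomicReferenceArray;
    import java.util.function.Supplier;

    // Striped pool sketch: each borrower hashes its thread id to a starting slot,
    // so concurrent borrowers tend to hit different AtomicReference slots.
    final class StripedPool<T> {
        private final AtomicReferenceArray<T> slots;
        private final Supplier<T> factory;

        StripedPool(int stripes, Supplier<T> factory) {
            this.slots = new AtomicReferenceArray<>(stripes);
            this.factory = factory;
        }

        private int slotIndex() {
            // Knuth-style multiplicative hash of the thread id, plus an xorshift
            // to mix the high bits in (threadId() needs Java 19+; getId() otherwise).
            long h = Thread.currentThread().threadId() * 0x9E3779B97F4A7C15L;
            h ^= (h >>> 32);
            return Math.floorMod((int) h, slots.length());
        }

        T borrow() {
            T pooled = slots.getAndSet(slotIndex(), null); // claim whatever the slot holds, if anything
            return pooled != null ? pooled : factory.get();
        }

        void release(T resource) {
            // best effort: if the slot is already occupied, the resource is simply left to the GC
            slots.compareAndSet(slotIndex(), null, resource);
        }
    }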

For 2, I suggest raising the problem (I can help if needed) on the loom-dev list and asking whether the JDK team plans to open up that nice (and dangerous) API in some better and less risky form, to allow ThreadLocal-pooling users to keep using the same exact mechanism.

franz1981 commented 1 year ago

Each of the 2 proposed solutions has its own (non-critical) problems anyway:

  1. a subsequent borrow on the same virtual thread will perform a fresh new resource borrowing, because the previously acquired one cannot be stored in any context (unless there's an explicit one; that's the purpose of Scoped Values as context storage)
  2. given that virtual threads can switch carrier during their life cycle, a resource borrowed from carrier A can be released to another carrier B, making resources "unbalanced"

pjfanning commented 1 year ago

Yes, ThreadLocals will need to go. The problem is work prioritisation. Maybe Quarkus users would be better advised to try other JSON libs that are further ahead of the curve on Loom support. Loom users will need to recognise that these old projects weren't written with Loom in mind.

geoand commented 1 year ago

I think it would be in most people's interest (certainly our Quarkus users who love Jackson) to work together to come up with a solution instead of having to settle for some lesser library

franz1981 commented 1 year ago

@pjfanning thanks for the quick feedback; we're very happy Jackson users, and maybe there are solutions that won't involve changes to the current priority management of this project, e.g. exposing an SPI for the buffer pooling to let users provide their own (or disable it).

pjfanning commented 1 year ago

There's https://github.com/FasterXML/jackson-core/issues/835 - if someone wants to do a PR for that that also takes this into account, that would be great. I would not read too much into the 2.15 label on that issue because we have users crying out for a 2.15 release and not much time to squeeze extra items into that release. And a 2.16 release would probably not happen for quite some time after that.

pjfanning commented 1 year ago

In the meantime, has any Loom fan tried playing with the JsonFactory.Feature.USE_THREAD_LOCAL_FOR_BUFFER_RECYCLING setting - i.e. disabling this feature?
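
For reference, a minimal sketch of disabling that feature on a Jackson 2.x JsonFactory; the wrapper class and method name below are illustrative only:

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.databind.ObjectMapper;

    // Build a JsonFactory with ThreadLocal-based buffer recycling turned off,
    // then wrap it in an ObjectMapper (Jackson 2.x).
    public class NoThreadLocalRecycling {
        public static ObjectMapper mapper() {
            JsonFactory factory = JsonFactory.builder()
                    .disable(JsonFactory.Feature.USE_THREAD_LOCAL_FOR_BUFFER_RECYCLING)
                    .build();
            return new ObjectMapper(factory);
        }
    }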

franz1981 commented 1 year ago

@pjfanning

tried playing with the JsonFactory.Feature.USE_THREAD_LOCAL_FOR_BUFFER_RECYCLING setting - i.e. disabling this feature?

TBH I was personally too scared to do it eheh, but it's a great suggestion and it can be a no-brainer workaround in the meantime - although it would just save creating the thread local, while keeping the same problem, i.e. allocating big buffer(s) based on the default sizes, see https://github.com/FasterXML/jackson-core/blob/62c1ef923703654d202b206f345898ed4f6b640a/src/main/java/com/fasterxml/jackson/core/util/BufferRecycler.java#L76-L77

The other problem is that, given it's a wide system-level setting, it will impact everyone; as Netty users we can have mixed types of threads using Jackson, and it will hit the I/O threads too, which are plain "platform" threads (just thinking out loud).

geoand commented 1 year ago

@franz1981 do you need help setting that in Quarkus?

geoand commented 1 year ago

The other problem is that, given it's a wide system-level setting, it will impact everyone; as Netty users we can have mixed types of threads using Jackson, and it will hit the I/O threads too, which are plain "platform" threads (just thinking out loud).

In Quarkus we can likely limit the blast radius of this by using separate ObjectMapper objects if needed.

cowtowncoder commented 1 year ago

Yes, this is a sort of expected problem, but a tricky one to resolve, esp. wrt configurability. A pluggable buffer recycling implementation would make sense, and wrt timelines rather sooner than later (so definitely in 2.x). I would be happy to help with this, but may not have time to look into it myself very soon (despite obviously considering it important).

I would definitely suggest trying JsonFactory.Feature.USE_THREAD_LOCAL_FOR_BUFFER_RECYCLING in the meantime.

@franz1981 Just to make sure: this is not really a system-wide setting, but a per-JsonFactory one (which then is per-ObjectMapper), with all the pros and cons of that. I have not added use of System properties due to issues with global scope and use by different frameworks; Netty is a good example.

pjfanning commented 1 year ago

Ultimately, the use of ThreadLocal here is a form of resource pooling and we could use a pool implementation explicitly instead. Thing is, pool implementations are not necessarily very efficient themselves. Anyone have a suggestion about low latency object pool implementations?

cowtowncoder commented 1 year ago

Aside from legitimate questions about buffer pooling and recycling, there's the question of how to interface this with the existing JsonFactory/JsonParser/JsonGenerator implementations.

Looking at the existing recycling, the core abstraction really is a single BufferRecycler, accessed by JsonFactory via BufferRecyclers; BufferRecyclers has the ThreadLocal. This would not necessarily be a problem if we refactored things a bit -- except for one major problem: there is no release call for BufferRecycler instances, only for the actual buffers. This works ok with the traditional threading model (since BufferRecycler instances are per-thread, so references held by JsonParser/JsonGenerator do not live past their life-cycle), but would not work if the thread-per-BufferRecycler coupling is removed.

And it seems that trying to build a life-cycle for BufferRecycler is the tricky part: the reason there is no release call is both because it is not needed (as I said, the next parser/generator that asks for it is on the same thread) and because figuring out when all reuse is done is not easy: calling code doesn't really have to close() JsonParser or JsonGenerator (at least for the purposes of buffer recycling).

Then again, maybe requiring BufferRecycler release from close() would not be a big deal. It'd probably also mean that BufferRecycler instances accessed via a centralized pooling mechanism would need to keep references to their "owning" pool so that they can indicate being released.

Maybe I should try figuring out how to do this refactoring after all.

franz1981 commented 1 year ago

@pjfanning

Thing is, pool implementations are not necessarily very efficient themselves. Anyone have a suggestion about low latency object pool implementations?

Yep, I'm a developer of the JCTools library (see https://github.com/JCTools/JCTools/blob/master/pom.xml#L42), which implements a few multi-producer multi-consumer queues that could be used as a "simple" shared pool (the other option is a ConcurrentLinkedQueue, but that is very memory-hungry and slow in comparison). You can check an independent benchmark at https://vmlens.com/articles/scale/scalability_queue/ and, if you're worried about the dependency, given that I'm a developer (and inventor) of such algorithms, I have no problem implementing a specialized version just for Jackson. That said, usually such solutions are not "wow" because a highly contended tail/head is not great and it's not easy to enforce global memory usage limits, but if you're interested I can propose something ad hoc in the next week that provides the best of the 2 worlds.
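
A sketch of the "MPMC queue as a simple shared pool" idea, assuming the org.jctools:jctools-core dependency; the wrapper class below is illustrative, not part of JCTools or jackson-core:

    import java.util.function.Supplier;
    import org.jctools.queues.MpmcArrayQueue;

    // Bounded multi-producer multi-consumer queue used as a best-effort pool.
    final class MpmcPool<T> {
        private final MpmcArrayQueue<T> queue;
        private final Supplier<T> factory;

        MpmcPool(int capacity, Supplier<T> factory) {
            this.queue = new MpmcArrayQueue<>(capacity); // capacity also bounds the retained memory
            this.factory = factory;
        }

        T borrow() {
            T pooled = queue.poll();            // non-blocking; null when the pool is empty
            return pooled != null ? pooled : factory.get();
        }

        void release(T resource) {
            queue.offer(resource);              // bounded: the resource is dropped if the pool is full
        }
    }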

The other option, instead, is to raise it with loom-dev (in a kind and respectful way), because "they" have the right abstraction for this (mentioned in https://github.com/FasterXML/jackson-core/issues/919#issue-1576266381), i.e. the per-carrier-thread cache. The beauty of providing a generic buffer pool API is that, when/if this nice impl from the JDK ever becomes available, the replacement will be immediate (and it will replace the ThreadLocal one too, because it is a "transparent" thread local, ignoring that the caller is a virtual thread).

@cowtowncoder

It'd probably also mean that BufferRecycler instances accessed via a centralized pooling mechanism would need to keep references to their "owning" pool so that they can indicate being released.

I cannot find a central point that states a BufferRecycler can be safely released/recycled (IOContext borrows a single instance, meaning that IOContext is a non-thread-safe, thread-confined instance IIUC); IOContext could keep track of acquire/release of the individual buffers and decide to fully release the recycler once all the already-borrowed ones are released, but that seems complex and not "natural".

cowtowncoder commented 1 year ago

@franz1981 The way release would have to happen, I think, is with close() of the JsonParser/JsonGenerator instance: clearing the local BufferRecycler first, then calling a recycler.release() method. The assumption that the buffers themselves are given back first holds, and has been the case for years now (i.e. this has not ever caused an issue, to the best of my knowledge).

IOContext is a dumb container used to pass the recycler to the parser/generator and doesn't really have control over the life-cycle. Whether it should be used to pass the instance is a good question; I'll just note that's what happens as of now (it was probably convenient at some point, avoiding passing more arguments to parser/generator constructors).
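
A purely hypothetical sketch of that release-from-close() life-cycle; none of these type or method names are real jackson-core API:

    // A pool the recycler remembers as its "owner", so close() can hand it back.
    interface RecyclerSource {
        PooledRecycler acquire();
        void release(PooledRecycler recycler);
    }

    final class PooledRecycler {
        private final RecyclerSource owner;

        PooledRecycler(RecyclerSource owner) { this.owner = owner; }

        void releaseToOwner() { owner.release(this); }
    }

    // A parser/generator would acquire the recycler at construction time and
    // release it exactly once from close().
    abstract class PooledJsonGenerator implements AutoCloseable {
        private PooledRecycler recycler;

        protected PooledJsonGenerator(RecyclerSource source) {
            this.recycler = source.acquire();
        }

        @Override
        public void close() {
            if (recycler != null) {        // guard against double close()
                recycler.releaseToOwner();
                recycler = null;
            }
        }
    }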

Given just a bit of time I could refactor things bit by bit for 2.15. But... time is limited alas :)

magicprinc commented 1 year ago

Another alternative: https://github.com/DanielYWoo/fast-object-pool, the last still-living object pool for Java.

It has an advantage: it can validate objects (similar to what JDBC connection pools do).

magicprinc commented 1 year ago

@franz1981 BTW Quarkus has its own HikariCP killer: the Agroal JDBC connection pool, https://agroal.github.io/

Is it possible to extract the core pool from it?

cowtowncoder commented 1 year ago

Just to make sure: I do not think we want to include anything complicated or sizable within jackson-core. Instead these should be add-ons, plugged in via the usual extension mechanism. And such a plug-in could just use regular dependencies on an object pool library/framework.

It looks like Fast Object Pool might be a good base for such an extension.

geoand commented 1 year ago

@magicprinc the question about Agroal can be answered by @barreiro

barreiro commented 1 year ago

@franz1981 BTW Quarkus has its own HikariCP killer: the Agroal JDBC connection pool, https://agroal.github.io/

Is it possible to extract the core pool from it?

No, it's not possible. Agroal is a specialized JDBC pool; it doesn't abstract the pooling away, in order to obtain the best performance.

magicprinc commented 1 year ago

BTW, how has Spring 6 solved this problem? They had a lot of things in ThreadLocal, e.g. transaction processing (the active transaction), HttpRequest/Response, and the Spring context itself (in a web environment).

magicprinc commented 1 year ago

@cowtowncoder It would be easier to make such add-ons if all ThreadLocals were encapsulated in one Context class holding all the cached objects, and then a user could choose: 1) a ThreadLocal add-on to store the Context, 2) a Vert.x add-on which can store it in the Vert.x reactive context, 3) an object pool, 4) JCTools' MPMC queue, 5) https://github.com/EsotericSoftware/kryo 's Pool. A rough sketch of the idea follows.
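
A hypothetical sketch of that "single Context with pluggable storage" idea; none of these names exist in jackson-core, and the buffer fields are just examples of cached state:

    import java.util.function.Supplier;

    // All formerly ThreadLocal state lives in one Context...
    final class Context {
        final byte[] readIOBuffer = new byte[8000];
        final char[] charBuffer = new char[4000];
    }

    // ...and how a Context is obtained/stored becomes pluggable:
    // ThreadLocal-backed, Vert.x-context-backed, pool-backed, etc.
    interface ContextStorage {
        Context acquire(Supplier<Context> factory);
        void release(Context context);
    }

    final class ThreadLocalContextStorage implements ContextStorage {
        private final ThreadLocal<Context> cache = new ThreadLocal<>();

        @Override
        public Context acquire(Supplier<Context> factory) {
            Context c = cache.get();
            if (c == null) {
                c = factory.get();
                cache.set(c);
            }
            return c;
        }

        @Override
        public void release(Context context) {
            // nothing to do: the Context stays attached to the thread
        }
    }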

cowtowncoder commented 1 year ago

FWIW we do have a non-ThreadLocal buffer recycler pool now, although it needs some final tweaks (wrt #1106). The default for 2.16 at least will remain the ThreadLocal-backed one, but beyond that we could consider changing the default. And frameworks are obviously free to re-configure defaults as they see fit starting with 2.16 (but may be limited wrt the range of Jackson versions supported).
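
For frameworks that do want to switch away from the ThreadLocal default, a sketch of plugging a non-default pool into the builder; the JsonRecyclerPools factory-method name used here is recalled from the 2.16 API and should be verified against the actual release:

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.util.JsonRecyclerPools;
    import com.fasterxml.jackson.databind.ObjectMapper;

    // Configure a non-ThreadLocal recycler pool on the JsonFactory (Jackson 2.16+).
    public class CustomRecyclerPoolConfig {
        public static ObjectMapper mapper() {
            JsonFactory factory = JsonFactory.builder()
                    .recyclerPool(JsonRecyclerPools.newLockFreePool())
                    .build();
            return new ObjectMapper(factory);
        }
    }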

cowtowncoder commented 1 year ago

@cowtowncoder It would be easier to make such add-ons if all ThreadLocals were encapsulated in one Context class holding all the cached objects, and then a user could choose:

1. ThreadLocal add-on to store Context

2. Vert.x add-on which can store it in Vert.x Reactive Context

3. Object Pool

Already adding pluggability for 2.16; there is only one use of ThreadLocal in jackson-core. Usage is not widespread, although I think one or two dataformats (Smile, I think) have limited additional usage. In general I am not a fan of ThreadLocal usage, so it's a somewhat bounded problem.

magicprinc commented 1 year ago

@franz1981 Are there any benchmarks: MPMC Queue -vs- https://github.com/DanielYWoo/fast-object-pool -vs- same fast-object-pool but with disruptor (an available option)?

PS: I mention fast-object-pool because it is the last actively maintained generic Java object pool (recent commits on GitHub) and it can use the Disruptor.

I know two other pools, but they are not separate projects; they are classes inside their respective main projects: Kryo's Pool and Vert.x's pool.

magicprinc commented 1 year ago

Sorry for the misinformation about Spring.

They still use ThreadLocal (e.g. https://github.com/spring-projects/spring-framework/blob/main/spring-tx/src/main/java/org/springframework/transaction/support/TransactionSynchronizationManager.java ), BUT they use ThreadLocal not as a cache, but as a context or "session".

So: Thread.start → startTransaction → session created → obj1.methodA → obj2.methodB → commit Transaction → session closed → thread.stop

No difference for platform thread or virtual thread.

So, this is a completely different story :-(

franz1981 commented 1 year ago

@magicprinc

Are there any benchmarks: MPMC Queue -vs- https://github.com/DanielYWoo/fast-object-pool -vs- same fast-object-pool but with disruptor (an available option)?

Users of Quarkus (this is not the right place to discuss that, so please raise the question on the Quarkus GitHub repo) can customize Jackson, which will include, once the new release with the custom pool is out, the ability to use the pool they prefer.

If it were my call/choice, I still wouldn't pick fast-object-pool, because:

That said, @mariofusco is using a proper JMH microbenchmark to evaluate different options, including the ones currently offered by default in Jackson.

magicprinc commented 1 year ago

@franz1981 Thank You! One hundred likes! 👍

My finest solution for how to work with ANY version of Jackson in virtual threads, without ANY object pool.

⇒ Combining the strengths of different pools.

    @Test void _jacksonInVtConcept () throws InterruptedException {
        Executor handMadeVirtualThreadExecutor = r->new Thread(r).start();// fresh thread per task => "empty" ThreadLocals, like a virtual thread

        Callable<String> taskReqThreadLocal = ()->{
                var m = Jack.MAPPER.readValue("{a:1,b:true}", Map.class);
                return Jack.MAPPER.writeValueAsString(m);
            };
        var doneSignal = new CountDownLatch(2);

        //1 slow "classic": run Jackson directly on the fresh thread, so its empty ThreadLocal forces new buffer allocations
        handMadeVirtualThreadExecutor.execute(Unchecked.runnable(()->{
            var jsonStr = taskReqThreadLocal.call();
            assertEquals("{\"a\":1,\"b\":true}", jsonStr);
            doneSignal.countDown();
        }));

        //2 fast "magic": run the Jackson work on ForkJoinPool.commonPool(), whose long-lived platform threads reuse their ThreadLocal BufferRecycler
        handMadeVirtualThreadExecutor.execute(Unchecked.runnable(()->{
            var jsonStr = ForkJoinPool.commonPool().submit(taskReqThreadLocal).get();
            assertEquals("{\"a\":1,\"b\":true}", jsonStr);
            doneSignal.countDown();
        }));

        assertTrue(doneSignal.await(2, TimeUnit.SECONDS));
    }

franz1981 commented 11 months ago

@cowtowncoder I think we can happily close this :) thanks for the hard work @pjfanning @cowtowncoder and @mariofusco !!!

cowtowncoder commented 11 months ago

Yes! Thank you for the reminder @franz1981.

LoranceChen commented 1 month ago

Hi, is there a conclusion on supporting virtual threads? I haven't seen any PR code change for jackson-core.

As for this one, it is a workaround, but performance is somewhat slower compared with executing directly.

Virtual threads: millions of them, slow, cheap to create and block, with "empty" ThreadLocals. Block them easily.

    //2 fast "magic": run the Jackson work on ForkJoinPool.commonPool(), whose long-lived platform threads reuse their ThreadLocal BufferRecycler
    handMadeVirtualThreadExecutor.execute(Unchecked.runnable(()->{
        var jsonStr = ForkJoinPool.commonPool().submit(taskReqThreadLocal).get();
        assertEquals("{\"a\":1,\"b\":true}", jsonStr);
        doneSignal.countDown();
    }));

@magicprinc This one is about 2~2.5x slower compared with executing directly. I just tested it on my macOS machine with a simple object.

By the way, using an OS thread is about 7x slower.

LoranceChen commented 1 month ago

We need a feature like a pinned "CarrierThreadLocal": something like ScopedValue, but used over a smaller range, for accessing a ThreadLocal-cached object.

I thought the usage could be similar to this snippet:

    // some virtual thread context here

    // use the CarrierThreadLocal to do the serialization
    var bytes = CarrierScopedValue.get(ObjectMapper)                 // get the carrier-thread-local object
                  .execute(objectMapper -> objectMapper.writeXXX()); // execute the "atomic" logic with the thread-local object

    // continue the logic on the virtual thread, using the result `bytes`

The CarrierScopedValue.execute method would promise that the virtual thread is not unmounted from the current carrier thread.

It would be a perfect solution if the JDK could support a CarrierThreadLocal, to meet the requirement of ThreadLocal-based object pooling on virtual threads.

franz1981 commented 1 month ago

@cowtowncoder I think this issue could be closed, given that https://github.com/FasterXML/jackson-core/pull/1061 has already been merged

cowtowncoder commented 1 month ago

@franz1981 Issue was closed a while ago.

@LoranceChen Let's not discuss things on closed threads: if more work/improvements are suggested, please file a new issue (with a link back to this one as background, if that makes sense).

LoranceChen commented 1 month ago

Got it. I will open a new topic if needed.