eclipse-vertx / vert.x

Vert.x is a tool-kit for building reactive applications on the JVM
http://vertx.io

Vertx pooled allocator should be the same as Netty #5168

Open franz1981 opened 7 months ago

franz1981 commented 7 months ago

Currently, vertx (see https://github.com/eclipse-vertx/vert.x/blob/adbe97600cc6215f15ce2fac629c89f93618ca8f/src/main/java/io/vertx/core/buffer/impl/VertxByteBufAllocator.java#L25) is NOT reusing Netty's PooledByteBufAllocator.DEFAULT, which causes the creation of additional thread-local direct buffers and arenas and enlarges the RSS footprint of vertx applications for no reason.
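A minimal sketch of the difference being described (my own example, not code from the Vert.x tree; the class and field names are made up), assuming Netty 4.1.x on the classpath: a dedicated `PooledByteBufAllocator` instance maintains its own arenas and per-thread caches, while reusing `PooledByteBufAllocator.DEFAULT` shares them with every other Netty user in the process.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

// Illustrative only: contrasts a dedicated pooled allocator (what a separate
// Vert.x allocator effectively ends up being) against reusing Netty's shared
// singleton.
public class SeparateVsSharedPool {

    // A dedicated pool: gets its own arenas and per-thread caches.
    static final PooledByteBufAllocator DEDICATED = new PooledByteBufAllocator(true);

    // The process-wide singleton most Netty-based libraries already use.
    static final PooledByteBufAllocator SHARED = PooledByteBufAllocator.DEFAULT;

    public static void main(String[] args) {
        ByteBuf a = DEDICATED.directBuffer(256); // served by the dedicated arenas
        ByteBuf b = SHARED.directBuffer(256);    // served by the shared arenas
        a.release();
        b.release();
        // With two allocators in play, both sets of arenas stay resident,
        // which is the RSS duplication described above.
    }
}
```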

What's the reason behind this choice @vietj?

The reason it should be changed is to ease the life of users and libraries that allocate Netty direct buffers through the Netty default allocator and can end up populating a second set of arenas because of this.

If the aforementioned pool reuses the Netty one it will clearly get some additional contention, but it will save memory, which seems a reasonable trade-off.

zekronium commented 2 months ago

With this as is, it's impossible to use the new adaptive allocator too.

@franz1981 this might affect quarkus too

Context: https://github.com/netty/netty/pull/13976

franz1981 commented 2 months ago

The reality seems more complex, see https://github.com/eclipse-vertx/vert.x/blob/7fdc3980c4e70bb71d3b27d471a41f17db581bca/vertx-core/src/main/java/io/vertx/core/net/impl/NetServerImpl.java#L516-L518

In short: with SSL we use a different allocator (here -> https://github.com/eclipse-vertx/vert.x/blob/master/vertx-core/src/main/java/io/vertx/core/buffer/impl/PartialPooledByteBufAllocator.java), which uses the custom vertx allocator mentioned at https://github.com/eclipse-vertx/vert.x/issues/5168#issue-2208289433; without SSL, instead, we just use the default Netty allocator.

To enable the adaptive one we should "at least" move to a different entry point that picks up the Netty-configured "default" allocator, i.e. https://github.com/netty/netty/blob/4.1/buffer/src/main/java/io/netty/buffer/ByteBufAllocator.java#L24
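For reference, a hedged sketch of the two entry points (my example, not Vert.x code); it assumes a Netty 4.1.x release where `ByteBufAllocator.DEFAULT` is selected from the `io.netty.allocator.type` system property, and where the running Netty version accepts `adaptive` as a value for it:

```java
import io.netty.buffer.ByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocator;

// Sketch of the two "default" entry points discussed above.
public class DefaultAllocatorEntryPoints {
    public static void main(String[] args) {
        // Hardcoded pooled singleton: always the pooled implementation,
        // regardless of how Netty itself was configured.
        ByteBufAllocator hardcoded = PooledByteBufAllocator.DEFAULT;

        // Configured default: whatever Netty selected (pooled, unpooled and,
        // if the running Netty version supports it, adaptive) based on the
        // -Dio.netty.allocator.type system property.
        ByteBufAllocator configured = ByteBufAllocator.DEFAULT;

        System.out.println(hardcoded.getClass().getSimpleName());
        System.out.println(configured.getClass().getSimpleName());
    }
}
```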

franz1981 commented 2 months ago

@vietj @cescoffier this is something to fix on the quarkus side as well, I think.

This can affect quarkus as well: if we don't reuse the allocator configured on the Netty instance, which could be different from the PooledByteBufAllocator.DEFAULT singleton, we end up allocating from both (and paying twice the RSS).

e.g. https://github.com/quarkusio/quarkus/blob/261cc877718ef24dd681cb1f3bb1547208535fca/independent-projects/vertx-utils/src/main/java/io/quarkus/vertx/utils/AppendBuffer.java#L136

@vietj @geoand In the case above what's the "idiomatic" way for quarkus to obtain the configured allocator? I cannot see any chain of calls to get there, TBH.

geoand commented 2 months ago

In the case above what's the "idiomatic" way for quarkus to obtain the configured allocator? I cannot see any chain of calls to get there, TBH.

I am not sure I follow this... What exactly are you after here?

franz1981 commented 2 months ago

Yep, so, in short:

  1. based on the SSL configuration Vertx can use its own PartialPooledByteBufAllocator.INSTANCE or the Netty PooledByteBufAllocator.DEFAULT
  2. Quarkus AppendBuffer (and other code paths) just use the Netty one i.e. PooledByteBufAllocator.DEFAULT
  3. If SSL is configured, the RSS of the off-heap arenas doubles, because the vertx and quarkus direct allocators are different

To solve this, we should be able to obtain from vertx itself which allocator it is configured with, so we won't duplicate the RSS. Makes sense @geoand?
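To make the duplication concrete, here is a small illustrative snippet (not a proposed API; it reaches into the Vert.x-internal PartialPooledByteBufAllocator mentioned above, and the class name/main method are just for the sketch): with SSL configured, Vert.x allocates through its own instance while Quarkus code such as AppendBuffer goes straight to PooledByteBufAllocator.DEFAULT, so the two sides never share arenas.

```java
import io.netty.buffer.ByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocator;
import io.vertx.core.buffer.impl.PartialPooledByteBufAllocator;

// Illustrative only: shows why the arenas are duplicated when SSL is on.
public class DuplicatedArenas {
    public static void main(String[] args) {
        // What Vert.x hands to Netty channels when SSL is configured (point 1).
        ByteBufAllocator vertxSide = PartialPooledByteBufAllocator.INSTANCE;
        // What Quarkus' AppendBuffer (and similar code paths) use today (point 2).
        ByteBufAllocator quarkusSide = PooledByteBufAllocator.DEFAULT;
        // Different allocator instances => independent pools => twice the RSS (point 3).
        System.out.println(vertxSide == quarkusSide); // false
    }
}
```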

geoand commented 2 months ago

Sure, that makes sense, but it won't really involve any Quarkus APIs or anything - we just need the extension code to do the proper thing.

franz1981 commented 2 months ago

IDK @geoand, it probably requires some API change up to Vert.x, to make sure we can query vertx for the configured allocator. Or do you have a better idea in mind? I'm very open, I don't know yet how to solve it, TBH.

geoand commented 2 months ago

Sounds reasonable

franz1981 commented 2 months ago

I've tried a different less invasive approach to this: see https://github.com/eclipse-vertx/vert.x/pull/5262

franz1981 commented 2 months ago

@zekronium are you a quarkus or a vertx user? Just to understand whose feedback I'm asking for, in case you want to give adaptive a shot :)

zekronium commented 2 months ago

@zekronium are you a quarkus or a vertx user? Just to understand whose feedback I'm asking for, in case you want to give adaptive a shot :)

98% a Vert.x user, with BoringSSL (tc-native) and JDK 21 (unsafe enabled with opens and useReflection=true) on top of that too. ~Also running virtual threads in some projects, which, based on the known benefits of the adaptive pooler, might be interesting to see if it helps~

In quarkus, OpenSSL/BoringSSL is currently not supported per se, from what I've seen in the issues, right?

EDIT: Scratch the virtual thread part. If I recall correctly, Vert.x, even with VTs, performs all the IO on the event loops and returns you (Unsafe)HeapBuffers, which are unpooled afaik in all cases and which this new allocator does not cover. So because the virtual threads won't actually interact with the allocator, the benefit of potentially better performance with VTs won't be measurable in my case, if I got that right.

franz1981 commented 2 months ago

Exactly @zekronium - probably the only benefit you could get on VTs is if you do things on your own using the vertx allocator. And clearly in Quarkus, because we use it for a few things (e.g. jackson json encoding).

zekronium commented 2 months ago

@franz1981 Is Quarkus jackson json encoding different to the Vert.x one?

franz1981 commented 2 months ago

Yes @zekronium, and it will get better with time, like wine :) see https://github.com/quarkusio/quarkus/pull/41063

where we bytecode generate the serializers to save reflection ;)

zekronium commented 2 months ago

Ah, reminds me of jsoniter-dynamic, they do similar things, but it's old and unmaintained now. Waiting for the backport :)

zekronium commented 2 months ago

@zekronium are you a quarkus or a vertx user? Just to understand whose feedback I'm asking for, in case you want to give adaptive a shot :)

I have benchmarked our client application. Which method would you suggest primarily looking at as a common hot path for both adaptive and default, one that would allow a clear, direct comparison?

For apples to apples I chose newDirectBuffer, which I think is a good one to look at since it universally covers IO with our settings:

[profiler screenshots: newDirectBuffer, default vs adaptive]

This test was run as a simple PoC with quite light load for 2 minutes. With it, there seems to be about a ~3x improvement!

These parts seem to consume slightly more CPU time, but might be by design?

[profiler screenshots: other buffer operations]

franz1981 commented 2 months ago

To better understand @zekronium :

  1. what's the type of benchmark you ran?
  2. what was the throughput/rps for the 2 tests (without adaptive/with adaptive)?
  3. what's the profiler you used to get the numbers in https://github.com/eclipse-vertx/vert.x/issues/5168#issuecomment-2259633047? What do these numbers mean/which resource do they measure?
  4. any figure for the RSS/actual direct memory used?

zekronium commented 2 months ago

This was a simple "feeler" test I ran using our request client for two minutes. I simply attached the IntelliJ profiler on my M1 Max Mac.

It was a realistic load run: about 1k rps out, 1k connections. Rough memory usage was about 2GB with unsafe.

I will come back with a proper benchmark on proper hardware, but I hope this info is enough for now.

franz1981 commented 2 months ago

About your finding on the cost of the buffer operations: it is expected, because the adaptive buffers piggyback on the usual Netty ones, but it is still way too much. I will run some micro-benchmarks on Netty itself to see if we can improve there and verify it.
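Something along the lines of the following JMH sketch is what I have in mind (hypothetical, not an existing Netty benchmark; it assumes a Netty 4.1.x version that ships io.netty.buffer.AdaptiveByteBufAllocator with a no-arg constructor):

```java
import java.util.concurrent.TimeUnit;

import io.netty.buffer.AdaptiveByteBufAllocator;
import io.netty.buffer.ByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocator;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Sketch of a micro-benchmark comparing the per-allocation cost of the
// pooled and adaptive allocators for direct buffers.
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class DirectAllocationBenchmark {

    @Param({"256", "8192"})
    int size;

    ByteBufAllocator pooled;
    ByteBufAllocator adaptive;

    @Setup
    public void setup() {
        pooled = PooledByteBufAllocator.DEFAULT;
        adaptive = new AdaptiveByteBufAllocator();
    }

    @Benchmark
    public boolean pooledDirect() {
        // allocate + release, so the buffer goes back to the pool
        return pooled.directBuffer(size).release();
    }

    @Benchmark
    public boolean adaptiveDirect() {
        return adaptive.directBuffer(size).release();
    }
}
```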