apache / bookkeeper

Apache BookKeeper - a scalable, fault tolerant and low latency storage service optimized for append-only workloads
https://bookkeeper.apache.org/
Apache License 2.0
1.91k stars 902 forks source link

ByteBuf returned by BookKeeper triggers CRC Checksum calculation when calling "readBytes" #4372

Open eolivelli opened 5 months ago

eolivelli commented 5 months ago

BUG REPORT

Describe the bug I have developting a Pulsar BrokerInterceptor. The BrokerInterceptor is able to process the data that Pulsar read from BookKeeper, without memory copies. While analysing a flamegraph I have seen that the ByteBuf returned by BookKeeper shows this weird behaviour and uses lot of CPU.

This is a flame graph. The version of BookKeeper is based on latest 4.16.x

image

To Reproduce

See the flamegraph

Expected behavior

readBytes has very little overhead

eolivelli commented 5 months ago

Interceptor code is here: https://github.com/datastax/pulsar-jms/blob/master/pulsar-jms-filters/src/main/java/com/datastax/oss/pulsar/jms/selectors/JMSPublishFilters.java#L277

lhotari commented 5 months ago

that seems like a strange flamegraph. I don't see how readBytes could trigger CRC calculation.

ByteBufVisitor is used in checksum calculations: https://github.com/apache/bookkeeper/blob/f8eb7a06a683e690e8a70b5bd72927e4df70aa51/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L69-L71 PR was #4196

eolivelli commented 5 months ago

The ByteBuf was not coming from a read from the BookKeeper client, because the ByteBug is coming from the network (it is the Pulsar producer that is sending a message and the interceptor processes it)

But maybe it is a ByteBuf recycled ?

lhotari commented 5 months ago

The ByteBuf was not coming from a read from the BookKeeper client, because the ByteBug is coming from the network (it is the Pulsar producer that is sending a message and the interceptor processes it)

But maybe it is a ByteBuf recycled ?

It's hard to see how it could result in the stacktrace even if there was a recycling bug. GetBytesCallbackByteBuf instance is not stored as a reference anywhere and gets passed as a parameter here: https://github.com/apache/bookkeeper/blob/f8eb7a06a683e690e8a70b5bd72927e4df70aa51/bookkeeper-server/src/main/java/org/apache/bookkeeper/util/ByteBufVisitor.java#L138 .

Perhaps it's a profiler issue.

Please add -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints to JVM options to prevent any issues in this area:

  • When agent is not loaded at JVM startup (by using -agentpath option) it is highly recommended to use -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints JVM flags. Without those flags the profiler will still work correctly but results might be less accurate. For example, without -XX:+DebugNonSafepoints there is a high chance that simple inlined methods will not appear in the profile. When the agent is attached at runtime, CompiledMethodLoad JVMTI event enables debug info, but only for methods compiled after attaching.

It might also be useful to compare Async Profiler 2.9 and 3.0 results. Just to be sure that the new stacktrace solution in 3.0 isn't causing the problem.

hangc0276 commented 5 months ago

I also found the checksum cost a lot of CPU

image
lhotari commented 5 months ago

I also found the checksum cost a lot of CPU image

@hangc0276 I guess it is expected to consume a lot of CPU? In Enrico's case, the flamegraph doesn't seem to be valid and my assumption was that adding -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints to JVM options would fix it since it's recommended to use these JVM options to get proper results while profiling.