saugatac opened this issue 8 years ago
Ouch, that's a bummer. The blocking on the lazy producer should only happen on application start (can you confirm that this blocking behaviour only happens at application startup?).
Regarding the stack you posted, it would be really interesting to know where that Logger.debug call (last line) comes from. If it comes from the new KafkaProducer() itself, we have a real problem. If the problem is still reproducible, you could try to silence all Kafka logs in the logger configuration (i.e., setting the log level for "org.apache.kafka" to OFF). If that helps, it could be a real problem in the appender.
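For reference, silencing the Kafka client's own logs in logback.xml could look roughly like this (a sketch; the appender class name and the rest of the configuration are assumptions, only the `org.apache.kafka` logger line is the suggestion from above):

```xml
<configuration>
  <appender name="kafkaAppender" class="com.github.danielwegener.logback.kafka.KafkaAppender">
    <!-- your existing appender configuration here -->
  </appender>

  <!-- Silence the Kafka client's own logs so they cannot re-enter the appender -->
  <logger name="org.apache.kafka" level="OFF"/>

  <root level="debug">
    <appender-ref ref="kafkaAppender"/>
  </root>
</configuration>
```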
Background: we have a bootstrap logic that delays all slf4j log calls from the kafka package until a log message from a non-kafka package arrives. This mechanism should ensure that the LogbackKafkaAppender cannot block itself recursively while it is starting (i.e., the appender starts the KafkaProducer, the KafkaProducer logs something and calls the appender, which then blocks until the KafkaProducer is started). However, if a non-Kafka message sneaks into the appender, especially from the thread that is bootstrapping the appender at that moment, it could run into a deadlock.
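The bootstrap deferral described above can be sketched roughly like this (a hypothetical simplification with made-up names, not the appender's actual code): events from Kafka packages are queued until the first non-Kafka event arrives, then everything is flushed.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the bootstrap deferral: log events originating
// from org.apache.kafka are held back until the first event from any
// other package arrives, so the appender cannot block on the
// KafkaProducer it is itself starting.
public class DeferringAppender {
    private final Queue<String> deferred = new ArrayDeque<>();
    private boolean bootstrapped = false;
    private final StringBuilder delivered = new StringBuilder();

    public synchronized void append(String loggerName, String message) {
        if (!bootstrapped && loggerName.startsWith("org.apache.kafka")) {
            deferred.add(loggerName + ": " + message); // hold back Kafka's own logs
            return;
        }
        bootstrapped = true;
        while (!deferred.isEmpty()) {                  // flush held-back events first
            deliver(deferred.poll());
        }
        deliver(loggerName + ": " + message);
    }

    private void deliver(String line) {                // stand-in for the real Kafka send
        delivered.append(line).append('\n');
    }

    public String delivered() {
        return delivered.toString();
    }
}
```

The weakness mentioned above is visible here too: if a non-Kafka event arrives while the producer is still starting, the flush path runs immediately and can block.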
Some more information would be useful:
Edit: @saugatac Sorry that this reply took a while.
@saugatac @danielwegener this can happen with logging level set to TRACE or ALL, since the Kafka client quickly makes a recursive call to doAppend() which is not caught by logback's reentrancy guard. In my pull request I have moved the deferral logic inside the guard, which will lead to recursive kafka log messages being dropped completely.
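A thread-local reentrancy guard of the kind referred to above can be sketched as follows (a hypothetical illustration, not logback's or the pull request's actual code): a recursive doAppend() on the same thread is detected and the event is dropped instead of deadlocking.

```java
// Hypothetical sketch of a ThreadLocal reentrancy guard: if the appender
// re-enters itself on the same thread (e.g. the Kafka client logging while
// being started by the appender), the recursive event is dropped.
public class GuardedAppender {
    private final ThreadLocal<Boolean> inAppend = ThreadLocal.withInitial(() -> false);
    private int appended = 0;
    private int dropped = 0;

    public void doAppend(String message) {
        if (inAppend.get()) {                    // recursive call on this thread: drop it
            dropped++;
            return;
        }
        inAppend.set(true);
        try {
            appended++;
            if (message.equals("trigger")) {
                doAppend("recursive kafka log"); // simulates the Kafka client logging back
            }
        } finally {
            inAppend.set(false);
        }
    }

    public int appended() { return appended; }
    public int dropped()  { return dropped; }
}
```

With the deferral logic placed inside such a guard, recursive Kafka log messages are silently discarded rather than queued, which matches the behaviour described in the comment above.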
@danielwegener, sorry I somehow missed the notification from your comment.
The next class in the stack trace was our business class; I'll need to search a bit for the thread dump and will update with the full dump. Yes, the application is also a Kafka consumer, of the same Kafka cluster. The Kafka client is 0.9.0.0 and logback-kafka-appender is 0.1.0. This is a Jetty/Spring application. Everything is a Spring bean, so I would assume that initialization is single-threaded, but I cannot definitively say whether multiple threads were trying to write before the logback infrastructure had settled. The application in question was initially listening on RabbitMQ only; during that time the appender worked great for over a month. The issue happened a couple of days after it started listening on Kafka, so there might be a correlation.
@aerskine We do not use TRACE/ALL (we use DEBUG), but the timing was around peak time for the app, and it does generate a lot of logs at that time.
My hypothesis is that it has something to do with the combination of high load/concurrency and Kafka being used both for consuming messages and for sending logs.
@danielwegener any update on this issue? Not being able to log anything above INFO level is a serious limitation :( ... if you can provide details on your findings, I can help with the actual implementation of the fix.
Hi, we're using the configuration below. The application was working fine for a few weeks. One fine day it slowed down and a lot of threads showed as blocked. There's no deadlock in the thread dump. All applications are in the same network/DC, so there shouldn't be network issues. The Kafka system doesn't seem to have any resource-related (CPU/memory) issues. I'm not sure of the root cause yet. Please let me know if you need more information.