saugatac opened this issue 8 years ago
Ouch, that's a bummer. The blocking on the lazy producer should only happen on application start (can you confirm that this blocking behaviour only happens at application startup?).
Regarding the stack you posted, it would be really interesting to know where that Logger.debug call (last line) comes from. If it comes from the new KafkaProducer() itself, we have a real problem. If the problem is still reproducible, you could try to silence all Kafka logs in the logger configuration (i.e., setting the log level for "org.apache.kafka" to OFF). If that helps, it could be a real problem in the appender.
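For reference, silencing the Kafka client's own logs in logback.xml could look roughly like this (a sketch; the appender class name and the rest of the configuration are assumptions, only the `org.apache.kafka` logger line is the suggestion from above):

```xml
<configuration>
  <appender name="kafkaAppender" class="com.github.danielwegener.logback.kafka.KafkaAppender">
    <!-- your existing appender configuration here -->
  </appender>

  <!-- Silence the Kafka client's own logs so they cannot re-enter the appender -->
  <logger name="org.apache.kafka" level="OFF"/>

  <root level="debug">
    <appender-ref ref="kafkaAppender"/>
  </root>
</configuration>
```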
Background: we have a bootstrap logic that delays all slf4j log calls from the kafka package until a log message from a non-kafka package arrives. This mechanism should ensure that the LogbackKafkaAppender cannot block itself recursively while it is starting (i.e., the appender starts the KafkaProducer, the KafkaProducer logs something and calls the appender, which then blocks until the KafkaProducer is started). However, if a non-Kafka message sneaks into the appender, especially from the thread that is bootstrapping the appender at that moment, it could run into a deadlock.
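The bootstrap deferral described above can be sketched roughly like this (a hypothetical simplification with made-up names, not the appender's actual code): events from Kafka packages are queued until the first non-Kafka event arrives, then everything is flushed.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the bootstrap deferral: log events originating
// from org.apache.kafka are held back until the first event from any
// other package arrives, so the appender cannot block on the
// KafkaProducer it is itself starting.
public class DeferringAppender {
    private final Queue<String> deferred = new ArrayDeque<>();
    private boolean bootstrapped = false;
    private final StringBuilder delivered = new StringBuilder();

    public synchronized void append(String loggerName, String message) {
        if (!bootstrapped && loggerName.startsWith("org.apache.kafka")) {
            deferred.add(loggerName + ": " + message); // hold back Kafka's own logs
            return;
        }
        bootstrapped = true;
        while (!deferred.isEmpty()) {                  // flush held-back events first
            deliver(deferred.poll());
        }
        deliver(loggerName + ": " + message);
    }

    private void deliver(String line) {                // stand-in for the real Kafka send
        delivered.append(line).append('\n');
    }

    public String delivered() {
        return delivered.toString();
    }
}
```

The weakness mentioned above is visible here too: if a non-Kafka event arrives while the producer is still starting, the flush path runs immediately and can block.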
Some more information would be useful:
Edit: @saugatac Sorry that this reply took a while.
@saugatac @danielwegener this can happen with logging level set to TRACE or ALL, since the Kafka client quickly makes a recursive call to doAppend() which is not caught by logback's reentrancy guard. In my pull request I have moved the deferral logic inside the guard, which will lead to recursive kafka log messages being dropped completely.
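A thread-local reentrancy guard of the kind referred to above can be sketched as follows (a hypothetical illustration, not logback's or the pull request's actual code): a recursive doAppend() on the same thread is detected and the event is dropped instead of deadlocking.

```java
// Hypothetical sketch of a ThreadLocal reentrancy guard: if the appender
// re-enters itself on the same thread (e.g. the Kafka client logging while
// being started by the appender), the recursive event is dropped.
public class GuardedAppender {
    private final ThreadLocal<Boolean> inAppend = ThreadLocal.withInitial(() -> false);
    private int appended = 0;
    private int dropped = 0;

    public void doAppend(String message) {
        if (inAppend.get()) {                    // recursive call on this thread: drop it
            dropped++;
            return;
        }
        inAppend.set(true);
        try {
            appended++;
            if (message.equals("trigger")) {
                doAppend("recursive kafka log"); // simulates the Kafka client logging back
            }
        } finally {
            inAppend.set(false);
        }
    }

    public int appended() { return appended; }
    public int dropped()  { return dropped; }
}
```

With the deferral logic placed inside such a guard, recursive Kafka log messages are silently discarded rather than queued, which matches the behaviour described in the comment above.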
@danielwegener, sorry I somehow missed the notification from your comment.
The next class in the stack trace was our business class; I'll need to search a bit for the thread dump and will update with the full dump. Yes, the application is also a Kafka consumer, of the same Kafka cluster. The Kafka client is 0.9.0.0 and logback-kafka-appender is 0.1.0. This is a Jetty/Spring application. Everything is a Spring bean, so I would assume that initialization is single-threaded, but I cannot definitively say whether multiple threads were trying to write before the logback infrastructure had settled. The application in question was initially listening on RabbitMQ only; during that time the appender worked great for over a month. The issue happened a couple of days after it started listening on Kafka, so there might be a correlation.
@aerskine We do not use TRACE/ALL (we use DEBUG), but the timing was around peak time for the app, and it does generate a lot of logs at that time.
My hypothesis is that it has something to do with the combination of high load/concurrency and Kafka being used both for consuming messages and for sending logs.
@danielwegener any update on this issue? Not being able to log anything above INFO level is a serious limitation :( ... if you can provide details on your findings, I can help with the actual implementation of the fix.
Hi, we're using the configuration below. The application was working fine for a few weeks. One fine day it slowed down and a lot of threads showed as blocked. There's no deadlock in the thread dump. All applications are in the same network/DC, so there shouldn't be network issues. The Kafka system doesn't seem to have any resource-related (CPU/memory) issues. I'm not sure of the root cause yet. Please let me know if you need more information.