apache / pulsar-client-python

Apache Pulsar Python client library
https://pulsar.apache.org/
Apache License 2.0

2.10.1 client deadlocks on Python GIL during logging while nacking and receiving messages #116

Closed zbentley closed 1 year ago

zbentley commented 1 year ago

We've observed full Python interpreter lockups (not just "blocking": the interpreter calling the client halts and can't be unblocked, time out, or raise exceptions, even if the blocking operation is moved to a Python thread and waited on with a timeout) in the presence of:

All of those have to be present to trigger the issue. When multiple Shared consumers repeatedly nack messages with a 15-second delay on a topic with a few hundred messages (100% of them are nacked over and over), all but one of the consumers eventually (within a few minutes) locks up; that is, no Python code in that consumer can run. It's not just that it's blocked in a negative_acknowledge call: all threads, signal handlers, coroutines, etc. in that interpreter are stuck. This says GIL conflict to me.
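For reference, a minimal reproduction sketch along those lines might look like the following. The service URL, topic, subscription name, and consumer count are placeholders rather than values from the report; it assumes a broker is running and the topic already has a backlog of messages.

```python
# Repro sketch (illustrative values only): several Shared consumers in one
# interpreter, all nacking every message with a 15-second redelivery delay,
# while a Python logging.Logger is passed to the client.
import logging
import threading

import pulsar

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pulsar-nack-repro")

# Passing a Python logger routes the C++ client's log calls through the
# LoggerWrapper, which must acquire the GIL -- one of the conditions above.
client = pulsar.Client("pulsar://localhost:6650", logger=log)


def consume_and_nack(name: str) -> None:
    consumer = client.subscribe(
        "persistent://public/default/nack-deadlock-test",  # placeholder topic
        subscription_name="shared-sub",
        consumer_name=name,
        consumer_type=pulsar.ConsumerType.Shared,
        negative_ack_redelivery_delay_ms=15_000,  # 15s nack delay, as in the report
    )
    while True:
        msg = consumer.receive()
        # Nack 100% of messages so they are redelivered over and over.
        consumer.negative_acknowledge(msg)


threads = [
    threading.Thread(target=consume_and_nack, args=(f"consumer-{i}",), daemon=True)
    for i in range(4)  # multiple Shared consumers in the same interpreter
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```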

While this program has many hundreds of threads, the stack traces from the most relevant ones are attached here: threads.txt

BewareMyPower commented 1 year ago

From the thread info, the root cause is still related to the NegativeAcksTracker. Here is the simplified thread info:

Thread 2 stuck at:
  acquiring lock in NegativeAcksTracker::add

Thread 14 stuck at:
  acquiring GIL in LoggerWrapper::log
    ConsumerImpl::redeliverMessages
      NegativeAcksTracker::handleTimer

----

Thread 6 stuck at:
  acquiring GIL in LoggerWrapper::log
    ClientConnection::handleIncomingCommand

Thread 8 stuck at:
  acquiring GIL in LoggerWrapper::log
    ClientConnection::getConnectionAsync
      PartitionedConsumerImpl::getPartitionMetadata

Thread 15 stuck at:
  acquiring GIL in LoggerWrapper::LoggerWrapper
    ConsumerStatsImpl::flushAndReset

Thread 13 stuck at
  acquiring GIL in LoggerWrapper::log
    ClientConnection::handleIncomingCommand

Nearly all threads except thread 2 are stuck acquiring the GIL in LoggerWrapper's methods. That means the GIL is held by another thread, most likely thread 2, because it calls negative_acknowledge. However, thread 2 is also stuck, because the lock in NegativeAcksTracker is held by thread 14.

In conclusion:

Since the GIL won't be released, all threads that try to acquire it get stuck. This blocks nearly all threads whenever the Python logger is customized, because every method of LoggerWrapper requires the GIL.
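The cycle can be illustrated with two plain Python locks standing in for the GIL and the NegativeAcksTracker mutex. This is only an analogy of the lock-ordering problem described above, not the client's actual code; the sleeps just make the interleaving deterministic.

```python
# Analogy only: `gil` stands in for the Python GIL and `tracker_lock` for the
# NegativeAcksTracker mutex.
import threading
import time

gil = threading.Lock()
tracker_lock = threading.Lock()


def nack_thread():
    # Plays the role of thread 2: Python code runs while holding the GIL...
    with gil:
        time.sleep(0.1)
        # ...then negative_acknowledge -> NegativeAcksTracker::add wants the tracker lock.
        with tracker_lock:
            pass


def timer_thread():
    # Plays the role of thread 14: the timer callback holds the tracker lock...
    with tracker_lock:
        time.sleep(0.1)
        # ...then LoggerWrapper::log wants the GIL.
        with gil:
            pass


t1 = threading.Thread(target=nack_thread)
t2 = threading.Thread(target=timer_thread)
t1.start(); t2.start()
# Hangs here: each thread waits for the lock the other holds.
t1.join(); t2.join()
```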

I'm going to dig further into the code and see how we can fix this deadlock.

(BTW, this thread info is strange in that I see CaptivePythonObjectMixin in it, but that class was first introduced in 3.0.0.)

BewareMyPower commented 1 year ago

I opened a PR in the C++ client repo to avoid the deadlock caused by a user-provided logger: https://github.com/apache/pulsar-client-cpp/pull/266