Closed chris-miller-klaviyo closed 1 year ago
I also notice that the example code in the docs for Async producer example
results in this same deadlock w/ pulsar-client 3.0 in my testing.
https://pulsar.staged.apache.org/docs/v1.21.0-incubating/clients/Python/#Asyncproducerexample-1zg7id
This seems like a really fundamental breakage. The c++ code doesn't look GIL-safe at all. One of the main happy paths with this client seems to potentially corrupt interpreter state (just getting a deadlock, as in this issue, seems like a happy coincidence; operating outside of the GIL could break anything, right?)
Would this merit pulling or otherwise hot-fixing the PyPi package?
@zbentley It's mostly a difference between Boost:Py and PyBind.
The c++ code doesn't look GIL-safe at all.
It's only the sendAsync that is impacted.
One of the main happy paths with this client seems to potentially corrupt interpreter state
The problem is the reverse.. doing stuff with the mutex locked while it should have been unlocked. I cannot see any potential danger to corrupt the interpreter state.
Would this merit pulling or otherwise hot-fixing the PyPi package?
We cannot pull packages from PyPI, though we'll get a release out ASAP.
@merlimat can you expand on It's only the sendAsync that is impacted
?
I also was a little confused as to how GIL handling works in the fix PR: https://github.com/apache/pulsar-client-python/pull/87/files#r1093385236
Unless something is handling grabbing the GIL when the callback is triggered, it seems like that PR may resolve the deadlock at the expense of making the python interactions non-thread-safe.
@zbentley
can you expand on It's only the sendAsync that is impacted?
In this Python wrapper we mostly expose sync methods, eg: client.create_produce()
, etc.. The only async method is producer.send_async()
.
For sync methods, we internally use the C++ async version (to avoid blocking when checking for Python interruptions) and handle the GIL properly. We have a function for that purpose: https://github.com/apache/pulsar-client-python/blob/main/src/utils.h#L60-L87
Unless something is handling grabbing the GIL when the callback is triggered, it seems like that PR may resolve the deadlock at the expense of making the python interactions non-thread-safe.
You can see in the stack trace above that PyBind is already automatically acquiring the GIL when triggering the Python callback :
thread #2
frame #0: 0x00007ff80a1943ea libsystem_kernel.dylib`__psynch_cvwait + 10
frame #1: 0x00007ff80a1cea6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
frame #2: 0x0000000105a3cb9f python3.7`take_gil + 255
frame #3: 0x0000000105a3cfb3 python3.7`PyEval_AcquireThread + 19
frame #4: 0x000000010870be43 _pulsar.cpython-37m-darwin.so`pybind11::gil_scoped_acquire::gil_scoped_acquire() + 83
frame #5: 0x00000001087714f3 _pulsar.cpython-37m-darwin.so`pybind11::detail::type_caster<std::__1::function<void (pulsar::Result, pulsar::MessageId const&)>, void>::load(pybind11::handle, bool)::func_handle::func_handle(func_handle const&) + 35
frame #6: 0x00000001087715f1 _pulsar.cpython-37m-darwin.so`std::__1::__function::__func<pybind11::detail::type_caster<std::__1::function<void (pulsar::Result, pulsar::MessageId const&)>, void>::load(pybind11::handle, bool)::func_wrapper, std::__1::allocator<pybind11::detail::type_caster<std::__1::function<void (pulsar::Result, pulsar::MessageId const&)>, void>::load(pybind11::handle, bool)::func_wrapper>, void (pulsar::Result, pulsar::MessageId const&)>::__clone() const + 49
frame #7: 0x00000001088daccd _pulsar.cpython-37m-darwin.so`std::__1::__function::__func<pulsar::ProducerImpl::sendAsync(pulsar::Message const&, std::__1::function<void (pulsar::Result, pulsar::MessageId const&)>)::$_2, std::__1::allocator<pulsar::ProducerImpl::sendAsync(pulsar::Message const&, std::__1::function<void (pulsar::Result, pulsar::MessageId const&)>)::$_2>, void (pulsar::Result, pulsar::MessageId const&)>::__clone() const + 93
frame #8: 0x000000010878f7de _pulsar.cpython-37m-darwin.so`pulsar::OpSendMsg::OpSendMsg(pulsar::OpSendMsg const&) + 126
That makes sense @merlimat, thanks. What about "async" (I think it's more "c++ deciding to call Python" rather than the other way around) functionality like the Python logger, or the consumer's on-message callbacks?
The logger and the message listeners should be fine since they are both going to acquire the GIL when entering the Python callbacks.
This is assuming that after this PR there are no other places that are going to lock on any Pulsar client internal locks while holding the GIL, as was the case with send_async()
.
I understand now. Thank you!
Thanks for the clear repro code and the above info. That made it really easy!
Hi, is there a chance that the latest release candidate might be released soon 🙏 we're looking forward to having this bug fixed in an official release! 🎉
This script reproduces the deadlock pretty reliably with python 3.7.13, and pulsar-client-3.0 on an x86 Macbook Pro. We originally noticed this in much more complex producer logic in our CI pipeline on linux, but this minimally seems to get into the same deadlocked state.
With an older pulsar-client version
2.10.2
, this works as expected:However, if I upgrade to
pulsar-client==3.0
, that script gets deadlocked, and does not respond to SIGINT:We have observed this on x86 Mac laptops, and on linux (in our CI system, testing a much more complex producer than in the script above).
lldb
on mac shows the following thread dump of the deadlocked process:I'm not really a pybind/boost expert, but it looks to me like maybe PyBind11 is trying to acquire the GIL in a way that ends up causing deadlocks that didn't occur before PyBind11 was introduced.
Do you have a sense of what that deadlock may be caused by, and how to fix it?