pulsar-client 2.5 - Segmentation fault: 11 on Mac OS 10.14.6 using Python 3.7.6

spark-louissalin commented 4 years ago

Describe the bug Repeatedly opening and closing a client and producer will eventually cause a segmentation fault 11 and the python interpreter will crash.

To Reproduce Within a python 3.7.6 interpreter and using the pulsar-client 2.5 pip library, repeat the following until a seg fault happens:

create a client
create a producer
close the client

For example:

$ python
Python 3.7.6 (default, Dec 30 2019, 19:38:28)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pulsar
>>> client = pulsar.Client("pulsar://localhost:6650")
>>> producer = client.create_producer("some topic")
[pulsar output...]
>>> client.close()
[pulsar output...]
>>> client = pulsar.Client("pulsar://localhost:6650")
>>> producer = client.create_producer("some topic")
[pulsar output...]
Segmentation fault: 11
$

Expected behavior I should be able to run these 3 lines forever without trouble

Desktop (please complete the following information):

OS: Mac OS 10.14.6

apapia commented 4 years ago

Not sure if this is the same bug, but I'm seeing a segfault when running pulsar-client 2.5 with Python 3.7.7 on Mac OS 10.13.6. This is the stack trace I get:

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   _pulsar.cpython-37m-darwin.so   0x0000000104e2fd9d unsigned long boost::asio::detail::kqueue_reactor::cancel_timer<boost::asio::time_traits<boost::posix_time::ptime> >(boost::asio::detail::timer_queue<boost::asio::time_traits<boost::posix_time::ptime> >&, boost::asio::detail::timer_queue<boost::asio::time_traits<boost::posix_time::ptime> >::per_timer_data&, unsigned long) + 29
1   _pulsar.cpython-37m-darwin.so   0x0000000104e2fc64 boost::asio::detail::deadline_timer_service<boost::asio::time_traits<boost::posix_time::ptime> >::cancel(boost::asio::detail::deadline_timer_service<boost::asio::time_traits<boost::posix_time::ptime> >::implementation_type&, boost::system::error_code&) + 52
2   _pulsar.cpython-37m-darwin.so   0x0000000104e2d7d4 boost::asio::basic_deadline_timer<boost::posix_time::ptime, boost::asio::time_traits<boost::posix_time::ptime>, boost::asio::executor>::cancel() + 68
3   _pulsar.cpython-37m-darwin.so   0x0000000104f84e6b pulsar::ProducerStatsImpl::~ProducerStatsImpl() + 59
4   _pulsar.cpython-37m-darwin.so   0x0000000104f3986d pulsar::ProducerImpl::~ProducerImpl() + 1021
5   _pulsar.cpython-37m-darwin.so   0x0000000104de4acc boost::python::objects::value_holder<pulsar::Producer>::~value_holder() + 60
6   _pulsar.cpython-37m-darwin.so   0x0000000104fe1274 boost::python::objects::instance_dealloc(_object*) + 52

zbentley commented 3 years ago

I am also affected by this (pulsar client 2.8.0.post0, broker 2.8.0 standalone, Python 3.9.6 on MacOS 10.11). The following script reliably reproduces the segfault:

import pulsar

def get_producer():
    cl = pulsar.Client(
        'pulsar://localhost:6650',
        operation_timeout_seconds=10,
    )
    return cl.create_producer(
        'foobar',
    )

producer = get_producer()
producer.send(
    b'test_payload',
)

zbentley commented 3 years ago

@BewareMyPower I suggest updating this issue's description to reflect affected platforms, e.g. Pulsar client segfaults Python if any Producer instance outlives the garbage collection of any Client instance or similar.

merlimat commented 3 years ago

This problems stems by the fact that in C++ lib, you need to keep the Client instance alive. The Producer only keeps a weak_ptr to the Client to prevent a cyclic dependency.

In the boost::python wrapping, there's no easy way to keep a reference on the C++ Client instance. We'd need to create a custom wrapper class instead of automated wrapping done at https://github.com/apache/pulsar/blob/f784379323bdbd60db2825f3e6e63730d290a8f9/pulsar-client-cpp/python/src/producer.cc#L84

zbentley commented 3 years ago

@merlimat I suggest solving it in Python: have producer objects, when created, store a Python-side reference to the Client that created them. Python's cyclic garbage collector should be able to appropriately handle that cleanup, in python 3 at least. The cyclic GC works "all right", not perfectly, in Python 2, but is much improved in 3. I think this can be done without changes in the native code.

EDIT this might be what you're proposing above, in which case I'm sorry for the duplication.

That's not quite idiomatic, but the typical Python conventions for this kind of thing (use contextmanagers for producers/consumers) would require a larger refactor to the driver and would break back compatibility.

zbentley commented 3 years ago

Though I guess that raises a larger question: for consumers of the C++ API, it might be desirable in some cases to provide a producer object without providing the ability to create additional producers/consumers. If that's the case, it may make sense to alter the C++ code to use shared_ptr and allow producers to keep clients alive, allowing cases where a client is referenced only inside a producer object and not anywhere else. Since I don't code to the C++ API, I am not sure whether that's reasonable.

merlimat commented 3 years ago

Though I guess that raises a larger question: for consumers of the C++ API, it might be desirable in some cases to provide a producer object without providing the ability to create additional producers/consumers. If that's the case, it may make sense to alter the C++ code to use shared_ptr and allow producers to keep clients alive, allowing cases where a client is referenced only inside a producer object and not anywhere else. Since I don't code to the C++ API, I am not sure whether that's reasonable.

Typically it's not a big deal in the sense that you're expected to keep 1 single instance of C++ Pulsar client for the duration of your application. You can still pass on producers instances to sub-components of that application.

merlimat commented 3 years ago

@merlimat I suggest solving it in Python: have producer objects, when created, store a Python-side reference to the Client that created them. Python's cyclic garbage collector should be able to appropriately handle that cleanup, in python 3 at least. The cyclic GC works "all right", not perfectly, in Python 2, but is much improved in 3. I think this can be done without changes in the native code.

EDIT this might be what you're proposing above, in which case I'm sorry for the duplication.

I was actually thinking in fixing that in the C++/Python wrapper, but your suggestion is much easier to implement!

merlimat commented 3 years ago

The cyclic GC works "all right", not perfectly, in Python 2, but is much improved in 3. I think this can be done without changes in the native code.

I wouldn't even need to be a cyclic dependency. There are 2 layers of wrappers:

pure-python object --> boost-python generated wrapper --> C++ class

We just need to retain a reference to the client boost-python object from the pure-python producer object.

apache / pulsar

pulsar-client 2.5 - Segmentation fault: 11 on Mac OS 10.14.6 using Python 3.7.6 #6463