apache / pulsar-client-python

Apache Pulsar Python client library
https://pulsar.apache.org/
Apache License 2.0
53 stars 43 forks source link

[Bug] Segmentation fault when properties exceed ~100kb #223

Open emanueledomingo opened 2 months ago

emanueledomingo commented 2 months ago

Hi everyone, i'm running an event driven app and i use apache pulsar as backbone. I use message's properties to exchange metadata between the services.

I noted that, when the properties exceed ~100kb the client gives Segmentation fault (core dumped) (sometimes corrupted double-linked list instead)

I haven't done a proper test on the size, so take the size as a guideline

I was able to simulate the behaviour in a notebook:

import pulsar
import sys
import json

# need to serialize to json nested object because client accepts properties only
# of type (self: _pulsar.MessageBuilder, arg0: str, arg1: str)
nested_dict = json.dumps({
    "foo": "bar"
})

big_dict = {
    f"key{i}": nested_dict for i in range(3000)
}

print(sys.getsizeof(big_dict))  # 103856

client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer('test-topic')

producer.send(content="test-message".encode(), properties=big_dict)  # segfault

Unfortunately i don't have details in the stacktrace:

In [4]: producer.send(content="test-message".encode(), properties=big_dict)
Segmentation fault (core dumped)

Pulsar client version is 3.5.0.

lhotari commented 2 months ago

🙈 I made a mistake in moving this issue. I first thought that it was in the apache/pulsar repository and transferred it to pulsar-client-cpp. I'm sorry about that. Perhaps this is pulsar-client-cpp related in any case.

emanueledomingo commented 2 months ago

Ok, thanks! I also think it has to do with the core c++ implementation. But as a python user I wasn't sure.

BewareMyPower commented 2 months ago

Yeah, it's related to the C++ core. And it's a known issue. It might be easy to fix. I will take some time for it when I'm free.

BewareMyPower commented 2 months ago

I just double checked the issue and it looks like a bug with protobuf:

apache/pulsar-client-cpp#1  0x0000ffffb1a58d2c in pulsar::proto::KeyValue::_InternalSerialize(unsigned char*, google::protobuf::io::EpsCopyOutputStream*) const () from /usr/local/lib/python3.10/dist-packages/pulsar_client.libs/libpulsar-8dc83e12.so
apache/pulsar-client-cpp#2  0x0000ffffb1cc50e0 in google::protobuf::internal::WireFormatLite::InternalWriteMessage(int, google::protobuf::MessageLite const&, int, unsigned char*, google::protobuf::io::EpsCopyOutputStream*) ()
   from /usr/local/lib/python3.10/dist-packages/pulsar_client.libs/libpulsar-8dc83e12.so
apache/pulsar-client-cpp#3  0x0000ffffb1a5b718 in pulsar::proto::MessageMetadata::_InternalSerialize(unsigned char*, google::protobuf::io::EpsCopyOutputStream*) const () from /usr/local/lib/python3.10/dist-packages/pulsar_client.libs/libpulsar-8dc83e12.so
apache/pulsar-client-cpp#4  0x0000ffffb1c96478 in google::protobuf::SerializeToArrayImpl(google::protobuf::MessageLite const&, unsigned char*, int) () from /usr/local/lib/python3.10/dist-packages/pulsar_client.libs/libpulsar-8dc83e12.so
apache/pulsar-client-cpp#5  0x0000ffffb1c94e98 in google::protobuf::MessageLite::SerializePartialToArray(void*, int) const () from /usr/local/lib/python3.10/dist-packages/pulsar_client.libs/libpulsar-8dc83e12.so
apache/pulsar-client-cpp#6  0x0000ffffb1c94d50 in google::protobuf::MessageLite::SerializeToArray(void*, int) const () from /usr/local/lib/python3.10/dist-packages/pulsar_client.libs/libpulsar-8dc83e12.so
apache/pulsar-client-cpp#7  0x0000ffffb1920704 in pulsar::Commands::newSend(pulsar::SharedBuffer&, pulsar::proto::BaseCommand&, pulsar::Commands::ChecksumType, pulsar::SendArguments const&) ()
   from /usr/local/lib/python3.10/dist-packages/pulsar_client.libs/libpulsar-8dc83e12.so

I tried the python client on macOS and it does not have this error but encountered the same issue for Ubuntu.

However, I tried the C++ client 3.5.1 on Ubuntu and this error does not happen:

#include <pulsar/Client.h>
using namespace pulsar;

int main() {
    const std::string topic = "test-topic";
    Client client("pulsar://host.docker.internal:6650");
    Producer producer;
    client.createProducer(topic, producer);

    MessageBuilder::StringMap properties;
    for (int i = 0; i < 3000; i++) {
        properties["key" + std::to_string(i)] = "{\"foo\": \"bar\"}";
    }
    auto msg = MessageBuilder().setProperties(properties).setContent("test-message").build();
    producer.send(msg);

    client.close();
}

Increasing the i in loop does not fail with segfault as well. (if the properties are too big, ResultMessageTooLarge will be returned)

It might be an issue from the Python client side. So let me move it to pulsar-client-python first.

BewareMyPower commented 2 months ago

Oh I realized where the issue is, there is a bug with the C++ client when batching is disabled. However, the Python client somehow disables batching by default.

You can manually enable the batching to get the issue around, while I'm going to push a fix at the C++ client.

I also pushed a PR (https://github.com/apache/pulsar-client-python/pull/224) to enable batching by default.

emanueledomingo commented 2 months ago

I confirm that by enabling batching, the bug does not appear. Thanks for the workaround suggestion!