apache / pulsar-client-python

Apache Pulsar Python client library
https://pulsar.apache.org/
Apache License 2.0
49 stars 38 forks source link

Segmentation fault sometimes happens when running unit tests #203

Open BewareMyPower opened 4 months ago

BewareMyPower commented 4 months ago

See

...
2024-03-12T02:35:31.1396530Z 2024-03-12 02:35:31.139 DEBUG [140455429797440] ClientConnection:886 | [[::1]:52958 -> [::1]:6650] Handling incoming command: SEND_RECEIPT
2024-03-12T02:35:31.1398821Z 2024-03-12 02:35:31.139 DEBUG [140455429797440] ClientConnection:1498 | [[::1]:52958 -> [::1]:6650] Got receipt for producer: 0 -- msg: 3-- message id: (81,3,-1,-1)
2024-03-12T02:35:31.1401720Z 2024-03-12 02:35:31.139 DEBUG [140455429797440] ProducerImpl:940 | [persistent://public/default/my-python-topic-keyshare-sticky-1710210931.0501215, standalone-0-57] Received ack for msg 3
2024-03-12T02:35:31.1404238Z 2024-03-12 02:35:31.139 DEBUG [140455429797440] ClientConnection:861 | [[::1]:52958 -> [::1]:6650] Received a message from the server for consumer: 1
2024-03-12T02:35:31.1405802Z 2024-03-12 02:35:31.139 DEBUG [140455429797440] ConsumerImpl:516 | [persistent://public/default/my-python-topic-keyshare-sticky-1710210931.0501215, my-sub, 1] Received Message -- Size: 7
2024-03-12T02:35:31.1407524Z 2024-03-12 02:35:31.139 DEBUG [140455429797440] ConsumerImpl:564 | [persistent://public/default/my-python-topic-keyshare-sticky-1710210931.0501215, my-sub, 1]  metadata.num_messages_in_batch() = 1
2024-03-12T02:35:31.1410975Z 2024-03-12 02:35:31.139 DEBUG [140455429797440] ConsumerImpl:566 | [persistent://public/default/my-python-topic-keyshare-sticky-1710210931.0501215, my-sub, 1]  metadata.has_num_messages_in_batch() = 0
2024-03-12T02:35:31.1413569Z 2024-03-12 02:35:31.140 DEBUG [140455413016128] RetryableOperation:129 | Run operation get-partition-metadata-persistent://public/default/partitioned_topic_name_test, remaining time: 29900 ms
2024-03-12T02:35:35.1471950Z ....................................../tests/run-unit-tests.sh: line 28:  3956 Segmentation fault      (core dumped) python3 pulsar_test.py
BewareMyPower commented 4 months ago
2024-03-12 02:35:31.040 INFO  [140455429797440] RetryableOperation:114 | Reschedule get-partition-metadata-persistent://public/default/partitioned_topic_name_test for 100 ms, remaining time: 29900 ms
2024-03-12 02:35:31.040 DEBUG [140455429797440] ExecutorService:44 | Event loop of ExecutorService exits successfully
2024-03-12 02:35:31.040 DEBUG [140457878760320] ClientImpl:775 | listenerExecutorProvider_ is closed
2024-03-12 02:35:31.040 DEBUG [140457878760320] ClientImpl:780 | partitionListenerExecutorProvider_ is closed
2024-03-12 02:35:31.041 DEBUG [140457878760320] ProducerImpl:117 | [persistent://public/default/partitioned_topic_name_test-partition-0, standalone-0-53] ~ProducerImpl
2024-03-12 02:35:31.041 INFO  [140457878760320] ProducerImpl:757 | Producer - [persistent://public/default/partitioned_topic_name_test-partition-0, standalone-0-53] , [batching  = off]
2024-03-12 02:35:31.041 DEBUG [140457878760320] ProducerImpl:117 | [persistent://public/default/partitioned_topic_name_test-partition-1, standalone-0-54] ~ProducerImpl
2024-03-12 02:35:31.041 INFO  [140457878760320] ProducerImpl:757 | Producer - [persistent://public/default/partitioned_topic_name_test-partition-1, standalone-0-54] , [batching  = off]
2024-03-12 02:35:31.041 DEBUG [140457878760320] ProducerImpl:117 | [persistent://public/default/partitioned_topic_name_test-partition-2, standalone-0-55] ~ProducerImpl
2024-03-12 02:35:31.041 INFO  [140457878760320] ProducerImpl:757 | Producer - [persistent://public/default/partitioned_topic_name_test-partition-2, standalone-0-55] , [batching  = off]
# another test...
2024-03-12 02:35:31.140 DEBUG [140455413016128] RetryableOperation:129 | Run operation get-partition-metadata-persistent://public/default/partitioned_topic_name_test, remaining time: 29900 ms

It seems that the RetryableOperation is still scheduled after client.close() is called.

BewareMyPower commented 4 months ago

The root cause might be an incorrect implementation of the shutdown of a partitioned producer.

Here are the output when I ran PulsarTest.test_shutdown_client with debug level logs:

2024-03-12 15:09:35.278 DEBUG [0x1e4d42100] ClientImpl:699 | 1 producers and 0 consumers have been shutdown.
2024-03-12 15:09:35.278 DEBUG [0x16df5f000] ConnectionPool:86 | Got connection from pool for pulsar://localhost:6650-0 use_count: 2 @ 0x15ca15e00
2024-03-12 15:09:35.278 INFO  [0x1e4d42100] ClientConnection:1325 | [127.0.0.1:63468 -> 127.0.0.1:6650] Connection disconnected (refCnt: 2)
2024-03-12 15:09:35.278 DEBUG [0x16de47000] ClientConnection:1262 | [127.0.0.1:63468 -> 127.0.0.1:6650]  Ignoring timer cancelled event, code[system:89]
2024-03-12 15:09:35.278 INFO  [0x1e4d42100] HandlerBase:150 | [persistent://public/default/partitioned_topic_name_test-partition-0, standalone-155-58] Schedule reconnection in 0.1 s
2024-03-12 15:09:35.279 INFO  [0x1e4d42100] HandlerBase:150 | [persistent://public/default/partitioned_topic_name_test-partition-1, standalone-155-59] Schedule reconnection in 0.1 s
2024-03-12 15:09:35.279 DEBUG [0x16df5f000] BinaryProtoLookupService:148 | PartitionMetadataLookup failed for persistent://public/default/partitioned_topic_name_test, result NotConnected
2024-03-12 15:09:35.279 INFO  [0x1e4d42100] HandlerBase:150 | [persistent://public/default/partitioned_topic_name_test-partition-2, standalone-155-60] Schedule reconnection in 0.1 s

As you can see the internal producers still scheduled reconnections after shutdown.

After using another topic name in test_shutdown_client, no reconnection happened:

024-03-12 15:20:36.796 DEBUG [0x1e4d42100] ClientImpl:699 | 1 producers and 0 consumers have been shutdown.
2024-03-12 15:20:36.796 INFO  [0x1e4d42100] ClientConnection:1325 | [127.0.0.1:63850 -> 127.0.0.1:6650] Connection disconnected (refCnt: 1)
2024-03-12 15:20:36.796 DEBUG [0x1e4d42100] ClientImpl:705 | ConnectionPool is closed
2024-03-12 15:20:36.796 DEBUG [0x16fd93000] ClientConnection:1262 | [127.0.0.1:63850 -> 127.0.0.1:6650]  Ignoring timer cancelled event, code[system:89]
2024-03-12 15:20:36.796 INFO  [0x16fd93000] ClientConnection:274 | [127.0.0.1:63850 -> 127.0.0.1:6650] Destroyed connection to pulsar://localhost:6650-0
2024-03-12 15:20:36.796 DEBUG [0x16fd93000] ExecutorService:44 | Event loop of ExecutorService exits successfully
2024-03-12 15:20:36.796 DEBUG [0x1e4d42100] ClientImpl:715 | ioExecutorProvider_ is closed
2024-03-12 15:20:36.796 DEBUG [0x1e4d42100] ClientImpl:720 | listenerExecutorProvider_ is closed
2024-03-12 15:20:36.796 DEBUG [0x1e4d42100] ClientImpl:725 | partitionListenerExecutorProvider_ is closed