TeaEngineering / libchronicle

Shared-memory interprocess communication from C/Python using OpenHFT's chronicle-queue protocol
Apache License 2.0

Python API segmentation faults and memory issues #30

Closed: djkelleher closed this issue 1 month ago

djkelleher commented 6 months ago

Thank you for this wonderful library! I love the design and ideas here, but I've been running into errors with segmentation faults when using the Python API.

It seems the reader tailer callback will always cause a segmentation fault with anything other than very small messages. Message types tested were JSON strings (json.dumps() with Python 3 .encode()) and MessagePack binary format; behavior and errors were the same for both. The code used for testing was the provided reader_cb.py, reader.py, and writer.py (modified to write random JSON and MessagePack; see the sketch below).

Python version: Python 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] on linux
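
A minimal sketch of this kind of writer (reusing the libchronicle.Queue/append API from the reproduction script later in this thread; the roll scheme and JSON contents here are placeholders, not the exact test code):

import json
import os
import random
import string
import sys

import libchronicle

path = sys.argv[1]
os.makedirs(path, mode=0o777, exist_ok=True)

def random_value():
    # Placeholder values: either a random lowercase string or a random number
    return random.choice([
        "".join(random.choices(string.ascii_lowercase, k=random.randrange(0, 25))),
        random.randrange(0, 2**30),
        random.random() * 1e9,
    ])

with libchronicle.Queue(path, version=5, create=True, roll_scheme="TEST_SECONDLY") as q:
    while True:
        # Single-letter keys mapped to random values, serialised with json.dumps()
        # and encoded to bytes before appending
        msg = {k: random_value() for k in random.sample(string.ascii_lowercase, 8)}
        q.append(json.dumps(msg).encode())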

Example output (reading JSON strings, debug enabled):

 85109072037565 @0x7ff41724248c data size 92
unsigned char* buf="{"k": 474956569, "f": "rakjljkdjlkfjasfjdskfl", "j": "", "d": 580083342, "s": "", "r": "", "l": 294844953.65275264, "a": "rakjljkdjlkfjasfjdskfl"}"
[85109072037565] b'{"k": 474956569, "f": "rakjljkdjlkfjasfjdskfl", "j": "", "d": 580083342, "s": "", "r": "", "l": 294844953.65275264, "a": "rakjljkdjlkfjasfjdskfl"}'
 85109072037566 @0x7ff417242524 data size 9a
unsigned char* buf="{"s": "rakjljkdjlkfjasfjdskfl", "d": "rakjljkdjlkfjasfjdskfl", "f": "", "l": 1712171841.560547, "k": 1712171841.5605578, "j": "", "r": 448669499.14213234}"
Segmentation fault (core dumped)
shuckc commented 2 months ago

Apologies, I've only just seen that you opened this issue. Was the crash occurring at a particular number of bytes written, or at a particular count of messages? I will add some fuzzing appenders/tailers and see if I can reproduce the crash.

shuckc commented 1 month ago

I tried to reproduce by repeatedly writing random-length payloads of 1-10 kB, interspersed with rollovers and with a follower attached, but could not get it to segfault. I used this (on Ubuntu x86_64):

import libchronicle
import os
import sys
import random
import string
from time import sleep

path = sys.argv[1]
os.makedirs(path, mode=0o777, exist_ok=True)

with libchronicle.Queue(path, version=5, create=True, roll_scheme="TEST_SECONDLY") as q:
    for i in range(int(1e8)):
        N = random.randrange(1, 10000)
        line = "".join(random.choices(string.ascii_uppercase + string.digits, k=N))
        print(q.append(line.rstrip().encode()))
        sleep(random.randrange(1, 100) / 1000.0)

I fixed a bug in the reader: in chronicle_collect we need to poll for potential new cycles announced via the metadata, so the tailer doesn't get stuck waiting across a gap (i.e. where the writer is down for multiple cycles and then resumes). However, this wouldn't cause a segfault.
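
As an illustrative sketch only (not libchronicle's actual internals; read_highest_cycle stands in for reading the highest cycle announced in the queue's directory metadata), the idea is:

import time

def poll_for_next_cycle(read_highest_cycle, current_cycle, timeout=5.0, interval=0.1):
    # A tailer that has drained current_cycle re-checks the metadata for a newer
    # announced cycle instead of blocking on current_cycle + 1 forever, so a
    # writer that was down for several roll periods is picked up when it resumes
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        highest = read_highest_cycle()  # hypothetical metadata read
        if highest > current_cycle:
            return highest  # roll forward, possibly skipping the gap entirely
        time.sleep(interval)
    return None  # nothing new announced; caller can keep polling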

Can you share more info on your setup? What roll cycle, and how many readers/writers do you have? What kind of machine? Could you run the Python code under GDB to get a native stack trace?
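
For example (adjusting the queue path), something like:

$ gdb --args python reader_cb.py /tmp/q7/
(gdb) run
... wait for it to stop with SIGSEGV ...
(gdb) bt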

shuckc commented 1 month ago

You are right: I cannot make reader.py crash, but reader_cb.py crashes easily:

$ python reader_cb.py /tmp/q7/ 
chronicle: detected version v5
shmipc: /tmp/q7/ modcount changed from 0 to 1
shmipc: tailer added index=0 (cycle=0 seqnum=0) cb=0x7ad6c7e08010
shmipc: opening cycle 1724933846 filename /tmp/q7//20240829-121726T.cq4 (highest_cycle 1724933846)
shmipc:  mmap offset 0 size 200000 base=0x7ad6c6e00000 extent=0x7ad6c7000000
Segmentation fault (core dumped)

$ python reader.py /tmp/q7/ 
chronicle: detected version v5
shmipc: /tmp/q7/ modcount changed from 0 to 1
shmipc: tailer added index=0 (cycle=0 seqnum=0) cb=(nil)
shmipc: opening cycle 1724933846 filename /tmp/q7//20240829-121726T.cq4 (highest_cycle 1724933846)
shmipc:  mmap offset 0 size 200000 base=0x774085600000 extent=0x774085800000
[7408534456333500416] b'K2IV12L7KAQINZ3MHNBXIKBDPHGM6XH77SJAR64F...