Crash in clcache_skip_change()

abbra commented 3 weeks ago

FreeIPA upstream tests encountered a crash in clcache_skip_change:

Thread 10 (Thread 0x7f84f47ff6c0 (LWP 3880)):
#0  0x00007f883d3b600c in fsync () at /lib64/libc.so.6
#1  0x00007f883cbb2522 in pt_Fsync () at /lib64/libnspr4.so
#2  0x00007f883d59d965 in vslapd_log_error () at /usr/lib64/dirsrv/libslapd.so.0
#3  0x00007f883d59dbc0 in slapd_log_error_proc_internal () at /usr/lib64/dirsrv/libslapd.so.0
#4  0x00007f883d59dfd6 in slapi_log_error () at /usr/lib64/dirsrv/libslapd.so.0
#5  0x00007f8838720078 in clcache_skip_change (buf=0x7f84f3239800) at ldap/servers/plugins/replication/cl5_clcache.c:809
        rid = 4
        i = <optimized out>
        buf_cur_csn_str = "6721fc35000100040000"
        cscb = 0x0
        skip = 1
        cscb = <optimized out>
        rid = <optimized out>
        skip = <optimized out>
        i = <optimized out>
        buf_cur_csn_str = {<optimized out> <repeats 21 times>}
#6  clcache_get_next_change (buf=0x7f84f3239800, key=key@entry=0x7f84f47fdf08, keylen=keylen@entry=0x7f84f47fdef8, data=data@entry=0x7f84f47fdf00, datalen=datalen@entry=0x7f84f47fdef0, csn=csn@entry=0x7f84f47fdf10, initial_starting
        dbi_key = {flags = (DBI_VF_PROTECTED | DBI_VF_DONTGROW), data = 0x7f84f3239c35, size = 21, ulen = 21}
        dbi_data = {flags = (DBI_VF_PROTECTED | DBI_VF_DONTGROW), data = 0x7f84f3239c4a, size = 1069, ulen = 1069}
        skip = 1
        rc = 0
#7  0x00007f883872029d in cl5GetNextOperationToReplay (iterator=0x7f84f32117c0, entry=0x7f84f47fdff0) at ldap/servers/plugins/replication/cl5_api.c:1102
        csn = 0x0
        key = 0x7f84f3239c35 "6721fc35000100040000"
        data = 0x7f84f3239c4a "\006"
        keylen = 140223039823872
        datalen = 140223009923728
        agmt_name = 0x7f88379b0f00 "agmt=\"cn=meTomaster1.ipa.test\" (master1:389)"
        rc = 0
#8  0x00007f883873ee69 in send_updates (prp=0x7f8836b22200, remote_update_vector=<optimized out>, num_changes_sent=0x7f84f47fdfa4) at ldap/servers/plugins/replication/repl5_inc_protocol.c:1698
        replay_crc = <optimized out>
        release_timeout = <optimized out>
        finished = 0
        replica = <optimized out>
        csn_str = "6721fba0000000030000"
        return_value = <optimized out>
        rd = 0x7f84f3215a20
        entry = {op = 0x7f84f47fe000, time = 1730280352}
        op = {operation_type = 0, target_address = {udn = 0x0, uniqueid = 0x0, sdn = 0x0}, csn = 0x0, request_controls = 0x0, p = {p_add = {target_entry = 0x0, parentuniqueid = 0x0}, p_bind = {bind_method = 0, bind_creds = 0x0, bin
        rc = <optimized out>
        changelog_iterator = 0x7f84f32117c0
        message_id = 211
        entry = {op = <optimized out>, time = <optimized out>}
        op = {operation_type = <optimized out>, target_address = {udn = <optimized out>, uniqueid = <optimized out>, sdn = <optimized out>}, csn = <optimized out>, request_controls = <optimized out>, p = {p_add = {target_entry = <o
        return_value = <optimized out>
        rc = <optimized out>
        changelog_iterator = <optimized out>
        message_id = <optimized out>
        rd = <optimized out>
        replay_crc = <optimized out>
        replica = <optimized out>
        release_timeout = <optimized out>
        csn_str = {<optimized out> <repeats 21 times>}
        finished = <optimized out>
        operation = <optimized out>
        error = <optimized out>
        uniqueid = <optimized out>
        replica_id = <optimized out>
        operation = <optimized out>
        error = <optimized out>
        sop = <optimized out>
        current_time = <optimized out>
        final_result = <optimized out>

45732.stacktrace.txt

abbra commented 3 weeks ago

coredump: 45732.core.tar.gz

abbra commented 3 weeks ago

This is Fedora 40, 389-ds-base-3.0.4-3.fc40.x86_64

progier389 commented 2 weeks ago

According to the stack trace no threads are executing some code that generates SIGABRT signal: the only active thread is writing a message in error log (and more exactly: flushing that message to the disk) slapi_log_err(SLAPI_LOG_REPL, buf->buf_agmt_name, "clcache_skip_change - Skipping %s because the consumer with Rid: [%d] is ignored\n", buf_cur_csn_str, rid); and buf->buf_agmt_name and buf_cur_csn_str are both valid strings and there is nor reason that the error message fails.

So at this point, everything looks like ns-slpad process was killed by an external signal (maybe too much memory was consumed on the VM and something decided to kill some processes ?)

On Wed, Oct 30, 2024 at 11:01 AM Alexander Bokovoy @.***> wrote:

This is Fedora 40, 389-ds-base-3.0.4-3.fc40.x86_64

— Reply to this email directly, view it on GitHub https://github.com/389ds/389-ds-base/issues/6389#issuecomment-2446400576, or unsubscribe https://github.com/notifications/unsubscribe-auth/ARLA4LKN55KXEK2Q5W5NEG3Z6CU5RAVCNFSM6AAAAABQ3W2UZCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINBWGQYDANJXGY . You are receiving this because you are subscribed to this thread.Message ID: @.***>

--

389 Directory Server Development Team

Firstyear commented 2 weeks ago

SIGABRT can occur during failed memory allocation, so I would agree with @progier389 that freeipa is simply consuming too much memory.

abbra commented 2 weeks ago

Thank you. It is most likely the case. The VM in question has 6GB RAM and some swap space, we run multiple containers in parallel there and they might cause memory exhaustion. It is a good test bed to imitate high memory pressure in CI tasks, so I'll keep reporting individual issues in future.

I'm closing this one as it appears to be non-actionable here.

Firstyear commented 2 weeks ago

so I'll keep reporting individual issues in future.

I don't see the value in this. If you're running a machine to OOM, there is no bug in 389-ds. We can't allocate memory, when there is no memory to allocate. This is to say nothing of how memory overcommit works and how memory allocations only fail on first write, not on allocate.

If your intent is to "reduce the ram usage of freeipa" then there are more productive ways to go about this without reporting bugs like this.

EDIT: In the interest of clarity, I am saying "do not report bugs that are caused by OOM because you are wasting everyone's collective time".

389ds / 389-ds-base

Crash in clcache_skip_change() #6389

--