A couple of questions: Is the core available to me? If so, I need the beam.smp file and preferably the sha you built from as well.

Relevant emulator arguments:
-swt very_low -sbt u -sbwt none -sbwtdcpu none -sbwtdio none -SDio 128 -JPperf true -emu_type frmptr
The emulator is almost unpatched; only 1 commit from https://github.com/WhatsApp/otp/commits/wa_r24_maint may be relevant (https://github.com/WhatsApp/otp/commit/cfbc8ac8f6109e7cd39e3406508b3e2d929a29b1), but it's practically unused.
Sadly I cannot share the core dump (and it's actually hard to get one, as normally the emulator will write a crash dump, not a core), but I can check out any ideas, including the wildest ones. From crash dumps I can see that there are 5 processes in "suspended" state: one is the process executing persistent_term:put/2 (in the bif return trap), and 4 more are suspended waiting to grab updater_process permissions.
It takes about a week to reproduce, and I managed to narrow the commit window down to Sep 29 - Jan 27.
Ok. If it is possible please run the following in gdb on the core and make it available to me:
p erts_thr_prgr__.current
p *intrnl
set $n = intrnl->managed.no
set $i = 0
while $i < $n
p intrnl->thr[$i].data
set $i++
end
set $i = 0
while $i < erts_no_schedulers
p aligned_sched_sleep_info[$i]
set $i++
end
p aligned_sched_sleep_info[-1]
etp-schedulers
thread apply all bt
You can paste multiple lines into gdb.
Some more questions:

- Have you only seen this on master, or have you seen this in OTP 23 (or other releases) as well?
- erts_thr_progress_unmanaged_delay() and ethr_thr_progress_unmanaged_continue(): no such usage in your modifications?

BTW persistent_term is most likely only a victim of another bug.
I have not tried running it on OTP 23. I could not reproduce this bug with a master branch cut at this commit: https://github.com/erlang/otp/commit/a78b42e2132e1ad13f7b5dbbfbf2f709b16caf61#diff-e43a403ca25b9c36eada96fc58a854b407f71a77433cdbbfda9bafd801b5eadc (although I cannot guarantee it does not exist before that commit).
I ran it on Centos7, x86-64, dual-Skylake CPU, 80 cores in total.
There are no modifications touching unmanaged threads. This commit https://github.com/WhatsApp/otp/commit/c3fefa4b61e44897dc79835ab3d3d32f70e3a37a blocks thread progress and unblocks immediately after counting; I'd expect it to be visible in thread apply all bt, but no such thread exists.
The GARBAGE flag reported by etp-commands turned out to be a red herring, see #4691 (the output below has that fix applied, so the flags are named correctly).
> p erts_thr_prgr__.current
$265 = {counter = 913823633}
> p intrnl->thr[$i].data
All threads report either 913823634 or -1, except scheduler 6 (IX5):
--- Scheduler 6 ---
IX: 5
CPU Binding: unbound
Aux work Flags: delayed-dealloc-thr-prgr later-op canceled-timers-thr-prgr misc-thr-prgr
Sleep Info Flags: sleeping tse waiting
Pointer: (ErtsSchedulerData*)0x7fd1c2e2d400
- Run Queue -
Length: total=0, max=0, high=0, normal=0, low=0, port=0
Misc Jobs: no
Misc Flags: out-of-work halftime-out-of-work
Pointer: (ErtsRunQueue*)0x7fd1c2dd6000
With stack:
#3 0x00007fd22192dfa9 in syscall ()
#4 0x0000000000815ea4 in wait__ (timeout=-1, spincount=0, e=<optimized out>) at pthread/ethr_event.c:152
#5 ethr_event_wait (e=e@entry=0x7fd1d0f40410) at pthread/ethr_event.c:631
#6 0x000000000044dfa7 in erts_tse_wait (ep=0x7fd1d0f40400) at beam/erl_threads.h:2452
#7 thr_prgr_wait (vssi=0x7fd1c2de1a80) at beam/erl_process.c:3062
#8 0x0000000000659336 in block_thread (tpd=<optimized out>, tpd=<optimized out>) at beam/erl_thr_progress.c:1289
#9 0x000000000065acf2 in erts_thr_progress_finalize_wait (tpd=tpd@entry=0x7fd1c2e2d438) at beam/erl_thr_progress.c:915
#10 0x0000000000455ef9 in scheduler_wait (fcalls=fcalls@entry=0x7fd1bec2bbf8, esdp=esdp@entry=0x7fd1c2e2d400, rq=rq@entry=0x7fd1c2dd6000) at beam/erl_thr_progress.h:170
#11 0x0000000000465fbc in erts_schedule (esdp=<optimized out>, p=<optimized out>, calls=<optimized out>) at beam/erl_process.c:9567
#12 0x00007fd1d0f9b2b3 in ?? ()
#13 0x0000000000000000 in ??
> (gdb) p aligned_sched_sleep_info[5]
$270 = {ssi = {esdp = 0x7fd1c2e2d400, next = 0x0, prev = 0x0, flags = {counter = 13}, event = 0x7fd1d0f40400, psi = 0x7fd1dc90f1d0, aux_work = {counter = 1188}}, align = <...>}
> p aligned_sched_sleep_info[-1]
$269 = {ssi = {esdp = 0x0, next = 0x7de72000000001, prev = 0x0, flags = {counter = 13}, event = 0x7fd1d0f44ac0, psi = 0x7fd1dc90f1b0, aux_work = {counter = 4}}, align = <...>}
aux_work_data and ssi for the managed thread (scheduler 6) not progressing:
> p ((ErtsSchedulerData*)0x7fd1c2e2d400)->aux_work_data
$272 = {sched_id = 6, esdp = 0x7fd1c2e2d400, ssi = 0x7fd1c2de1a80, current_thr_prgr = 913823633, latest_wakeup = 913823636, misc = {ix = 0, thr_prgr = 913823635}, dd = {thr_prgr = 913823634}, cncld_tmrs = {thr_prgr = 913823634}, later_op = {thr_prgr = 913823636, size = 65536, first = 0x7fd1c0fab078, last = 0x7f9af7a9c340}, async_ready = {need_thr_prgr = 0, thr_prgr = 909520210, queue = 0x7fd1c13589c0}, delayed_wakeup = {next = 18446744073709551615, sched2jix = 0x7fd1c2de7888, jix = -1, job = 0x7fd1c2de7600}, yield = {alcu_blockscan = {current = 0x0, last = 0x0}, ets_all = {ongoing = 0x0, hfrag = 0x0, tab = 0x0, queue = 0x0}}, debug = {wait_completed = {flags = 0, callback = 0x0, arg = 0x0}}}
(gdb) p *((ErtsSchedulerData*)0x7fd1c2e2d400)->ssi
$274 = {esdp = 0x7fd1c2e2d400, next = 0x0, prev = 0x0, flags = {counter = 13}, event = 0x7fd1d0f40400, psi = 0x7fd1dc90f1d0, aux_work = {counter = 1188}}
(gdb) p intrnl->misc.data
$280 = {lflgs = {counter = -2147483627}, block_count = {counter = 16}, blocker_event = {counter = 0}, pref_wakeup_used = {counter = 1}, managed_count = {counter = 83}, managed_id = {counter = 82}, unmanaged_id = {counter = 208}, chk_next_ix = 0, umrefc_ix = {waiting = -1, current = {counter = 0}}}
Additional information:
Thread 23:
#3 0x00007fd22192dfa9 in syscall ()
#4 0x0000000000815ea4 in wait__ (timeout=-1, spincount=0, e=<optimized out>) at pthread/ethr_event.c:152
#5 ethr_event_wait (e=e@entry=0x7fd1d0f40950) at pthread/ethr_event.c:631
#6 0x000000000044dfa7 in erts_tse_wait (ep=0x7fd1d0f40940) at beam/erl_threads.h:2452
#7 thr_prgr_wait (vssi=0x7fd1c2de1fc0) at beam/erl_process.c:3062
#8 0x0000000000659336 in block_thread (tpd=<optimized out>, tpd=<optimized out>) at beam/erl_thr_progress.c:1289
#9 0x000000000065a857 in leader_update (tpd=tpd@entry=0x7fd1c2f03db8) at beam/erl_thr_progress.c:765
#10 erts_thr_progress_leader_update (tpd=tpd@entry=0x7fd1c2f03db8) at beam/erl_thr_progress.c:857
#11 0x00000000004653a9 in erts_schedule (esdp=<optimized out>, p=<optimized out>, calls=<optimized out>) at beam/erl_thr_progress.h:170
#12 0x00007fd1d0f9b2b3 in ?? ()
It is scheduler 5:
--- Scheduler 5 ---
IX: 4
CPU Binding: unbound
Aux work Flags: delayed-aw-wakeup delayed-dealloc delayed-dealloc-thr-prgr fix-alloc-dealloc fix-alloc-lower-lim later-op canceled-timers canceled-timers-thr-prgr misc-thr-prgr misc mseg-cache-check
Sleep Info Flags:
Pointer: (ErtsSchedulerData*)0x7fd1c2e23080
- Run Queue -
Length: total=0, max=0, high=0, normal=0, low=0, port=0
Misc Jobs: no
Misc Flags: non-empty exec
Pointer: (ErtsRunQueue*)0x7fd1c2dd5e40
Threads 37, 53, 70 and 73 are also in #10 erts_thr_progress_leader_update (tpd=tpd@entry=0x7fd1c2e41b38) at beam/erl_thr_progress.c:857
and several more threads are in #9 0x000000000065acf2 in erts_thr_progress_finalize_wait (tpd=tpd@entry=0x7fd1c2e7f038) at beam/erl_thr_progress.c:915
Interestingly a poll thread (thread 146) is also in this stack:
#2 0x000000000044cf7b in erts_cnd_wait (mtx=0x7fd1b6449200, cnd=0x7fd1b6449228) at beam/erl_threads.h:1804
#3 pt_wait (vbpt=0x7fd1b6449200) at beam/erl_process.c:3215
#4 0x0000000000659336 in block_thread (tpd=<optimized out>, tpd=<optimized out>) at beam/erl_thr_progress.c:1289
#5 0x000000000065acf2 in erts_thr_progress_finalize_wait (tpd=tpd@entry=0x7fd1b6449310) at beam/erl_thr_progress.c:915
#6 0x0000000000790d16 in erts_poll_wait (ps=0x7fd1dc908e80, pr=0x7fd1b6449380, len=0x7fd1b103ae28, tpd=<optimized out>, timeout_time=<optimized out>) at sys/common/erl_poll.c:1975
#7 0x000000000079940d in erts_check_io (psi=psi@entry=0x7fd1dc90f1f0, timeout_time=timeout_time@entry=9223372036854775807, poll_only_thread=poll_only_thread@entry=1) at sys/common/erl_check_io.c:1715
#8 0x000000000044e14a in poll_thread (vbpt=<optimized out>) at beam/erl_process.c:3268
#9 0x000000000081548a in thr_wrapper (vtwd=0x7ffff1a23780) at pthread/ethread.c:122
And the same goes for the aux thread (240):
#2 ethr_event_wait (e=e@entry=0x7fd1d0f44ad0) at pthread/ethr_event.c:631
#3 0x000000000044dfa7 in erts_tse_wait (ep=0x7fd1d0f44ac0) at beam/erl_threads.h:2452
#4 thr_prgr_wait (vssi=0x7fd1c2de1900) at beam/erl_process.c:3062
#5 0x0000000000659336 in block_thread (tpd=<optimized out>, tpd=<optimized out>) at beam/erl_thr_progress.c:1289
#6 0x000000000045593e in aux_thread (unused=<optimized out>) at beam/erl_process.c:3156
#7 0x000000000081548a in thr_wrapper (vtwd=0x7ffff1a23780) at pthread/ethread.c:122
I will try to find a way to share more information. Meanwhile, is there any obvious way to understand what caused scheduler 6 to be blocked from progress?
Also, yes, persistent_term is just a victim:
(gdb) p thr_prog_op
$281 = {later = 913823636, <...>}
It is worth noting that the workload for the node is somewhat disk-IO heavy (hence the 128 dirty I/O schedulers).
> Meanwhile, is there any obvious way to understand what caused scheduler 6 to be blocked from progress?
Scheduler 6 (as well as the other threads you identified) has blocked due to another thread calling erts_thr_progress_block() or erts_thr_progress_fatal_error_block(). Threads sleeping while the system was blocked will get caught in erts_thr_progress_finalize_wait() and block from there if they wake up before the system has been unblocked. Threads awake when the system was blocked will block via erts_thr_progress_leader_update().
Depending on how you produced the core, this may or may not be a bug in this block functionality. When producing a crash dump we block all managed threads like this in order to get a crash dump that is as consistent as possible. If you sent the emulator an ABRT signal after it had begun blocking in preparation for the crash dump, all of the above is more or less just an effect of that. If this OOM is caused by the thread progress functionality, such blocking makes it very hard to find the bug. You instead want to change the ERTS_DUMP_EXIT in erts_alc_fatal_error() to an ERTS_ABORT_EXIT, which will instead cause an abort as quickly as possible. This will make it much easier to track down what happened.
Regardless, it would be useful to have all the information that I requested above. You can send mail to me directly at rickard at erlang org, if it is too much information to put in the comments here.
I found that I was following a wrong lead; the actual cause seems to be unrelated. In all dumps I managed to obtain there is always one normal scheduler thread with this stack:
#3 erts_msgq_remove_leading_recv_markers (c_p=c_p@entry=0x7fd179e44d18) at beam/erl_proc_sig_queue.c:3054
#4 0x00000000004ba5e0 in erts_msgq_set_save_first (c_p=0x7fd179e44d18) at beam/erl_proc_sig_queue.h:1689
#5 remove_message (c_p=0x7fd179e44d18, FCALLS=3997, HTOP=0x7fcd74d15ba8, E=0x7fcd74e240a0, active_code_ix=<optimized out>) at beam/jit/instr_msg.cpp:360
It runs in a tight endless loop, because the first signal in the inner queue is a recv marker which has in_sigq set to 0:
(gdb) p *((ErtsRecvMarker *) c_p->sig_qs.first)
$38 = {sig = {common = {next = 0x7fd175b1e258, specific = {next = 0x7fd171bea9e0, attachment = 0x7fd171bea9e0}, tag = 197680}, msg = {next = 0x7fd175b1e258, data = {heap_frag = 0x7fd171bea9e0, attached = 0x7fd171bea9e0}, m = {197680, 59, 10170482561667, 59}}}, prev_next = 0x7fd179e44e68, pass = 1 '\001', set_save = 0 '\000', in_sigq = 0 '\000', in_msgq = 0 '\000', prev_ix = 0 '\000', next_ix = 0 '\000'}
So this line: https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_proc_sig_queue.c#L3207 and this one: https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_proc_sig_queue.c#L2861 form a loop.
Observation: in all 4 cases I managed to catch, the Erlang process stuck in an infinite loop is calling erlang:memory(total): https://github.com/erlang/otp/blob/master/erts/preloaded/src/erlang.erl#L4166
Process info:
(gdb) etp-process-info c_p
Pid: <0.9798.2>
State: active-sys | sig-in-q | running | active | prq-prio-normal | usr-prio-normal | act-prio-normal
Flags: delay-gc
Current function: erlang:receive_emd/1
I: #Cp<erlang:receive_emd/3+0xbc>
Heap size: 196650
Old-heap size: 196650
Mbuf size: 0
Msgq len: 79 (inner=75, outer=4)
Msgq Flags: on-heap
Parent: <0.9796.2>
Pointer: (Process*)0x7fd179e44d18
Messages in the queue are sent from here: https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_alloc.c#L3245, which looks to be a good candidate for the selective receive optimisation: make_ref followed by a selective receive on the ref.
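Roughly the pattern in question (names and message shapes here are illustrative, not the actual erl_alloc code): a reference is created right before the receive that matches on it, so the runtime can place a receive marker and skip everything that arrived before the ref existed.

%% Illustrative request/reply helper showing the optimisable pattern:
%% make_ref/0 directly followed by a selective receive on the new ref.
call(Pid, Request) ->
    Ref = make_ref(),
    Pid ! {request, self(), Ref, Request},
    receive
        {reply, Ref, Result} -> Result
    end.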
I am also running a build cut on Jan 25 2021, so far with no issues. It could be that not enough time has passed for the issue to manifest, or an indirect indication of a shorter commit window to look at. I will try to get clearance for the requested information, and in the meantime try to understand more about selective receive.
I think (WhatsApp@cfbc8ac) is the prime suspect. I found two bugs in this modification by a quick look.

1. The modification inserts messages directly at the front of the message queue (sig_qs.first and *sig_qs.last). Receive markers are normally removed from the front of the message queue as soon as they appear there, but the main process lock might be released while handling signals, at which point your code might insert a message at the front of the queue while there is a receive marker at the front, which will make the queue inconsistent.
2. erts_msgq_recv_marker_clear(receiver, am_default) is a noop. The atom default was previously used to identify receive markers created by pre OTP 24 compilers, but this has been changed to using an internal reference saved in erts_old_recv_marker_id. This also only clears receive markers created by old compilers. The compiler in OTP 24 will use the actual reference being matched in the messages, so those will still remain in the message queue and/or signal queue.

There have also been fixes made to the new receive marker implementation since it was first merged into master, so you really want to test on a late master, preferably rc-2 or later.
I've had a closer look at this now. The first bug described above can indeed cause this. When removing a leading receive marker that has previously been preceded by a prepended message, the uninitialized pointer to the preceding element (or actually a pointer to the next pointer pointing to the receive marker, prev_next) is used. This will then likely not update sig_qs.first but a next pointer in another element, and then set in_sigq = 0. Since we've not updated sig_qs.first as expected, we've just entered an eternal loop.
I'm closing this, since this does not point to any bugs in OTP.
You can send me a mail if you have any questions regarding the implementation of receive markers and the prepend functionality.
Hmm, just when I posted the above I realized that it is not an explanation, since prev_next of the receive marker should point to sig_qs.first when inserting the prepended message. The effect would instead be a loss of prepended messages. I'm not sure exactly how we came into this scenario, but I'll continue looking for it a bit. However, in the end you either need to fix or remove the prepend functionality.
Long-term, I'm removing it; short-term, I have to update this hack. One thing, however, bothers me: prepend send is not used by any process running on the node going OOM. I will remove the modification completely (to confirm whether it may cause any issues by just staying in the code) and try gathering more information. Sorry for the inconvenience caused!
I've tried to find an explanation for the receive marker entering this state both due to the prepend modification and due to a bug in the receive marker code, but I have not been able to find one. I know I have seen such an eternal loop previously, which I have fixed, but I'm not sure whether the problem was seen in something merged into master or not. As I said earlier, try to use the latest commit possible on master when trying to reproduce this. Reopen this or create a new issue if this should hit again.
Regarding the prepend feature: I've made a prototype implementing this feature (when sending using a local pid) where the sender does not modify the receiver's message queue, but instead sends a prepend-message signal. When the signal is received it will prepend the message to the message queue instead of appending it, as is the case when an ordinary message signal is received. By this, the signal order guarantee of the language is still preserved. The order of the messages in the message queue will of course not reflect the order in which the messages were received. This is, however, a much smaller change of behaviour than violating the signal order guarantee.
This prototype should also scale better and does not have the potential live-lock issue that the try-grab-main-lock-yield approach has. Receive tracing should also work, but I haven't tested that yet... I actually have not tested more than via two basic test cases (in bif_SUITE). More test cases should be written for it. The send_prepend_recv_mark test might not be completely stable on all platforms since it measures time, and it is not deterministic how much of the queue is read in all of the cases in the test.
I can see that a feature like this can be useful, but I haven't made up my mind what I think about it yet. Since it needs to clear receive markers when prepending, the sender can drastically change the workload for the receiver just by prepending a single message, which I currently think is the largest problem with it.
I am very hesitant on the usability of the feature, as it breaks the language guarantee of "two messages sent from one process to another always reach it in the same order". Thank you for prototyping the signal version of it (I should've done it 3 years ago, when R21 was to be released, but signals were changing rapidly so I decided to wait until the API got more stable).
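To illustrate the guarantee in question (plain sends from one process, no prepend involved; the names are made up), the receiver below always sees one before two:

%% Both sends go from the same process to the same destination, so the
%% signal order guarantee makes this always print: got one then two
order_demo() ->
    Receiver = spawn(fun() ->
                         receive First ->
                             receive Second ->
                                 io:format("got ~p then ~p~n", [First, Second])
                             end
                         end
                     end),
    Receiver ! one,
    Receiver ! two.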
Meanwhile, the bug in persistent_term still exists. The emulator aborts when get/0 traps and erase/1 deletes the table that was being iterated over by get/0 (either because the table shrinks, or because it had a non-immediate term erased). Simple repro:
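%% Repro notes: each getter forces persistent_term:get/0 to trap by leaving it
%% only one reduction, while each setter erases and re-inserts its key, forcing
%% the table to be rebuilt/deleted while getters are still iterating over it.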
test() ->
    erts_debug:set_internal_state(available_internal_state, true),
    [spawn_link(fun() -> setter(Seq) end) || Seq <- lists:seq(1, 512)],
    [spawn_link(fun getter/0) || _ <- lists:seq(1, 512)].

getter() ->
    erts_debug:set_internal_state(reds_left, 1),
    _ = persistent_term:get(),
    getter().

setter(Key) ->
    erts_debug:set_internal_state(reds_left, 1),
    persistent_term:erase(Key),
    persistent_term:put(Key, {complex, term}),
    setter(Key).
A debug emulator with +S 2:2 +SDio 1:1 helps to pinpoint the issue. I can come up with a PR unless a fix is already planned for the 'maint' branch (the bug was planted in f1329b378a957).
@max-au Is this fix in persistent_term:get/0 (#4701) the one you had in mind?
I believe so. On the original issue: with prepend send off, a master branch cut on Mar 1 still has the issue. Could 2ccd19b75 be one of the fixes you had in mind? Although there is no tracing happening, so I'd doubt that.
Then it is a bug in the top of master as well. Please run the following in gdb when in the stack frame looping in erts_msgq_remove_leading_recv_markers():
bt
p c_p->sig_qs
p *c_p->sig_qs.recv_mrk_blk
p &c_p->sig_qs.recv_mrk_blk->marker[0]
(gdb) fr 3
#3 erts_msgq_remove_leading_recv_markers (c_p=c_p@entry=0x7f3b1b8f8708) at beam/erl_proc_sig_queue.c:3054
3054 in beam/erl_proc_sig_queue.c
(gdb) p c_p->sig_qs
$2 = {first = 0x7f3b20d92c30, last = 0x7f3b2aa74560, save = 0x7f3b1b8f8858, cont = 0x7f3b48f3d560, cont_last = 0x7f3b1b8f8858, nmsigs = {next = 0x0, last = 0x0}, recv_mrk_blk = 0x0, recv_mrk_uniq = -576460752303423483, len = 73, flags = 2}
(gdb) p *c_p->sig_qs.recv_mrk_blk
Cannot access memory at address 0x0
(gdb) p &c_p->sig_qs.recv_mrk_blk->marker[0]
$3 = (ErtsRecvMarker *) 0x40
That's strange. What does p *((ErtsRecvMarker *) c_p->sig_qs.first) print?
Also p &c_p->sig_qs.first
My apologies, the data are coming from a core dump different from the one from 2 days ago.
(gdb) p *((ErtsRecvMarker *) c_p->sig_qs.first)
$4 = {sig = {common = {next = 0x7f3b48f3d560, specific = {next = 0x7f3b20d92c30, attachment = 0x7f3b20d92c30}, tag = 197680}, msg = {next = 0x7f3b48f3d560, data = {heap_frag = 0x7f3b20d92c30, attached = 0x7f3b20d92c30}, m = {197680, 0, 824633720832, 0}}}, prev_next = 0x7f3b1b8f8858, pass = 1 '\001', set_save = 0 '\000', in_sigq = 0 '\000', in_msgq = 0 '\000', prev_ix = 0 '\000', next_ix = 0 '\000'}
(gdb) p &c_p->sig_qs.first
$5 = (ErtsMessage **) 0x7f3b1b8f8840
I am no longer sure I completely disabled prepend send; maybe there is some automation that I missed. Would it be possible to quickly explain the recent changes in the recv marker logic? How are multiple markers supposed to work?
Previously a pointer in the process structure was used as receive marker. This pointer has been replaced by receive marker elements that are passed through the signal queue into the message queue. The message queue could previously only contain messages, but as of OTP 24 it can also contain receive marker elements.
When creating a new reference that potentially may be used in a selective receive, we mark the end of the current signal queue by moving the outer signal queue into the private signal queue and then inserting a receive marker entry at the end of the private signal queue. This marker is then associated with the reference that potentially will later be used in a selective receive.
When the process later processes the private signal queue (the actual reception of signals happens here) and reaches a receive marker, it moves the receive marker into the message queue and potentially also sets the sig_qs.save pointer if the process has already reached a selective receive that matches on the associated reference.
The receive marker can be reused multiple times while living in the message queue, for example by a recursive function matching on the associated reference in a receive expression multiple times (see the sketch below). When reused, we just look up the marker using the reference and set the sig_qs.save pointer to point to the next field of the marker itself. A receive marker will be removed from the message queue if a receive expression passes over it too many times (i.e., the marker is not used on multiple occasions) or if it reaches the front of the queue (where we have no use for it). When a marker is cleared from a beam instruction, it is just inactivated and will be left in the queue, but the next time a receive expression passes over it (or it reaches the front of the queue) or it needs to be reused, it will be removed. The current implementation allows for a maximum of 8 receive markers per process. These are co-located in a block referred to by c_p->sig_qs.recv_mrk_blk.
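At the Erlang level, the reuse case looks roughly like the following sketch (function and message names are just illustrative): the reference is created once, and the recursive receive keeps matching on it, so the same receive marker in the message queue can be reused on every iteration.

collect_replies(Pid, Request) ->
    MRef = erlang:monitor(process, Pid),   %% the marker becomes associated with MRef here
    Pid ! {Request, self(), MRef},
    collect(MRef, []).

collect(MRef, Acc) ->
    receive
        {MRef, done} ->
            erlang:demonitor(MRef, [flush]),
            lists:reverse(Acc);
        {MRef, Chunk} ->
            collect(MRef, [Chunk | Acc]);
        {'DOWN', MRef, process, _Pid, Reason} ->
            {error, Reason, lists:reverse(Acc)}
    end.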
When removing a receive marker from the message queue, the prev_next field is used to update the next pointer of the preceding message, or the sig_qs.first pointer if the marker is first in the queue. In the core above which you printed information from, the prev_next pointer of the marker should have pointed to the sig_qs.first field, which it did not, so when removing it something else was updated instead of sig_qs.first.
> In the core above which you printed information from, the prev_next pointer of the marker should have pointed to the sig_qs.first field which it did not, so when removing it something else was updated instead of sig_qs.first.
However, very close. It points 3 words before the sig_qs.first field, which is the ftrace field in the process struct, which is very strange... Please print p c_p->ftrace as well.
Sorry, not 3 words before, it is 3 words after, which makes much more sense since this is sig_qs.cont, the beginning of the private signal queue. So please print p c_p->sig_qs.cont instead, just to confirm that this is what happened.
I'm quite confident that #4703 will fix this issue.
When a process_info request signal requesting the message queue length is served, the message queue is adjusted to reflect its actual state in the middle of the signal processing (in all other cases this is done when signal processing is done or yields). If a receive marker happened to be at the front of the private signal queue that was moved into the message queue, its prev_next pointer was not updated and erroneously pointed to sig_qs.cont. If this receive marker ended up first in the message queue, it erroneously modified sig_qs.cont when being removed instead of sig_qs.first, and we ended up in an eternal loop.
(gdb) p c_p->sig_qs.cont
$4 = (ErtsMessage *) 0x7f3b48f3d560
Content of the message:
{#Ref<0.4243994959.3196846152.36253>,3,[{temp_alloc,3,[{mbcs,[{blocks,[{temp_alloc,[{size,<...>}
Which is indeed the first message, directly following the receive marker:
(gdb) p c_p->sig_qs->first->next
$14 = (ErtsMessage *) 0x7f3b48f3d560
(gdb) p c_p->sig_qs.cont
$18 = (ErtsMessage *) 0x7f3b48f3d560
And of course this process is often checked for queue length with process_info(Pid, message_queue_len). It is one of two load regulators, with the other one, erlang:memory(total), generating both selective receive markers and messages in the outer queue.
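For context, the queue-length check is roughly of this shape (the threshold and names are illustrative, not the actual regulator code):

%% Illustrative load-regulation check: poll another process's queue length
%% via process_info/2 and report whether it exceeds a threshold.
overloaded(Pid, MaxLen) ->
    case process_info(Pid, message_queue_len) of
        {message_queue_len, Len} -> Len > MaxLen;
        undefined -> false        %% the process has already exited
    end.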
I am putting the fix into testing, but it will take some time to confirm.
I did not observe any OOMs with #4703 applied, so likely the fix worked.
On an unrelated note, the prepend send prototype (via prepend signal) segfaults during GC:
#5 0x0000000000678ceb in erts_free (ptr=0x7ff393580170, type=850) at beam/erl_alloc.h:281
#6 cleanup_rootset (rootset=0x7ff3702fe610) at beam/erl_gc.c:2631
#7 do_minor (p=p@entry=0x7ff35f814470, live_hf_end=live_hf_end@entry=0xfffffffffffffff8, mature=mature@entry=0x7ff2c1f73028 "", mature_size=mature_size@entry=107760, new_sz=121536,
objv=objv@entry=0x7ff2aefd56f0, nobj=0) at beam/erl_gc.c:1574
#9 garbage_collect (p=p@entry=0x7ff35f814470, live_hf_end=live_hf_end@entry=0xfffffffffffffff8, need=need@entry=0, objv=objv@entry=0x7ff2aefd56f0, nobj=nobj@entry=0, fcalls=fcalls@entry=4000,
max_young_gen_usage=0) at beam/erl_gc.c:740
#10 0x000000000067e8fa in erts_garbage_collect_nobump (p=p@entry=0x7ff35f814470, need=need@entry=0, objv=0x7ff2aefd56f0, nobj=nobj@entry=0, fcalls=4000) at beam/erl_gc.c:882
#11 0x00000000004671e2 in scheduler_gc_proc (reds_left=4000, c_p=0x7ff35f814470) at beam/erl_process.c:9234
#12 erts_schedule (esdp=<optimized out>, p=0x7ff35f814470, calls=<optimized out>) at beam/erl_process.c:9961
(gdb) p p->sig_qs
$1 = {first = 0x0, last = 0x7ff35f8145b0, save = 0x7ff35f8145b0, cont = 0x7ff37374ea00, cont_last = 0x7ff37373f1f0, nmsigs = {next = 0x7ff37379e9e0, last = 0x7ff3737ab940}, recv_mrk_blk = 0x0,
len = 48, flags = 2
(gdb) etp-process-info p
Pid: <0.9823.2>
State: sig-q | active-sys | sig-in-q | garbage-collecting | running | active | prq-prio-normal | usr-prio-normal | act-prio-normal
Flags: trap-exit using-db timo
Current function: unknown
I: #Cp<gen_server:loop/7+0x260>
Heap size: 75113
Old-heap size: 514838
Mbuf size: 12306
Msgq len: 142 (inner=48, outer=94)
This happens when p->sig_qs.nmsigs.next contains a prepend signal. I did not spend enough time to fully understand how the prepend signal copies the message, so I cannot pinpoint the cause yet.
I'll have a look at the prepend signal issue, but I don't think I'll have time for it this week.
Describe the bug
Something (allegedly persistent_term:put/2) causes all normal schedulers to prevent RAM deallocs, which leads to BEAM OOM with an erl_crash.dump "Slogan: eheap_alloc: Cannot allocate 972288 bytes of memory (of type "heap")." All normal schedulers have sleep aux work "DD_THR_PRGR | THR_PRGR_LATER_OP | MISC_THR_PRGR" (plus some more of the usual flags for those active).

To Reproduce
This bug reproduces after several days of uptime. In the crash dump it's clearly visible that one process is returning from persistent_term:put/2 (current function is bif_return_trap). There are more processes in the update_queue (some waiting to execute put/2, and some to execute get/0).

Affected versions
OTP 24RC1. It is likely that the bug was planted after October 2020, as earlier builds (including beamasm) do not seem to exhibit this problem (or it is less likely to trigger).

Additional context
The core dump taken reveals some more details:
- The put being executed is persistent_term:put({atom, another}, 2) - so the key is a tuple of two atoms, the value is an immediate value (small int).
- table_updater (scheduled with erts_schedule_thr_prgr_later_op here: https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_bif_persistent.c#L433), so updater_process is set to that process (which has both "suspended" and "active" flags).
- Aux work Flags: delayed-dealloc delayed-dealloc-thr-prgr fix-alloc-dealloc fix-alloc-lower-lim later-op canceled-timers canceled-timers-thr-prgr async-ready misc set-tmo yield GARBAGE, with run queue flags Misc Flags: inactive non-empty exec.