A couple of questions: Is the core available to me? If so, I need the beam.smp file and preferably the sha you built from as well.

Relevant emulator arguments:
-swt very_low -sbt u -sbwt none -sbwtdcpu none -sbwtdio none -SDio 128 -JPperf true -emu_type frmptr
The emulator is almost unpatched; only 1 commit from https://github.com/WhatsApp/otp/commits/wa_r24_maint may be relevant (https://github.com/WhatsApp/otp/commit/cfbc8ac8f6109e7cd39e3406508b3e2d929a29b1), but it's practically unused.
Sadly I cannot share the core dump (and it's actually hard to get one, as normally the emulator will write a crash dump, not a core), but I can check out any ideas, including the wildest ones. From crash dumps I can see that there are 5 processes in "suspended" state: one is the process executing persistent_term:put/2 (in the bif return trap), and 4 more are suspended waiting to grab updater_process permissions.
It takes about a week to reproduce, and I managed to narrow the commit window down to Sep 29 - Jan 27.
Ok. If it is possible please run the following in gdb on the core and make it available to me:
p erts_thr_prgr__.current
p *intrnl
set $n = intrnl->managed.no
set $i = 0
while $i < $n
p intrnl->thr[$i].data
set $i++
end
set $i = 0
while $i < erts_no_schedulers
p aligned_sched_sleep_info[$i]
set $i++
end
p aligned_sched_sleep_info[-1]
etp-schedulers
thread apply all bt
You can paste multiple lines into gdb.
Some more questions:

- Have you only seen this on master, or have you seen this in OTP 23 (or other releases) as well?
- erts_thr_progress_unmanaged_delay() and ethr_thr_progress_unmanaged_continue(): no such usage in your modifications?

BTW persistent_term is most likely only a victim of another bug.
I have not tried running it on OTP 23. I could not reproduce this bug with a master branch cut at this commit: https://github.com/erlang/otp/commit/a78b42e2132e1ad13f7b5dbbfbf2f709b16caf61#diff-e43a403ca25b9c36eada96fc58a854b407f71a77433cdbbfda9bafd801b5eadc (although I cannot guarantee it does not exist before that commit).
I ran it on Centos7, x86-64, dual-Skylake CPU, 80 cores in total.
There are no modifications touching unmanaged threads. This commit https://github.com/WhatsApp/otp/commit/c3fefa4b61e44897dc79835ab3d3d32f70e3a37a blocks thread progress and unblocks immediately after counting; I'd expect it to be visible in thread apply all bt, but no such thread exists.
The GARBAGE flag reported by etp-commands turned out to be a red herring, see #4691 (the output below has that fix applied, so the flags are named correctly).
> p erts_thr_prgr__.current
$265 = {counter = 913823633}
> p intrnl->thr[$i].data
All threads report either 913823634 or -1, except scheduler 6 (IX5):
--- Scheduler 6 ---
IX: 5
CPU Binding: unbound
Aux work Flags: delayed-dealloc-thr-prgr later-op canceled-timers-thr-prgr misc-thr-prgr
Sleep Info Flags: sleeping tse waiting
Pointer: (ErtsSchedulerData*)0x7fd1c2e2d400
- Run Queue -
Length: total=0, max=0, high=0, normal=0, low=0, port=0
Misc Jobs: no
Misc Flags: out-of-work halftime-out-of-work
Pointer: (ErtsRunQueue*)0x7fd1c2dd6000
With stack:
#3 0x00007fd22192dfa9 in syscall ()
#4 0x0000000000815ea4 in wait__ (timeout=-1, spincount=0, e=<optimized out>) at pthread/ethr_event.c:152
#5 ethr_event_wait (e=e@entry=0x7fd1d0f40410) at pthread/ethr_event.c:631
#6 0x000000000044dfa7 in erts_tse_wait (ep=0x7fd1d0f40400) at beam/erl_threads.h:2452
#7 thr_prgr_wait (vssi=0x7fd1c2de1a80) at beam/erl_process.c:3062
#8 0x0000000000659336 in block_thread (tpd=<optimized out>, tpd=<optimized out>) at beam/erl_thr_progress.c:1289
#9 0x000000000065acf2 in erts_thr_progress_finalize_wait (tpd=tpd@entry=0x7fd1c2e2d438) at beam/erl_thr_progress.c:915
#10 0x0000000000455ef9 in scheduler_wait (fcalls=fcalls@entry=0x7fd1bec2bbf8, esdp=esdp@entry=0x7fd1c2e2d400, rq=rq@entry=0x7fd1c2dd6000) at beam/erl_thr_progress.h:170
#11 0x0000000000465fbc in erts_schedule (esdp=<optimized out>, p=<optimized out>, calls=<optimized out>) at beam/erl_process.c:9567
#12 0x00007fd1d0f9b2b3 in ?? ()
#13 0x0000000000000000 in ??
> (gdb) p aligned_sched_sleep_info[5]
$270 = {ssi = {esdp = 0x7fd1c2e2d400, next = 0x0, prev = 0x0, flags = {counter = 13}, event = 0x7fd1d0f40400, psi = 0x7fd1dc90f1d0, aux_work = {counter = 1188}}, align = <...>}
> p aligned_sched_sleep_info[-1]
$269 = {ssi = {esdp = 0x0, next = 0x7de72000000001, prev = 0x0, flags = {counter = 13}, event = 0x7fd1d0f44ac0, psi = 0x7fd1dc90f1b0, aux_work = {counter = 4}}, align = <...>}
aux_work_data and ssi for the managed thread (scheduler 6) not progressing:
> p ((ErtsSchedulerData*)0x7fd1c2e2d400)->aux_work_data
$272 = {sched_id = 6, esdp = 0x7fd1c2e2d400, ssi = 0x7fd1c2de1a80, current_thr_prgr = 913823633, latest_wakeup = 913823636, misc = {ix = 0, thr_prgr = 913823635}, dd = {thr_prgr = 913823634}, cncld_tmrs = {thr_prgr = 913823634}, later_op = {thr_prgr = 913823636, size = 65536, first = 0x7fd1c0fab078, last = 0x7f9af7a9c340}, async_ready = {need_thr_prgr = 0, thr_prgr = 909520210, queue = 0x7fd1c13589c0}, delayed_wakeup = {next = 18446744073709551615, sched2jix = 0x7fd1c2de7888, jix = -1, job = 0x7fd1c2de7600}, yield = {alcu_blockscan = {current = 0x0, last = 0x0}, ets_all = {ongoing = 0x0, hfrag = 0x0, tab = 0x0, queue = 0x0}}, debug = {wait_completed = {flags = 0, callback = 0x0, arg = 0x0}}}
(gdb) p *((ErtsSchedulerData*)0x7fd1c2e2d400)->ssi
$274 = {esdp = 0x7fd1c2e2d400, next = 0x0, prev = 0x0, flags = {counter = 13}, event = 0x7fd1d0f40400, psi = 0x7fd1dc90f1d0, aux_work = {counter = 1188}}
(gdb) p intrnl->misc.data
$280 = {lflgs = {counter = -2147483627}, block_count = {counter = 16}, blocker_event = {counter = 0}, pref_wakeup_used = {counter = 1}, managed_count = {counter = 83}, managed_id = {counter = 82}, unmanaged_id = {counter = 208}, chk_next_ix = 0, umrefc_ix = {waiting = -1, current = {counter = 0}}}
Additional information:
Thread 23:
#3 0x00007fd22192dfa9 in syscall ()
#4 0x0000000000815ea4 in wait__ (timeout=-1, spincount=0, e=<optimized out>) at pthread/ethr_event.c:152
#5 ethr_event_wait (e=e@entry=0x7fd1d0f40950) at pthread/ethr_event.c:631
#6 0x000000000044dfa7 in erts_tse_wait (ep=0x7fd1d0f40940) at beam/erl_threads.h:2452
#7 thr_prgr_wait (vssi=0x7fd1c2de1fc0) at beam/erl_process.c:3062
#8 0x0000000000659336 in block_thread (tpd=<optimized out>, tpd=<optimized out>) at beam/erl_thr_progress.c:1289
#9 0x000000000065a857 in leader_update (tpd=tpd@entry=0x7fd1c2f03db8) at beam/erl_thr_progress.c:765
#10 erts_thr_progress_leader_update (tpd=tpd@entry=0x7fd1c2f03db8) at beam/erl_thr_progress.c:857
#11 0x00000000004653a9 in erts_schedule (esdp=<optimized out>, p=<optimized out>, calls=<optimized out>) at beam/erl_thr_progress.h:170
#12 0x00007fd1d0f9b2b3 in ?? ()
It is scheduler 5:
--- Scheduler 5 ---
IX: 4
CPU Binding: unbound
Aux work Flags: delayed-aw-wakeup delayed-dealloc delayed-dealloc-thr-prgr fix-alloc-dealloc fix-alloc-lower-lim later-op canceled-timers canceled-timers-thr-prgr misc-thr-prgr misc mseg-cache-check
Sleep Info Flags:
Pointer: (ErtsSchedulerData*)0x7fd1c2e23080
- Run Queue -
Length: total=0, max=0, high=0, normal=0, low=0, port=0
Misc Jobs: no
Misc Flags: non-empty exec
Pointer: (ErtsRunQueue*)0x7fd1c2dd5e40
Threads 37, 53, 70 and 73 are also in #10 erts_thr_progress_leader_update (tpd=tpd@entry=0x7fd1c2e41b38) at beam/erl_thr_progress.c:857
and several more threads are in #9 0x000000000065acf2 in erts_thr_progress_finalize_wait (tpd=tpd@entry=0x7fd1c2e7f038) at beam/erl_thr_progress.c:915
Interestingly a poll thread (thread 146) is also in this stack:
#2 0x000000000044cf7b in erts_cnd_wait (mtx=0x7fd1b6449200, cnd=0x7fd1b6449228) at beam/erl_threads.h:1804
#3 pt_wait (vbpt=0x7fd1b6449200) at beam/erl_process.c:3215
#4 0x0000000000659336 in block_thread (tpd=<optimized out>, tpd=<optimized out>) at beam/erl_thr_progress.c:1289
#5 0x000000000065acf2 in erts_thr_progress_finalize_wait (tpd=tpd@entry=0x7fd1b6449310) at beam/erl_thr_progress.c:915
#6 0x0000000000790d16 in erts_poll_wait (ps=0x7fd1dc908e80, pr=0x7fd1b6449380, len=0x7fd1b103ae28, tpd=<optimized out>, timeout_time=<optimized out>) at sys/common/erl_poll.c:1975
#7 0x000000000079940d in erts_check_io (psi=psi@entry=0x7fd1dc90f1f0, timeout_time=timeout_time@entry=9223372036854775807, poll_only_thread=poll_only_thread@entry=1) at sys/common/erl_check_io.c:1715
#8 0x000000000044e14a in poll_thread (vbpt=<optimized out>) at beam/erl_process.c:3268
#9 0x000000000081548a in thr_wrapper (vtwd=0x7ffff1a23780) at pthread/ethread.c:122
And the same goes for the aux thread (240):
#2 ethr_event_wait (e=e@entry=0x7fd1d0f44ad0) at pthread/ethr_event.c:631
#3 0x000000000044dfa7 in erts_tse_wait (ep=0x7fd1d0f44ac0) at beam/erl_threads.h:2452
#4 thr_prgr_wait (vssi=0x7fd1c2de1900) at beam/erl_process.c:3062
#5 0x0000000000659336 in block_thread (tpd=<optimized out>, tpd=<optimized out>) at beam/erl_thr_progress.c:1289
#6 0x000000000045593e in aux_thread (unused=<optimized out>) at beam/erl_process.c:3156
#7 0x000000000081548a in thr_wrapper (vtwd=0x7ffff1a23780) at pthread/ethread.c:122
I will try to find a way to share more information. Meanwhile, is there any obvious way to understand what caused scheduler 6 to be blocked from progress?
Also, yes, persistent_term is just a victim:
(gdb) p thr_prog_op
$281 = {later = 913823636, <...>}
It is worth noting that the workload for the node is somewhat disk-IO heavy (hence the 128 dirty I/O schedulers).
> Meanwhile, is there any obvious way to understand what caused scheduler 6 to be blocked from progress?
Scheduler 6 (as well as the other threads you identified) has blocked due to another thread calling erts_thr_progress_block() or erts_thr_progress_fatal_error_block(). Threads sleeping while the system was blocked will get caught in erts_thr_progress_finalize_wait() and block from there if they wake up before the system has been unblocked. Threads awake when the system was blocked will block via erts_thr_progress_leader_update().
Depending on how you produced the core, this may or may not be a bug in this block functionality. When producing a crash dump we block all managed threads like this in order to get a crash dump that is as consistent as possible. If you sent the emulator an ABRT signal after it had begun blocking in preparation for the crash dump, all of the above is more or less just an effect of that. If this OOM is caused by the thread progress functionality, such blocking makes it very hard to find the bug. You instead want to change the ERTS_DUMP_EXIT in erts_alc_fatal_error() to an ERTS_ABORT_EXIT, which will instead cause an abort as quickly as possible. This will make it much easier to track down what happened.
Regardless, it would be useful to have all the information that I requested above. You can send mail to me directly at rickard at erlang org, if it is too much information to put in the comments here.
I found that I was following a wrong lead; the actual cause seems to be unrelated. In all dumps I managed to obtain there is always one normal scheduler thread with this stack:
#3 erts_msgq_remove_leading_recv_markers (c_p=c_p@entry=0x7fd179e44d18) at beam/erl_proc_sig_queue.c:3054
#4 0x00000000004ba5e0 in erts_msgq_set_save_first (c_p=0x7fd179e44d18) at beam/erl_proc_sig_queue.h:1689
#5 remove_message (c_p=0x7fd179e44d18, FCALLS=3997, HTOP=0x7fcd74d15ba8, E=0x7fcd74e240a0, active_code_ix=<optimized out>) at beam/jit/instr_msg.cpp:360
It runs in a tight endless loop, because the first signal in the inner queue is a recv marker which has in_sigq set to 0:
(gdb) p *((ErtsRecvMarker *) c_p->sig_qs.first)
$38 = {sig = {common = {next = 0x7fd175b1e258, specific = {next = 0x7fd171bea9e0, attachment = 0x7fd171bea9e0}, tag = 197680}, msg = {next = 0x7fd175b1e258, data = {heap_frag = 0x7fd171bea9e0, attached = 0x7fd171bea9e0}, m = {197680, 59, 10170482561667, 59}}}, prev_next = 0x7fd179e44e68, pass = 1 '\001', set_save = 0 '\000', in_sigq = 0 '\000', in_msgq = 0 '\000', prev_ix = 0 '\000', next_ix = 0 '\000'}
So this line: https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_proc_sig_queue.c#L3207 and this one: https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_proc_sig_queue.c#L2861 form a loop.
Observation: in all 4 cases I managed to catch, the Erlang process stuck in an infinite loop is calling erlang:memory(total): https://github.com/erlang/otp/blob/master/erts/preloaded/src/erlang.erl#L4166
Process info:
(gdb) etp-process-info c_p
Pid: <0.9798.2>
State: active-sys | sig-in-q | running | active | prq-prio-normal | usr-prio-normal | act-prio-normal
Flags: delay-gc
Current function: erlang:receive_emd/1
I: #Cp<erlang:receive_emd/3+0xbc>
Heap size: 196650
Old-heap size: 196650
Mbuf size: 0
Msgq len: 79 (inner=75, outer=4)
Msgq Flags: on-heap
Parent: <0.9796.2>
Pointer: (Process*)0x7fd179e44d18
Messages in the queue are sent from here: https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_alloc.c#L3245, which looks to be a good candidate for the selective receive optimisation: make_ref followed by a selective receive on the ref.
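Roughly the pattern in question (names and message shapes here are illustrative, not the actual erl_alloc code): a reference is created right before the receive that matches on it, so the runtime can place a receive marker and skip everything that arrived before the ref existed.

%% Illustrative request/reply helper showing the optimisable pattern:
%% make_ref/0 directly followed by a selective receive on the new ref.
call(Pid, Request) ->
    Ref = make_ref(),
    Pid ! {request, self(), Ref, Request},
    receive
        {reply, Ref, Result} -> Result
    end.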
I am also running a build cut on Jan 25 2021, so far with no issues. It could be that not enough time has passed for the issue to manifest, or an indirect indication of a shorter commit window to look at. I will try to get clearance for the requested information, and in the meantime try to understand more about selective receive.
I think (WhatsApp@cfbc8ac) is the prime suspect. I found two bugs in this modification by a quick look.

1. The modification inserts messages directly at the front of the message queue (sig_qs.first and *sig_qs.last). Receive markers are normally removed from the front of the message queue as soon as they appear there, but the main process lock might be released while handling signals, at which point your code might insert a message at the front of the queue while there is a receive marker at the front, which will make the queue inconsistent.
2. erts_msgq_recv_marker_clear(receiver, am_default) is a noop. The atom default was previously used to identify receive markers created by pre OTP 24 compilers, but this has been changed to using an internal reference saved in erts_old_recv_marker_id. This also only clears receive markers created by old compilers. The compiler in OTP 24 will use the actual reference being matched in the messages, so those will still remain in the message queue and/or signal queue.

There have also been fixes made to the new receive marker implementation since it was first merged into master, so you really want to test on a late master, preferably rc-2 or later.
I've had a closer look at this now. The first bug described above can indeed cause this. When removing a leading receive marker that has previously been preceded by a prepended message, the uninitialized pointer to the preceding element (or actually a pointer to the next pointer pointing to the receive marker, prev_next) is used. This will then likely not update sig_qs.first but a next pointer in another element, and then set in_sigq = 0. Since we've not updated sig_qs.first as expected, we've just entered an eternal loop.
I'm closing this, since this does not point to any bugs in OTP.
You can send me a mail if you have any questions regarding the implementation of receive markers and the prepend functionality.
Hmm, just when I posted the above I realized that it is not an explanation, since prev_next of the receive marker should point to sig_qs.first when inserting the prepended message. The effect would instead be a loss of prepended messages. I'm not sure exactly how we came into this scenario, but I'll continue looking for it a bit. However, in the end you either need to fix or remove the prepend functionality.
Long-term, I'm removing it; short-term, I have to update this hack. One thing, however, bothers me: prepend send is not used by any process running on the node going OOM. I will remove the modification completely (to confirm whether it may cause any issues by just staying in the code) and try gathering more information. Sorry for the inconvenience caused!
I've tried to find an explanation for the receive marker entering this state both due to the prepend modification and due to a bug in the receive marker code, but I have not been able to find one. I know I have seen such an eternal loop previously, which I have fixed, but I'm not sure whether the problem was seen in something merged into master or not. As I said earlier, try to use the latest commit possible on master when trying to reproduce this. Reopen this or create a new issue if this should hit again.
Regarding the prepend feature: I've made a prototype implementing this feature (when sending using a local pid) where the sender does not modify the receiver's message queue, but instead sends a prepend-message signal. When the signal is received it will prepend the message to the message queue instead of appending it, as is the case when an ordinary message signal is received. By this, the signal order guarantee of the language is still preserved. The order of the messages in the message queue will of course not reflect the order in which the messages were received. This is, however, a much smaller change of behaviour than violating the signal order guarantee.
This prototype should also scale better and does not have the potential live-lock issue that the try-grab-main-lock-yield approach has. Receive tracing should also work, but I haven't tested that yet... I actually have not tested more than via two basic test cases (in bif_SUITE). More test cases should be written for it. The send_prepend_recv_mark test might not be completely stable on all platforms since it measures time, and it is not deterministic how much of the queue is read in all of the cases in the test.
I can see that a feature like this can be useful, but I haven't made up my mind what I think about it yet. Since it needs to clear receive markers when prepending, the sender can drastically change the workload for the receiver just by prepending a single message, which I currently think is the largest problem with it.
I am very hesitant on the usability of the feature, as it breaks the language guarantee of "two messages sent from one process to another always reach it in the same order". Thank you for prototyping the signal version of it (I should've done it 3 years ago, when R21 was to be released, but signals were changing rapidly so I decided to wait until the API got more stable).
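To illustrate the guarantee in question (plain sends from one process, no prepend involved; the names are made up), the receiver below always sees one before two:

%% Both sends go from the same process to the same destination, so the
%% signal order guarantee makes this always print: got one then two
order_demo() ->
    Receiver = spawn(fun() ->
                         receive First ->
                             receive Second ->
                                 io:format("got ~p then ~p~n", [First, Second])
                             end
                         end
                     end),
    Receiver ! one,
    Receiver ! two.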
Meanwhile, the bug in persistent_term still exists. The emulator aborts when get/0 traps and erase/1 deletes the table that was being iterated over by get/0 (either because the table shrinks, or because it had a non-immediate term erased). Simple repro:
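%% Repro notes: each getter forces persistent_term:get/0 to trap by leaving it
%% only one reduction, while each setter erases and re-inserts its key, forcing
%% the table to be rebuilt/deleted while getters are still iterating over it.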
test() ->
    erts_debug:set_internal_state(available_internal_state, true),
    [spawn_link(fun() -> setter(Seq) end) || Seq <- lists:seq(1, 512)],
    [spawn_link(fun getter/0) || _ <- lists:seq(1, 512)].

getter() ->
    erts_debug:set_internal_state(reds_left, 1),
    _ = persistent_term:get(),
    getter().

setter(Key) ->
    erts_debug:set_internal_state(reds_left, 1),
    persistent_term:erase(Key),
    persistent_term:put(Key, {complex, term}),
    setter(Key).
A debug emulator with +S 2:2 +SDio 1:1 helps to pinpoint the issue. I can come up with a PR unless a fix is already planned for the 'maint' branch (the bug was planted in f1329b378a957).
@max-au Is this fix in persistent_term:get/0 (#4701) the one you had in mind?
I believe so. On the original issue: with prepend send off, a master branch cut on Mar 1 still has the issue. Could 2ccd19b75 be one of the fixes you had in mind? Although there is no tracing happening, so I'd doubt that.
Then it is a bug in the top of master as well. Please run the following in gdb when in the stack frame looping in erts_msgq_remove_leading_recv_markers():
bt
p c_p->sig_qs
p *c_p->sig_qs.recv_mrk_blk
p &c_p->sig_qs.recv_mrk_blk->marker[0]
(gdb) fr 3
#3 erts_msgq_remove_leading_recv_markers (c_p=c_p@entry=0x7f3b1b8f8708) at beam/erl_proc_sig_queue.c:3054
3054 in beam/erl_proc_sig_queue.c
(gdb) p c_p->sig_qs
$2 = {first = 0x7f3b20d92c30, last = 0x7f3b2aa74560, save = 0x7f3b1b8f8858, cont = 0x7f3b48f3d560, cont_last = 0x7f3b1b8f8858, nmsigs = {next = 0x0, last = 0x0}, recv_mrk_blk = 0x0, recv_mrk_uniq = -576460752303423483, len = 73, flags = 2}
(gdb) p *c_p->sig_qs.recv_mrk_blk
Cannot access memory at address 0x0
(gdb) p &c_p->sig_qs.recv_mrk_blk->marker[0]
$3 = (ErtsRecvMarker *) 0x40
That's strange. What does p *((ErtsRecvMarker *) c_p->sig_qs.first) print?
Also p &c_p->sig_qs.first
My apologies, the data are coming from a core dump different from the one from 2 days ago.
(gdb) p *((ErtsRecvMarker *) c_p->sig_qs.first)
$4 = {sig = {common = {next = 0x7f3b48f3d560, specific = {next = 0x7f3b20d92c30, attachment = 0x7f3b20d92c30}, tag = 197680}, msg = {next = 0x7f3b48f3d560, data = {heap_frag = 0x7f3b20d92c30, attached = 0x7f3b20d92c30}, m = {197680, 0, 824633720832, 0}}}, prev_next = 0x7f3b1b8f8858, pass = 1 '\001', set_save = 0 '\000', in_sigq = 0 '\000', in_msgq = 0 '\000', prev_ix = 0 '\000', next_ix = 0 '\000'}
(gdb) p &c_p->sig_qs.first
$5 = (ErtsMessage **) 0x7f3b1b8f8840
I am no longer sure I completely disabled prepend send; maybe there is some automation that I missed. Would it be possible to quickly explain the recent changes in the recv marker logic? How are multiple markers supposed to work?
Previously a pointer in the process structure was used as receive marker. This pointer has been replaced by receive marker elements that are passed through the signal queue into the message queue. The message queue could previously only contain messages, but as of OTP 24 it can also contain receive marker elements.
When creating a new reference that potentially may be used in a selective receive, we mark the end of the current signal queue by moving the outer signal queue into the private signal queue and then inserting a receive marker entry at the end of the private signal queue. This marker is then associated with the reference that potentially will later be used in a selective receive.
When the process later processes the private signal queue (the actual reception of signals happens here) and reaches a receive marker, it moves the receive marker into the message queue and potentially also sets the sig_qs.save pointer if the process has already reached a selective receive that matches on the associated reference.
The receive marker can be reused multiple times while living in the message queue, for example by a recursive function matching on the associated reference in a receive expression multiple times (see the sketch below). When reused, we just look up the marker using the reference and set the sig_qs.save pointer to point to the next field of the marker itself. A receive marker will be removed from the message queue if a receive expression passes over it too many times (i.e., the marker is not used on multiple occasions) or if it reaches the front of the queue (where we have no use for it). When a marker is cleared from a beam instruction, it is just inactivated and will be left in the queue, but the next time a receive expression passes over it (or it reaches the front of the queue) or it needs to be reused, it will be removed. The current implementation allows for a maximum of 8 receive markers per process. These are co-located in a block referred to by c_p->sig_qs.recv_mrk_blk.
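At the Erlang level, the reuse case looks roughly like the following sketch (function and message names are just illustrative): the reference is created once, and the recursive receive keeps matching on it, so the same receive marker in the message queue can be reused on every iteration.

collect_replies(Pid, Request) ->
    MRef = erlang:monitor(process, Pid),   %% the marker becomes associated with MRef here
    Pid ! {Request, self(), MRef},
    collect(MRef, []).

collect(MRef, Acc) ->
    receive
        {MRef, done} ->
            erlang:demonitor(MRef, [flush]),
            lists:reverse(Acc);
        {MRef, Chunk} ->
            collect(MRef, [Chunk | Acc]);
        {'DOWN', MRef, process, _Pid, Reason} ->
            {error, Reason, lists:reverse(Acc)}
    end.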
When removing a receive marker from the message queue, the prev_next field is used to update the next pointer of the preceding message, or the sig_qs.first pointer if the marker is first in the queue. In the core above which you printed information from, the prev_next pointer of the marker should have pointed to the sig_qs.first field, which it did not, so when removing it something else was updated instead of sig_qs.first.
> In the core above which you printed information from, the prev_next pointer of the marker should have pointed to the sig_qs.first field which it did not, so when removing it something else was updated instead of sig_qs.first.
However, very close. It points 3 words before the sig_qs.first field, which is the ftrace field in the process struct, which is very strange... Please print p c_p->ftrace as well.
Sorry, not 3 words before, it is 3 words after, which makes much more sense since this is sig_qs.cont, the beginning of the private signal queue. So please print p c_p->sig_qs.cont instead, just to confirm that this is what happened.
I'm quite confident that #4703 will fix this issue.
When a process_info request signal requesting the message queue length is served, the message queue is adjusted to reflect its actual state in the middle of the signal processing (in all other cases this is done when signal processing is done or yields). If a receive marker happened to be at the front of the private signal queue that was moved into the message queue, its prev_next pointer was not updated and erroneously pointed to sig_qs.cont. If this receive marker ended up first in the message queue, it erroneously modified sig_qs.cont when being removed instead of sig_qs.first, and we ended up in an eternal loop.
(gdb) p c_p->sig_qs.cont
$4 = (ErtsMessage *) 0x7f3b48f3d560
Content of the message:
{#Ref<0.4243994959.3196846152.36253>,3,[{temp_alloc,3,[{mbcs,[{blocks,[{temp_alloc,[{size,<...>}
Which is indeed the first message, directly following the receive marker:
(gdb) p c_p->sig_qs->first->next
$14 = (ErtsMessage *) 0x7f3b48f3d560
(gdb) p c_p->sig_qs.cont
$18 = (ErtsMessage *) 0x7f3b48f3d560
And of course this process is often checked for queue length with process_info(Pid, message_queue_len). It is one of two load regulators, with the other one, erlang:memory(total), generating both selective receive markers and messages in the outer queue.
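For context, the queue-length check is roughly of this shape (the threshold and names are illustrative, not the actual regulator code):

%% Illustrative load-regulation check: poll another process's queue length
%% via process_info/2 and report whether it exceeds a threshold.
overloaded(Pid, MaxLen) ->
    case process_info(Pid, message_queue_len) of
        {message_queue_len, Len} -> Len > MaxLen;
        undefined -> false        %% the process has already exited
    end.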
I am putting the fix into testing, but it will take some time to confirm.
I did not observe any OOMs with #4703 applied, so likely the fix worked.
On an unrelated note, the prepend send prototype (via prepend signal) segfaults during GC:
#5 0x0000000000678ceb in erts_free (ptr=0x7ff393580170, type=850) at beam/erl_alloc.h:281
#6 cleanup_rootset (rootset=0x7ff3702fe610) at beam/erl_gc.c:2631
#7 do_minor (p=p@entry=0x7ff35f814470, live_hf_end=live_hf_end@entry=0xfffffffffffffff8, mature=mature@entry=0x7ff2c1f73028 "", mature_size=mature_size@entry=107760, new_sz=121536,
objv=objv@entry=0x7ff2aefd56f0, nobj=0) at beam/erl_gc.c:1574
#9 garbage_collect (p=p@entry=0x7ff35f814470, live_hf_end=live_hf_end@entry=0xfffffffffffffff8, need=need@entry=0, objv=objv@entry=0x7ff2aefd56f0, nobj=nobj@entry=0, fcalls=fcalls@entry=4000,
max_young_gen_usage=0) at beam/erl_gc.c:740
#10 0x000000000067e8fa in erts_garbage_collect_nobump (p=p@entry=0x7ff35f814470, need=need@entry=0, objv=0x7ff2aefd56f0, nobj=nobj@entry=0, fcalls=4000) at beam/erl_gc.c:882
#11 0x00000000004671e2 in scheduler_gc_proc (reds_left=4000, c_p=0x7ff35f814470) at beam/erl_process.c:9234
#12 erts_schedule (esdp=<optimized out>, p=0x7ff35f814470, calls=<optimized out>) at beam/erl_process.c:9961
(gdb) p p->sig_qs
$1 = {first = 0x0, last = 0x7ff35f8145b0, save = 0x7ff35f8145b0, cont = 0x7ff37374ea00, cont_last = 0x7ff37373f1f0, nmsigs = {next = 0x7ff37379e9e0, last = 0x7ff3737ab940}, recv_mrk_blk = 0x0,
len = 48, flags = 2
(gdb) etp-process-info p
Pid: <0.9823.2>
State: sig-q | active-sys | sig-in-q | garbage-collecting | running | active | prq-prio-normal | usr-prio-normal | act-prio-normal
Flags: trap-exit using-db timo
Current function: unknown
I: #Cp<gen_server:loop/7+0x260>
Heap size: 75113
Old-heap size: 514838
Mbuf size: 12306
Msgq len: 142 (inner=48, outer=94)
This happens when p->sig_qs.nmsigs.next contains a prepend signal. I did not spend enough time to fully understand how the prepend signal copies the message, so I cannot pinpoint the cause yet.
I'll have a look at the prepend signal issue, but I don't think I'll have time for it this week.
Describe the bug
Something (allegedly persistent_term:put/2) causes all normal schedulers to prevent RAM deallocs, which leads to BEAM OOM with an erl_crash.dump "Slogan: eheap_alloc: Cannot allocate 972288 bytes of memory (of type "heap")." All normal schedulers have sleep aux work "DD_THR_PRGR | THR_PRGR_LATER_OP | MISC_THR_PRGR" (plus some more of the usual flags for those active).

To Reproduce
This bug reproduces after several days of uptime. In the crash dump it's clearly visible that one process is returning from persistent_term:put/2 (current function is bif_return_trap). There are more processes in the update_queue (some waiting to execute put/2, and some to execute get/0).

Affected versions
OTP 24RC1. It is likely that the bug was planted after October 2020, as earlier builds (including beamasm) do not seem to exhibit this problem (or it is less likely to trigger).

Additional context
The core dump taken reveals some more details:
- The put being executed is persistent_term:put({atom, another}, 2) - so the key is a tuple of two atoms, the value is an immediate value (small int).
- table_updater (scheduled with erts_schedule_thr_prgr_later_op here: https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_bif_persistent.c#L433), so updater_process is set to that process (which has both "suspended" and "active" flags).
- Aux work Flags: delayed-dealloc delayed-dealloc-thr-prgr fix-alloc-dealloc fix-alloc-lower-lim later-op canceled-timers canceled-timers-thr-prgr async-ready misc set-tmo yield GARBAGE, with run queue flags Misc Flags: inactive non-empty exec.