Open gmazurki opened 7 months ago
I started the OpenOCD session and then restarted the board. It stopped booting at the “Advanced Linux Sound Architecture Driver Initialized." message. I opened OpenOCD console and typed few command:
> reg pc
pc (/32): 0x814e0c42
> targets
TargetName Type Endian TapName State
-- ------------------ ---------- ------ ------------------ ------------
0* arc-em.lpu0 arcv3 little arc-em.lpu0 running
1 arc-em.lpu1 arcv3 little arc-em.lpu1 running
2 arc-em.lpu2 arcv3 little arc-em.lpu2 running
> reg pc
pc (/32): 0x814e0c42
>
>
> reg pc
pc (/32): 0x814e0c42
> targets 1
use 'arc-em.lpu1' as target identifier, not '1'
> reg pc
pc (/32): 0x81aa5b82
>
The last command (reg pc
on the CPU 1) caused Linux to continue booting.
I was able to gather the CPU trace (smart trace). Interestingly, reading AUX regs needed to gather the trace didn’t cause the Linux to continue, unlike “reg pc“ command.
The last entry in the smart trace is:
entry 0:
SRC = 0x814E0C42
DST = 0x814E0C3E
FLAGS = 0x80000400
and corresponds to the following instructions from the vmlinux listing assembler code (function multi_cpu_stop
):
814e0c34: 272f 00bf clri r2
814e0c38: 236f 10ff dmb 0x3
814e0c3c: 4082 mov_s r0,r20
814e0c3e: 202f 048c ex r0,[r18] <<<< DST
814e0c42: 08ff 8050 breq r0,0x1,-2 ;814e0c3e <multi_cpu_stop+0xa2> <<<< SRC
814e0c46: 236f 10ff dmb 0x3
814e0c4a: 4103 mov_s r1,r24
814e0c4c: 212f 0430 atld.add r1,[r16]
FLAGS = 0x80000400 means that the bit repeat
(0x400) is set, so the CPU spins in the loop.
An ex
instruction is one of the “Atomic Memory Operations (AMOs) - 32 bit“.
Could it be a prove that this is some kind of a deadlock?
I'm attaching example Linux defconfig that is used in our tests.
I attach SmaRT trace for all cores, as read by the Metaware Debugger
Thanks Jakub,
Looks like the other 2 cores stopped execution somewhere in the function rcu_dynticks_inc:
81aa5b28 <rcu_dynticks_inc>:
81aa5b28: 1cf8 b38e std.aw r14r15,[sp,-8]
81aa5b2c: c0f1 push_s blink
81aa5b2e: 46cb 8143 c680 mov_s r14,0x8143c680
81aa5b34: 0be6 0020 bl.d 996 ;81aa5f18 <debug_smp_processor_id>
81aa5b38: 4708 mov_s r15,r0
81aa5b3a: 41c3 81d3 eb84 mov_s r1,0x81d3eb84
81aa5b40: 2655 194e add2 r14,r14,0x25
81aa5b44: 21f0 0000 ld.as r0,[r1,r0]
81aa5b48: 661e add_s r14,r14,r0
81aa5b4a: 236f 10ff dmb 0x3
81aa5b4e: 272f 00ff clri r3
81aa5b52: 236f 10ff dmb 0x3
81aa5b56: 42c3 81d1 609c mov_s r2,0x81d1609c
81aa5b5c: 710c mov_s r0,1
81aa5b5e: 202f 008c ex r0,[r2]
81aa5b62: 08ff 8050 breq r0,0x1,-2 ;81aa5b5e <rcu_dynticks_inc+0x36> <<<<<<<<<<<<<<<<< core2
81aa5b66: 236f 10ff dmb 0x3
81aa5b6a: 40e1 mov_s r0,r15
81aa5b6c: 202f 03b0 atld.add r0,[r14]
81aa5b70: 2000 03c0 add r0,r0,r15
81aa5b74: 236f 10ff dmb 0x3
81aa5b78: 702c mov_s r1,0
81aa5b7a: 212f 008c ex r1,[r2]
81aa5b7e: 236f 10ff dmb 0x3
81aa5b82: 262f 00ff seti r3 <<<<<<<<<<<<<<<<< core1
81aa5b86: 236f 10ff dmb 0x3
81aa5b8a: c0d1 pop_s blink
81aa5b8c: 7fe0 j_s.d [blink]
81aa5b8e: 1408 358e ldd.ab r14r15,[sp,8]
81aa5b92: 78e0 nop_s
I added defconfig options for additional checking of the locking mechanisms:
CONFIG_TEST_LOCKUP=m
CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RAW_LOCK_NESTING=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
The boot log: linux_boot_log_selftest.txt
The boot stalls after Advanced Linux Sound Architecture Driver Initialized.
message.
The tests seem to pass: Good, all 391 testcases passed! |
This part shows some errors:
unwinder looping too long, aborting !
unwinder looping too long, aborting !
unwinder looping too long, aborting !
unwinder looping too long, aborting !
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.127 #10
Stack Trace:
arc_unwind_core+0xf4/0x110
dump_stack_lvl+0x4c/0x74
register_lock_class+0x7f0/0x7f4
__lock_acquire+0x68/0x2ae8
lock_acquire+0xfc/0x34c
_raw_spin_lock_irqsave+0x36/0x70
complete+0x16/0x40
rest_init+0x118/0x21c
0x8158ffa4
Another symptom of the issue is that with ATLD the rcutorture module shows "process hang" warning. With previous LLSC option the rcutorture was working fine.
# modprobe rcutorture
rcu-torture:--- Start of test: nreaders=2 nfakewriters=4 stat_interval=60 verbose=1 test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0 test_boost_interval=7 test_boost_duration=4 shutdown_secs=0 stall_cpu=0 stall_cpu_holdoff=10 stall_cpu_irqsoff=0 stall_cpu_block=0 n_barrier_cbs=0 onoff_interval=0 onoff_holdoff=0 read_exit_delay=13 read_exit_burst=16 nocbs_nthreads=0 nocbs_toggle=1000
rcu: Start-test grace-period state: g26265 f0x0
rcu_torture_write_types: Testing conditional GPs.
rcu_torture_write_types: Testing expedited GPs.
rcu_torture_write_types: Testing asynchronous GPs.
rcu_torture_write_types: Testing polling GPs.
rcu_torture_write_types: Testing normal GPs.
rcu-torture: Creating rcu_torture_writer task
rcu-torture: Creating rcu_torture_fakewriter task
rcu-torture: rcu_torture_writer task started
rcu-torture: Creating rcu_torture_fakewriter task
rcu-torture: rcu_torture_fakewriter task started
rcu-torture: Creating rcu_torture_fakewriter task
rcu-torture: rcu_torture_fakewriter task started
rcu-torture: Creating rcu_torture_fakewriter task
rcu-torture: rcu_torture_fakewriter task started
rcu-torture: Creating rcu_torture_reader task
rcu-torture: rcu_torture_fakewriter task started
rcu-torture: Creating rcu_torture_reader task
rcu-torture: rcu_torture_reader task started
rcu-torture: Creating rcu_torture_stats task
rcu-torture: rcu_torture_reader task started
rcu-torture: Creating torture_shuffle task
rcu-torture: rcu_torture_stats task started
rcu-torture: Creating torture_stutter task
rcu-torture: torture_shuffle task started
rcu-torture: torture_stutter task started
rcu-torture: Creating rcu_torture_fwd_prog task
rcu-torture: Creating rcu_torture_read_exit task
rcu-torture: rcu_torture_fwd_progress task started
rcu-torture: rcu_torture_read_exit: Start of test
#
#
#
# rcu-torture: rcu_torture_read_exit: End of episode
rcu-torture: rcu_torture_read_exit: Start of episode
rcu-torture: rcu_torture_read_exit: End of episode
INFO: task torture_stutter:2207 blocked for more than 10 seconds.
Tainted: G O 5.15.127 #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:torture_stutter state:D stack: 0 pid: 2207 ppid: 2 flags:0x00000000
Stack Trace:
__switch_to+0x0/0xa8
__schedule+0x2c8/0x6c4
schedule+0x50/0xf0
schedule_hrtimeout_range_clock+0xd8/0x138
torture_stutter+0x15a/0x1c0 [torture]
kthread+0x10e/0x144
ret_from_fork+0x14/0x18
rcu-torture: rcu_torture_read_exit: Start of episode
rcu-torture: rcu_torture_read_exit: End of episode
INFO: task torture_stutter:2207 blocked for more than 10 seconds.
Tainted: G O 5.15.127 #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:torture_stutter state:D stack: 0 pid: 2207 ppid: 2 flags:0x00000000
Stack Trace:
__switch_to+0x0/0xa8
__schedule+0x2c8/0x6c4
schedule+0x50/0xf0
schedule_hrtimeout_range_clock+0xd8/0x138
torture_stutter+0x15a/0x1c0 [torture]
kthread+0x10e/0x144
ret_from_fork+0x14/0x18
rcu-torture: rcu_torture_read_exit: Start of episode
rcu-torture: rcu_torture_read_exit: End of episode
#
#
# rcu-torture: rtc: 36277d34 ver: 705 tfle: 0 rta: 706 rtaf: 0 rtf: 694 rtmbe: 0 rtmbkf: 0/537 rtbe: 0 rtbke: 0 rtbre: 0 rtbf: 0 rtb: 0 nt: 3520 barrier: 0/0:0 read-exits: 67 nocb-toggles: 0:0
rcu-torture: Reader Pipe: 16053121 784 0 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch: 16051820 2085 0 0 0 0 0 0 0 0 0
rcu-torture: Free-Block Circulation: 705 704 703 700 699 698 697 696 695 694 0
rcu-torture: rcu_torture_read_exit: Start of episode
rcu-torture: rcu_torture_read_exit: End of episode
#
#
#
#
#
# INFO: task torture_stutter:2207 blocked for more than 10 seconds.
Tainted: G O 5.15.127 #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:torture_stutter state:D stack: 0 pid: 2207 ppid: 2 flags:0x00000000
Stack Trace:
__switch_to+0x0/0xa8
__schedule+0x2c8/0x6c4
schedule+0x50/0xf0
schedule_hrtimeout_range_clock+0xd8/0x138
torture_stutter+0x15a/0x1c0 [torture]
kthread+0x10e/0x144
ret_from_fork+0x14/0x18
#
# rcu-torture: rcu_torture_read_exit: Start of episode
rcu-torture: rcu_torture_read_exit: End of episode
#
#
#
# rcu_torture_fwd_prog_nr: Duration 1431 cver 240 gps 455
rcu_torture_fwd_prog_cr Duration 12 barrier: 6 pending 37078 n_launders: 87940 n_launders_sa: 8925 n_max_gps: 100 n_max_cbs: 64339 cver 0 gps 9
rcu_torture_fwd_cb_hist: Callback-invocation histogram (duration 19 jiffies): 1s/10: 79016:6 2s/10: 73263:4
rcu-torture: rcu_torture_read_exit: Start of episode
rcu-torture: rcu_torture_read_exit: End of episode
INFO: task torture_stutter:2207 blocked for more than 10 seconds.
Tainted: G O 5.15.127 #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:torture_stutter state:D stack: 0 pid: 2207 ppid: 2 flags:0x00000000
Stack Trace:
__switch_to+0x0/0xa8
__schedule+0x2c8/0x6c4
schedule+0x50/0xf0
schedule_hrtimeout_range_clock+0xd8/0x138
torture_stutter+0x15a/0x1c0 [torture]
kthread+0x10e/0x144
ret_from_fork+0x14/0x18
rcu-torture: rcu_torture_read_exit: Start of episode
rcu-torture: rcu_torture_read_exit: End of episode
INFO: task torture_stutter:2207 blocked for more than 20 seconds.
Tainted: G O 5.15.127 #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:torture_stutter state:D stack: 0 pid: 2207 ppid: 2 flags:0x00000000
Stack Trace:
__switch_to+0x0/0xa8
__schedule+0x2c8/0x6c4
schedule+0x50/0xf0
schedule_hrtimeout_range_clock+0xd8/0x138
torture_stutter+0x15a/0x1c0 [torture]
kthread+0x10e/0x144
ret_from_fork+0x14/0x18
rcu-torture: rcu_torture_read_exit: Start of episode
rcu-torture: rcu_torture_read_exit: End of episode
#
#
#
#
#
#
#
# rmmod rcutorture
rcu-torture: Stopping torture_shuffle task
rcu-torture: Stopping rcu_torture_reader
rcu-torture: Stopping rcu_torture_reader
rcu-torture: Stopping torture_shuffle
rcu-torture: Stopping torture_stutter task
rcu-torture: Stopping torture_stutter
rcu: rcu_preempt: wait state: RCU_GP_WAIT_FQS(5) ->state: 0x402 ->rt_priority 0 delta ->gp_start 1 ->gp_activity 1 ->gp_req_activity 1 ->gp_wake_time 1 ->gp_wake_seq 39469 ->gp_seq 39473 ->gp_seq_needed 39480 ->gp_max 33 ->gp_flags 0x0
rcu-torture: Stopping rcu_torture_fakewriter
rcu: rcu_node 0:2 ->gp_seq 39473 ->gp_seq_needed 39480 ->qsmask 0x0 .... ->n_boosts 0
rcu: cpu 0 ->gp_seq_needed 39480
rcu: cpu 1 ->gp_seq_needed 39480
rcu: cpu 2 ->gp_seq_needed 39480
rcu: RCU callbacks invoked since boot: 278378
rcu_tasks: RTGS_WAIT_CBS(11) since 131868 g:0 i:0/0 k.
rcu_tasks_rude: RTGS_WAIT_CBS(11) since 131868 g:0 i:0/0 k.
rcu_tasks_trace: RTGS_WAIT_CBS(11) since 131868 g:0 i:0/0 k. N0 h:0/0/0
rcu-torture: Stopping rcu_torture_fakewriter
rcu-torture: Stopping rcu_torture_writer
rcu-torture: Stopping rcu_torture_fakewriter
rcu-torture: rtc: 00000000 VER: 1631 tfle: 0 rta: 1631 rtaf: 0 rtf: 1621 rtmbe: 0 rtmbkf: 0/1175 rtbe: 0 rtbke: 0 rtbre: 0 rtbf: 0 rtb: 0 nt: 7924 barrier: 0/0:0 read-exits: 152 nocb-toggles: 0:0
rcu-torture: Reader Pipe: 34942090 1744 0 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch: 34939104 4728 0 0 0 0 0 0 0 0 0
rcu-torture: Free-Block Circulation: 1630 1630 1629 1628 1627 1625 1624 1623 1622 1621 0
rcu-torture: Stopping rcu_torture_stats
rcu-torture: Stopping rcu_torture_fakewriter
rcu-torture: Stopping rcu_torture_read_exit
rcu-torture: Stopping rcutorture_read_exit task
rcu-torture: Stopping rcu_torture_fwd_prog task
rcu_torture_fwd_prog: tested 1 tested_tries 2
rcu-torture: Stopping rcu_torture_fwd_prog
rcu-torture: Stopping rcu_torture_writer task
rcu-torture: Stopping rcu_torture_reader task
rcu-torture: Stopping rcu_torture_reader task
rcu-torture: Stopping rcu_torture_fakewriter task
rcu-torture: Stopping rcu_torture_fakewriter task
rcu-torture: Stopping rcu_torture_fakewriter task
rcu-torture: Stopping rcu_torture_fakewriter task
rcu: End-test grace-period state: g39505 f0x0 total-gps=3310
rcu-torture: Stopping rcu_torture_stats task
mem_dump_obj() slab test: rcu_torture_stats = 00000000, &rhp = 86707f08, rhp = 880567f8, &z = ac1d9580
mem_dump_obj(ZERO_SIZE_PTR): zero-size pointer
mem_dump_obj(NULL): NULL pointer
mem_dump_obj(86707f08): non-slab/vmalloc memory
mem_dump_obj(880567f8): slab rcuscale start 880567f8 pointer offset 0
mem_dump_obj(880567fc): slab rcuscale start 880567f8 pointer offset 4
mem_dump_obj(ac1d9580): 11-page vmalloc region starting at 0xac1d1000 allocated at load_module+0xa1e/0x1ff8
mem_dump_obj() kmalloc test: rcu_torture_stats = 00000000, &rhp = 86707f08, rhp = 87533680
mem_dump_obj(kmalloc 87533680): slab kmalloc-128 start 87533680 pointer offset 0 size 128
mem_dump_obj(kmalloc 87533684): slab kmalloc-128 start 87533680 pointer offset 4 size 128
mem_dump_obj() vmalloc test: rcu_torture_stats = 00000000, &rhp = 86707f08, rhp = ac02c000
mem_dump_obj(vmalloc ac02c000): 1-page vmalloc region starting at 0xac02c000 allocated at rcu_torture_cleanup+0x49a/0x6b0 [rcutorture]
mem_dump_obj(vmalloc ac02c004): 1-page vmalloc region starting at 0xac02c000 allocated at rcu_torture_cleanup+0x49a/0x6b0 [rcutorture]
rcu-torture: rtc: 00000000 VER: 1631 tfle: 0 rta: 1631 rtaf: 0 rtf: 1621 rtmbe: 0 rtmbkf: 0/1175 rtbe: 0 rtbke: 0 rtbre: 0 rtbf: 0 rtb: 0 nt: 7924 barrier: 0/0:0 read-exits: 152 nocb-toggles: 0:0
rcu-torture: Reader Pipe: 34942090 1744 0 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch: 34939104 4728 0 0 0 0 0 0 0 0 0
rcu-torture: Free-Block Circulation: 1630 1630 1629 1628 1627 1625 1624 1623 1622 1621 0
rcu-torture:--- End of test: SUCCESS: nreaders=2 nfakewriters=4 stat_interval=60 verbose=1 test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0 test_boost_interval=7 test_boost_duration=4 shutdown_secs=0 stall_cpu=0 stall_cpu_holdoff=10 stall_cpu_irqsoff=0 stall_cpu_block=0 n_barrier_cbs=0 onoff_interval=0 onoff_holdoff=0 read_exit_delay=13 read_exit_burst=16 nocbs_nthreads=0 nocbs_toggle=1000
#
#
#
I tried reproducing on nSim (multicore), but no luck so far.
I'm not sure how the mdb was able to read the smart trace data from all the cores, because from my observation if I read (using OpenOCD) SMART_CONTROL or SMART_DATA registers from core 1 or 2 the Linux immediately resumes booting. It does not have any effect only with core 0. So using OpenOCD I can grab only smart trace from core 0.
Reading status32 register on any core does not resume the Linux boot.
> reg status32
status32 (/32): 0x00080002
> targets 1
use 'arc-em.lpu1' as target identifier, not '1'
> reg status32
status32 (/32): 0x00080002
> targets 2
use 'arc-em.lpu2' as target identifier, not '2'
> reg status32
status32 (/32): 0x00080002
> targets 0
use 'arc-em.lpu0' as target identifier, not '0'
> reg pc
pc (/32): 0x814e0c42
> targets 2
use 'arc-em.lpu2' as target identifier, not '2'
> reg pc
pc (/32): 0x81aa5b82
>
After the last command on core 2 the CPUs were resumed.
Reading AUX regs 0x4, 0x6 (IDENTITY, PC) even on core 0 resumes the CPUs. Reading AUX reg 0xA (STATUS32) on any core does not have any effect - the cluster is stalled.
My suspicion is that OpenOCD in case of some commands triggers another read or write, which has the effect described above.
There is the hypothesis that it could be a lockup in the CPU pipeline and JTAG read/write fills/flushes the pipeline and unlocks the CPUs. This hypothesis is still valid.
@abrodkin @xxkent are you able to reproduce it on your side on your FPGA system?
I ran Linux on HAPS and in nSIM with options you provided: CONFIG_TEST_LOCKUP=m CONFIG_PROVE_LOCKING=y CONFIG_PROVE_RAW_LOCK_NESTING=y CONFIG_LOCK_STAT=y CONFIG_DEBUG_LOCKDEP=y CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
In both cases there is "Stack Trace message" during boot, this means that this is SW issue:
Stack Trace:
arc_unwind_core+0xf4/0x110
dump_stack_lvl+0x4e/0x78
register_lock_class+0x630/0x634
__lock_acquire+0x68/0x1bfc
lock_acquire+0xee/0x32c
_raw_spin_lock_irqsave+0x40/0x78
complete+0x16/0x40
rest_init+0x118/0x21c
0x81298eac
Running RCU-tasks wait API self tests
Linux_boot_HAPS.txt Linux_boot_nSIM.txt
It is quite strange that the deadlock test runs on only core_0 while core_1 and core_2 are in the boot loop. Thus, core_1 and core_2 don't participate in this test.
I will conduct this test with your hs58_defconfig.txt as the next step.
The issue (Linux occasionally does not boot with ATLD implementation of atomics) is not reproduced on HAPS. I see some other issue with boot, but it is most likely SW because it exists in nSIM too.
Thanks for checking this. How many resets have you performed? In our case repro rate is not 100%, so to be sure you will need to perform tens of resets. Have you enabled similar defconfig options as per the defconfig I attached? I can do the opposite - to cut down the defconfig and see if there is any change.
I took exactly config you provided (hs58_defconfig.txt) and changed only CONFIG_LINUX_LINK_BASE=0x0, CONFIG_LINUX_RAM_BASE=0x0, CONFIG_LINUX_MAP_SIZE=0x60000000 and CONFIG_ARC_BUILTIN_DTB_NAME=haps_hs5x_idu.
I did about 6 attempts on HAPS. Ok, then I'll be running this throughout the day to get more attempts and if anything changes, I'll let you know.
A suggestion from Alexey was to check if with ATLD selected in menuconfig there are still ‘llock’ and ‘scond’ instructions in the listing. The result is no, there are none - grep'ing for these finds nothing
>grep --ignore-case llock vmlinux.lst
>grep --ignore-case scond vmlinux.lst
>
This issue is not reproducible with single-core SMP Linux.
I attach SmaRT trace for all cores, as read by the Metaware Debugger
Could you provide call stacks for each core from MDB?
Try this patch also please: https://github.com/foss-for-synopsys-dwc-arc-processors/linux/commit/28e6344920c628041a9447472dbf1ce3c2c5834f
Hi @xxkent, I was testing issue https://github.com/foss-for-synopsys-dwc-arc-processors/linux/issues/168 with the https://github.com/foss-for-synopsys-dwc-arc-processors/linux/commit/28e6344920c628041a9447472dbf1ce3c2c5834f patch applied but the sshd was freezing with the following error:
INFO: task sshd:208 blocked for more than 10 seconds.
Tainted: G O 5.15.127 #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:sshd state:D stack: 0 pid: 208 ppid: 206 flags:0x00000001
Stack Trace:
__switch_to+0x0/0xa8
__schedule+0x272/0x608
schedule+0x50/0xf0
schedule_timeout+0x88/0xb0
ldsem_down_read+0x128/0x238
tty_ldisc_ref_wait+0x18/0x38
tty_poll+0x2e/0xb4
do_sys_poll+0x1f2/0x454
sys_ppoll+0x76/0xb0
EV_Trap+0xb8/0xbc
However, this problem can be reproduced on ssh running on localhost. Steps:
ssh localhost
,uname
,You can try reproducing it on your side, maybe it will help in solving the bugs.
The good news is that with patch https://github.com/foss-for-synopsys-dwc-arc-processors/linux/commit/28e6344920c628041a9447472dbf1ce3c2c5834f I no longer can see boot-time freeze, tried over 2k times.
Could you provide call stacks for each core from MDB?
Yes, here new capture of SmaRT + call stacks for each core: callstack-core0.txt callstack-core1.txt callstack-core2.txt smart-core0.txt smart-core1.txt smart-core2.txt
This time I reproduced this by loading and starting vmlinux from MDB, result was the same.
Hi @jzbydniewski!
Thank you for this output it is very valuable. For clearer picture can you gather some additional information? Please provide a SmaRT trace and registry dump of all cores for a few additional runs to ensure we are in the same state every time we hang.
p.s. unfortunately commit https://github.com/foss-for-synopsys-dwc-arc-processors/linux/commit/28e6344920c628041a9447472dbf1ce3c2c5834f can't be used as a final solution of the described issue, additional changes may be required due to https://github.com/foss-for-synopsys-dwc-arc-processors/linux/issues/167#issuecomment-2098344165 and moreover it may hide a real problem with CPU.
Hi @jzbydniewski!
Thank you for this output it is very valuable. For clearer picture can you gather some additional information? Please provide a SmaRT trace and registry dump of all cores for a few additional runs to ensure we are in the same state every time we hang.
Logs for next 5 reproductions: logs.zip
I did two experiments with enabled hw counters, as below:
hwc count cond=iall cond=crun cond=illock cond=iscond cond=imematomic cond=dbgflush cond=bstall cond=always
1st experiment: issue reproduce, I stopped cores in mdb after <1s:
Core 0:
Core 1:
Core 2:
2nd experiment: issue reproduce, I stopped cores in mdb after ~10s:
Core 0
Core 1:
Core 2:
Regards, Jakub
Should we also force ATLD setting (I don't see that option in the version I'm currently using) in buildroot menuconfig, not only linux menuconfig ?
I can see that resuming cores with use of MDB works (and Linux continues executing) only if I do not pass param
-off=flush_dcache
With this param passed I can see cores are still stuck (!) after I resume.
Looking at data caches and memory (still with -off=flush_dcache option set), I can see that address 0x81d6609c (const across reproductions with the same vmlinux), that seems to be used with atomic exchange "ex" instruction in core 1 and core 2, is 0x0 in memory, but 0x01 in core 2 data cache. Could this be causing this issue ?
cache contents: dcache-core1.txt dcache-core2.txt dcache-core3.txt l2-cache.txt
SmaRT: smart-core1.txt smart-core2.txt smart-core3.txt
regs: regs-core1.txt regs-core2.txt regs-core3.txt
I did a simple experiment in that state, simply changed the value at 0x81d6609c to 1 and resumed cores (so kind of manual flush) - as a result Linux booted to the console.
Conduct please the following experiment:
To patch automatically you can add theses commands to your commands.txt file: evalq ((int )0xAAAAAAAA)=0x91111111 evalq ((int )0xBBBBBBBB)=0xA2222222
As the next step, if it won't help, you can also do the same for multi_cpu_stop() and _raw_spin_unlock_irqrestore().
@xxkent I tried this for the rcu_dynticks_inc() at the first place,
evalq *((int *)0x81af5b5e)=0x808c202f
evalq *((int *)0x81af5b7a)=0x808c212f
As a result I am getting crash
smp: Bringing up secondary CPUs ...
Idle Task [1] (ptrval)
Trying to bring up CPU1 ...
IDENTITY : ARCVER [0x60] ARCNUM [0x1] CHIPID [ 0x0]
processor [1] : HS58 (ARC32 ARCv3 ISA) Dual-Issue
ISA Extn : atomic (not used) atomic2 ll128 (not used) unalign mpy[opt 9] div64
MMU [v10] : MMU32 hwalk 3 levels, 4k PAGE, JTLB 1024 uD/I 8/4
tlb_flush_mm Incr ASID
I-Cache : 64K, 4way/set, 64B Line, VIPT aliasing
D-Cache : 64K, 2way/set, 64B Line, PIPT
HW PF : RD 2 WR 2 OUTS 2 AG 1 (disabled)
L2 : 768K, 64B Line
Timers : Timer0 Timer1 RTC [UP 64-bit] GFRC [SMP 64-bit]
Vector Table : 0x8050000
Extn [SMP] : ARConnect (v3): 3 cores with IPI IDU DEBUG GFRC
archs-intc : 3 priority levels (default 1)
## CPU1 LIVE ##: Executing Code...
Invalid Mem Access
Invalid Mem Access
Path: (null)
Path: (null)
Idle Task [2] (ptrval)
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.127 #7
Trying to bring up CPU2 ...
Bus Error from Data Mem
IDENTITY : ARCVER [0x60] ARCNUM [0x2] CHIPID [ 0x0]
processor [2] : HS58 (ARC32 ARCv3 ISA) Dual-Issue
ISA Extn : atomic (not used) atomic2 ll128 (not used) unalign mpy[opt 9] div64
MMU [v10] : MMU32 hwalk 3 levels, 4k PAGE, JTLB 1024 uD/I 8/4
tlb_flush_mm Incr ASID
I-Cache : 64K, 4way/set, 64B Line, VIPT aliasing
D-Cache : 64K, 2way/set, 64B Line, PIPT
HW PF : RD 2 WR 2 OUTS 2 AG 1 (disabled)
L2 : 768K, 64B Line
Timers : Timer0 Timer1 RTC [UP 64-bit] GFRC [SMP 64-bit]
Vector Table : 0x8050000
Extn [SMP] : ARConnect (v3): 3 cores with IPI IDU DEBUG GFRC
archs-intc : 3 priority levels (default 1)
ECR: 0x00011000 EFA: 0x81d6609c ERET: 0x81af5b62
STAT: 0x00080002
## CPU2 LIVE ##: Executing Code...
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.127 #7
Bus Error from Data Mem
ECR: 0x00011000 EFA: 0x81d6609c ERET: 0x81af5b62
STAT: 0x00080002 [ K ] BTA: 0x81af5b3a
SP: 0x81cb5fb4 FP: 0x00000000 BLK: rcu_dynticks_inc+0x12/0x6c
r00: 0x00000001 r01: 0x81d8eb84 r02: 0x81d6609c
r03: 0x00000021 r04: 0x8e778000 r05: 0x00000006
r06: 0x00000008 r07: 0x98968000 r08: 0xffffffff
r09: 0xfffffff7 r10: 0x00000000 r11: 0x00000004
r12: 0xab815588
Stack Trace:
rcu_dynticks_inc+0x3a/0x6c
default_idle_call+0x1a/0x6c
do_idle+0x80/0xfc
cpu_startup_entry+0x14/0x18
start_kernel+0x4a4/0x4c0
Invalid Mem Access
Path: (null)
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.127 #7
Bus Error from Data Mem
ECR: 0x00011000 EFA: 0x81d6609c ERET: 0x81af5b62
STAT: 0x00080002 [ K ] BTA: 0x81af5b3a
SP: 0x81cb5fb4 FP: 0x00000000 BLK: rcu_dynticks_inc+0x12/0x6c
r00: 0x00000001 r01: 0x81d8eb84 r02: 0x81d6609c
r03: 0x00000021 r04: 0x8e778000 r05: 0x00000006
r06: 0x00000008 r07: 0x98968000 r08: 0xffffffff
r09: 0xfffffff7 r10: 0x00000000 r11: 0x00000004
r12: 0xab815588
Stack Trace:
rcu_dynticks_inc+0x3a/0x6c
default_idle_call+0x1a/0x6c
do_idle+0x80/0xfc
cpu_startup_entry+0x14/0x18
start_kernel+0x4a4/0x4c0
[ K ] BTA: 0x81af5b3a
SP: 0x82065fd8 FP: 0x00000000 BLK: rcu_dynticks_inc+0x12/0x6c
r00: 0x00000001 r01: 0x81d8eb84 r02: 0x81d6609c
r03: 0x00000021 r04: 0x270e0000 r05: 0x00000007
r06: 0x00000008 r07: 0x98968000 r08: 0xffffffff
r09: 0xfffffff7 r10: 0x00000000 r11: 0x00000004
r12: 0x81affe98
Stack Trace:
rcu_dynticks_inc+0x3a/0x6c
default_idle_call+0x1a/0x6c
do_idle+0x80/0xfc
cpu_startup_entry+0x14/0x18
first_lines_of_secondary+0x0/0x38
Invalid Mem Access
Path: (null)
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.15.127 #7
Bus Error from Data Mem
ECR: 0x00011000 EFA: 0x81d6609c ERET: 0x81af5b62
STAT: 0x00080002 [ K ] BTA: 0x81af5b3a
SP: 0x82067fd8 FP: 0x00000000 BLK: rcu_dynticks_inc+0x12/0x6c
r00: 0x00000001 r01: 0x81d8eb84 r02: 0x81d6609c
r03: 0x00000021 r04: 0xbfa48000 r05: 0x00000007
r06: 0x00000008 r07: 0x98968000 r08: 0xffffffff
r09: 0xfffffff7 r10: 0x00000000 r11: 0x00000004
r12: 0x81affe98
Stack Trace:
rcu_dynticks_inc+0x3a/0x6c
default_idle_call+0x1a/0x6c
do_idle+0x80/0xfc
cpu_startup_entry+0x14/0x18
first_lines_of_secondary+0x0/0x38
Thanks @jzbydniewski. This is quite unexpected for me that access to a global variable with .di gets "BUS FAULT". Let me to discuss with colleagues.
Is also interesting, that when I tried to apply the same for other function multi_cpu_stop
I got unexpected result in dissasembly
evalq *((int *)0x81530c3e)=0xa02f048c
while memory window shows expected value..
This behavior is predictable since the BUS don't support atomic operation, means that we can't bypass this issue such way using .di.
evalq ((int )0x81530c3e)=0xa02f048c
You swapped 0,1 bytes with 2,3. Disassembly shows bytes 0 and 1 in memory as high bytes and 2 and 3 as low bytes.
It looks that with CONFIG_ARC_HAS_LL64=n this issue is not showing up, but that might be just due to different timing I guess.
After switching to ATLD atomics we see occasional hangs while the Linux boots. The last message that is seen on the console is:
Advanced Linux Sound Architecture Driver Initialized.
The behavior was first described here: https://github.com/foss-for-synopsys-dwc-arc-processors/linux/issues/162
If I attach OpenOCD JTAG to the LPU the Linux continues booting.
For 2177 resets the issue happened 626 times, so repro rate is quite high 28%.
This seems to be buildroot (Linux) build specific, as this is quite easily reproducible with one buildroot build, but didn’t happen with another one (with changes not related to this issue).