foss-for-synopsys-dwc-arc-processors / linux

Helpful resources for users & developers of Linux kernel for ARC

Multicore HS58 based system hangs while running scftorture kernel module #162

Closed gmazurki closed 3 months ago

gmazurki commented 4 months ago

The scftorture kernel module causes CPU core(s) to hang while executing the torture test.

The module cannot be unloaded; attempting to do so causes the rmmod command to hang, and the console becomes unresponsive.

Usually it is not possible to execute any command on the console; it looks like the CPUs are busy running the torture test, so no CPU cycles are given to other processes.

I checked on a different platform to see what we should expect from the module: on a Raspberry Pi 4 it does not generate "BUG: soft lockup" messages and can be loaded/unloaded successfully many times.

We have a 3xHS58 system.

Note that the rcutorture and locktorture modules behave correctly and don't show any worrying messages; they can be loaded/unloaded at any time.

# modprobe scftorture verbose=1
scftorture: --- Start of test:  verbose=1 holdoff=0 longwait=0 nthreads=-1 onoff_holdoff=0 onoff_interval=0 shutdown_secs=0 stat_interval=60 stutter=5 use_cpus_read_lock=0, weight_resched=-1, weight_single=-1, weight_single_rpc=-1, weight_single_wait=-1, weight_many=-1, weight_many_wait=-1, weight_all=-1, weight_all_wait=-1
scftorture: !!! built as module, weight_resched ignored
scf_sel_dump:  25.000 smp_call_function_single(nowait)
scf_sel_dump:  25.000 smp_call_function_single_rpc(wait)
scf_sel_dump:  25.000 smp_call_function_single(wait)
scf_sel_dump:   8.333 smp_call_function_many(nowait)
scf_sel_dump:   8.333 smp_call_function_many(wait)
scf_sel_dump:   4.166 smp_call_function(nowait)
scf_sel_dump:   4.166 smp_call_function(wait)
scftorture-torture: Creating torture_stutter task
scftorture: Starting 3 smp_call_function() threads
scftorture-torture: torture_stutter task started
scftorture-torture: Creating scftorture_invoker task
scftorture-torture: Creating scftorture_invoker task
scftorture: scftorture_invoker 0: task started
scftorture: scftorture_invoker 0: Waiting for all SCF torturers from cpu 0
scftorture-torture: Creating scftorture_invoker task
scftorture: scftorture_invoker 1: task started
scftorture: scftorture_invoker 1: Waiting for all SCF torturers from cpu 1
scftorture: scftorture_invoker 1 started
scftorture: scftorture_invoker 0 started
scftorture-torture: Creating scf_torture_stats task
scftorture: scftorture_invoker 2: task started
-torture: scf_torture_stats task started
# scftorture: scftorture_invoker 2: Waiting for all SCF torturers from cpu 2

# scftorture: scftorture_invoker 2 started
scftorture:  scf_invoked_count ver: 15344127 resched: 0 single: 3464513/2968869 single_ofl: 0/0 single_rpc: 2967191 single_rpc_ofl: 0 many: 989424/990883 all: 495992/495478 ste: 0 stnmie: 0 stnmoe: 0 staf: 0

# scftorture:  scf_invoked_count ver: 30342926 resched: 0 single: 6851783/5869314 single_ofl: 0/0 single_rpc: 5868923 single_rpc_ofl: 0 many: 1956173/1959153 all: 980746/980381 ste: 0 stnmie: 0 stnmoe: 0 staf: 0
scftorture:  scf_invoked_count ver: 43989910 resched: 0 single: 9933706/8511279 single_ofl: 0/0 single_rpc: 8509524 single_rpc_ofl: 0 many: 2836833/2839962 all: 1420675/1420231 ste: 0 stnmie: 0 stnmoe: 0 staf: 0
watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [scftorture_invo:215]
Modules linked in: scftorture torture dwc3 dwc3_syna phy_syna_usb bcmdhd_sdio(O)
Path: (null)
CPU: 2 PID: 215 Comm: scftorture_invo Tainted: G           O      5.15.127 #13
Check Programmer's Manual
ECR: 0x823c0100 EFA: 0x00000000 ERET: 0x813092a8
STAT: 0x80080002 [IE K     ]   BTA: 0x00000000
 SP: 0x858b3f0c  FP: 0x00000000 BLK: smp_call_function_single+0xbe/0x19c
r00: 0x00000000 r01: 0x00000011 r02: 0x00000003
r03: 0x00000001 r04: 0x00000000 r05: 0x00000004
r06: 0xabe524f0 r07: 0x0000000f r08: 0x00000001
r09: 0xffffffe7 r10: 0x858b2000 r11: 0x00005400
r12: 0x813091dc
rcu: INFO: rcu_preempt self-detected stall on CPU
rcu:    2-....: (2098 ticks this GP) idle=0df/1/0x40000002 softirq=324/324 fqs=1043
        (t=2101 jiffies g=245 q=2)
Task dump for CPU 2:
task:scftorture_invo state:R  running task     stack:    0 pid:  215 ppid:     2 flags:0x00000008

Stack Trace:
  arc_unwind_core+0xf4/0x110
  rcu_dump_cpu_stacks+0xd2/0xfc
  rcu_sched_clock_irq+0x934/0xb5c
  update_process_times+0xae/0xd8
  tick_sched_handle+0x20/0x54
  tick_sched_timer+0x4e/0x98
  __hrtimer_run_queues+0x110/0x1c8
  hrtimer_interrupt+0xe2/0x250
  timer_irq_handler+0x2c/0x34
  __handle_irq_event_percpu+0x2c/0xe0
  handle_irq_event_percpu+0x16/0x4c
  handle_percpu_irq+0x6c/0x90
  handle_irq_desc+0x3a/0x68
  handle_domain_irq+0x76/0xb4
  handle_interrupt+0x60/0x64
watchdog: BUG: soft lockup - CPU#2 stuck for 52s! [scftorture_invo:215]
Modules linked in: scftorture torture dwc3 dwc3_syna phy_syna_usb bcmdhd_sdio(O)
Path: (null)
CPU: 2 PID: 215 Comm: scftorture_invo Tainted: G           O L    5.15.127 #13
Check Programmer's Manual
ECR: 0x823c0100 EFA: 0x00000000 ERET: 0x813092a8
STAT: 0x80080002 [IE K     ]   BTA: 0x00000000
 SP: 0x858b3f0c  FP: 0x00000000 BLK: smp_call_function_single+0xbe/0x19c
r00: 0x00000000 r01: 0x00000011 r02: 0x00000003
r03: 0x00000001 r04: 0x00000000 r05: 0x00000004
r06: 0xabe524f0 r07: 0x0000000f r08: 0x00000001
r09: 0xffffffe7 r10: 0x858b2000 r11: 0x00005400
r12: 0x813091dc
scftorture:  scf_invoked_count ver: 50146394 resched: 0 single: 11323555/9705070 single_ofl: 0/0 single_rpc: 9702356 single_rpc_ofl: 0 many: 3233962/3236172 all: 1619613/1617960 ste: 0 stnmie: 0 stnmoe: 0 staf: 0
watchdog: BUG: soft lockup - CPU#2 stuck for 78s! [scftorture_invo:215]
Modules linked in: scftorture torture dwc3 dwc3_syna phy_syna_usb bcmdhd_sdio(O)
Path: (null)
CPU: 2 PID: 215 Comm: scftorture_invo Tainted: G           O L    5.15.127 #13
Check Programmer's Manual
ECR: 0x823c0100 EFA: 0x00000000 ERET: 0x813092a8
STAT: 0x80080002 [IE K     ]   BTA: 0x00000000
 SP: 0x858b3f0c  FP: 0x00000000 BLK: smp_call_function_single+0xbe/0x19c
r00: 0x00000000 r01: 0x00000011 r02: 0x00000003
r03: 0x00000001 r04: 0x00000000 r05: 0x00000004
r06: 0xabe524f0 r07: 0x0000000f r08: 0x00000001
r09: 0xffffffe7 r10: 0x858b2000 r11: 0x00005400
r12: 0x813091dc
rcu: INFO: rcu_preempt self-detected stall on CPU
rcu:    2-....: (8399 ticks this GP) idle=0df/1/0x40000002 softirq=324/324 fqs=4165
        (t=8406 jiffies g=245 q=12)
Task dump for CPU 2:
task:scftorture_invo state:R  running task     stack:    0 pid:  215 ppid:     2 flags:0x00000008

Stack Trace:
  arc_unwind_core+0xf4/0x110
  rcu_dump_cpu_stacks+0xd2/0xfc
  rcu_sched_clock_irq+0x934/0xb5c
  update_process_times+0xae/0xd8
  tick_sched_handle+0x20/0x54
  tick_sched_timer+0x4e/0x98
  __hrtimer_run_queues+0x110/0x1c8
  hrtimer_interrupt+0xe2/0x250
  timer_irq_handler+0x2c/0x34
  __handle_irq_event_percpu+0x2c/0xe0
  handle_irq_event_percpu+0x16/0x4c
  handle_percpu_irq+0x6c/0x90
  handle_irq_desc+0x3a/0x68
  handle_domain_irq+0x76/0xb4
  handle_interrupt+0x60/0x64

I looked at some registers to check where the core is spinning; the location in the listing is address 813092a8:

813091dc <smp_call_function_single>:
813091dc:   1cf8 b48e               std.aw  r18r19,[sp,-8]
813091e0:   1cf8 b40e               std.aw  r16r17,[sp,-8]
813091e4:   1cf8 b38e               std.aw  r14r15,[sp,-8]
813091e8:   c0f1                    push_s  blink
813091ea:   2458 31dc               sub2    sp,sp,0x7
813091ee:   2440 33ce               add r14,sp,0xf
813091f2:   4310                    mov_s   r19,r0
813091f4:   710c                    mov_s   r0,1
813091f6:   1600 7184 8189 ec9c     ldd r4r5,[0x8189ec9c]
813091fe:   4230                    mov_s   r18,r1
81309200:   4150                    mov_s   r17,r2
81309202:   2684 1c3f               and r14,r14,-16
81309206:   4070                    mov_s   r16,r3
81309208:   1e00 1106               std r4r5,[r14]
8130920c:   1e02 111e               std.as  r4r5,[r14,2]
81309210:   0b86 d82f               bl.d    -326780 ;812b9594 <preempt_count_add>
81309214:   1e01 1459               st.as   17,[r14,1]
81309218:   2604 770f ffff e000     and r15,0xffffe000,sp
81309220:   8723                    ld_s    r1,[r15,0xc]
81309222:   2941 0140               lsr r0,r1,0x5
81309226:   2615 7000 81b0 04d4     add2    r0,0x81b004d4,r0
8130922e:   8000                    ld_s    r0,[r0,0]
81309230:   2011 8040               btst    r0,r1
81309234:   f216                    beq_s   44  ;81309260 <smp_call_function_single+0x84>
81309236:   206a 0280               lr  r0,[status32]
8130923a:   2842 06c0               asr r0,r0,0x1b
8130923e:   1600 7001 81b0 1f60     ld  r1,[0x81b01f60]
81309246:   2044 0400               and r0,r0,0x10
8130924a:   2005 807e               or.f    0,r0,r1
8130924e:   f40a                    bne_s   20  ;81309260 <smp_call_function_single+0x84>
81309250:   1600 7080 81af 6380     ldb r0,[0x81af6380]
81309258:   204c 8000               cmp r0,0
8130925c:   0100 0001               beq 256 ;8130935c <smp_call_function_single+0x180>
81309260:   1702 1604               ld.as   r4,[r15,2]
81309264:   8702                    ld_s    r0,[r15,0x8]
81309266:   2604 7104 00f0 0000     and r4,0xf00000,r4
8130926e:   2604 7000 000f 0000     and r0,0xf0000,r0
81309276:   8762                    ld_s    r3,[r15,0x8]
81309278:   2384 0004               and r3,r3,256
8130927c:   2405 0004               or  r4,r4,r0
81309280:   2405 80fe               or.f    0,r4,r3
81309284:   f458                    bne_s   176 ;81309334 <smp_call_function_single+0x158>
81309286:   0851 2010               breq    r16,0,80    ;813092d4 <smp_call_function_single+0xf8>
8130928a:   41c1                    mov_s   r1,r14
8130928c:   4062                    mov_s   r0,r19
8130928e:   1e02 1498               st.as   r18,[r14,2]
81309292:   0e76 ffef               bl.d    -396    ;81309104 <generic_exec_single>
81309296:   1e03 1458               st.as   r17,[r14,3]
8130929a:   8621                    ld_s    r1,[r14,0x4]                    <<<<<<<<<< smp_call_function_single+0xbe/0x19c
8130929c:   216d 803e               xbfu.f  0,r1,0
813092a0:   000e 0021               beq.d   14  ;813092ae <smp_call_function_single+0xd2>
813092a4:   4010                    mov_s   r16,r0
813092a6:   8621                    ld_s    r1,[r14,0x4]
813092a8:   216d 803e               xbfu.f  0,r1,0                                             <<<<<<<<<<<<<<<<<< current PC
813092ac:   f5fd                    bne_s   -6  ;813092a6 <smp_call_function_single+0xca>       <<<<<<<<<<<<<<<<< ERET   current PC
813092ae:   236f 107f               dmb 0x1
813092b2:   0fb2 d7ef               bl.d    -327760 ;812b9260 <preempt_count_sub>
813092b6:   710c                    mov_s   r0,1
813092b8:   8702                    ld_s    r0,[r15,0x8]
813092ba:   086f 0010               breq    r0,0,110    ;81309326 <smp_call_function_single+0x14a>
xxkent commented 3 months ago

Hi @gmazurki, this issue reproduces in nSIM; I'll see what the problem might be.

For loading/unloading modules, please check that CONFIG_MODULE_UNLOAD=y is set in the Linux config. I managed to load and unload scftorture several times before it froze.

gmazurki commented 3 months ago

Yes, the option CONFIG_MODULE_UNLOAD is set to y in my case.

xxkent commented 3 months ago

@gmazurki may I ask you to test this temporary fix on hardware?

In the function https://elixir.bootlin.com/linux/v5.16.20/source/kernel/time/timekeeping_internal.h#L30

please add an "if" check:

static inline u64 clocksource_delta(u64 now, u64 last, u64 mask)
{
    if (now < last)
        return 0;
    return (now - last) & mask;
}

Since I use nSIM and not hardware, I would like to check that we are seeing the same problem in both cases.

gmazurki commented 3 months ago

With the code change you proposed I see the same result - the hang messages plus the inability to unload the module. I think it would make sense to reproduce it on your FPGA with a multicore cluster.

xxkent commented 3 months ago

Understood. The symptoms of launching in nSIM are exactly the same as you described above, but then the root cause is different. OK, I'll try to get a HAPS setup. Thanks.

xxkent commented 3 months ago

Just in case: nSIM is able to run SMP configurations with cluster simulation. I use an HS58x3 config.

gmazurki commented 3 months ago

nSIM is able to run SMP configurations with cluster simulation

Is it possible with the free version of nSIM? I do have a multicore configuration (TCF), but only one core starts and the rest time out, so Linux runs on a single core only.

xxkent commented 3 months ago

For SMP configs you need to use nSIM plus the MetaWare debugger anyway; standalone nsimdrv can't run multicore configs. I have never used the free nSIM version. I would suggest raising this question by e-mail and adding me to the thread.

xxkent commented 3 months ago

Can you try this config change?

# CONFIG_ARC_HAS_LLSC is not set
CONFIG_ARC_HAS_ATLD=y

gmazurki commented 3 months ago

I changed the config as you suggested. In this configuration the scftorture module works as expected - no "hang" messages and the module can be loaded/unloaded.

However, there is a problem with booting - the kernel hangs in the middle of boot and continues only if a JTAG OpenOCD session is attached. Maybe it sends a resume command?

What does that prove? What is the valid configuration for HS58? LLSC or ATLD?

gmazurki commented 3 months ago

I tried once again with a clean build (I built everything from scratch). Now Linux boots properly without any of the strange behavior described previously. The scftorture module works fine as well.

I think we can close this issue, but it would be worth explaining why this change was needed. Neither option has a description in the kernel menuconfig.

EDIT:

I found relevant commits with a bit of explanation: https://github.com/foss-for-synopsys-dwc-arc-processors/linux/commit/da7891b5f2e42e85c5ded4a1588d4d0f2598dfd0 https://github.com/foss-for-synopsys-dwc-arc-processors/linux/commit/bc57bd4fa52b54bb61bd9040e01910341777da23

"LLSC could be not desirable because of llock/scond livelock issue"

xxkent commented 3 months ago

You can use ATLD only and disable LLSC; it is an absolutely valid configuration. The commits you found describe exactly our case. We will enable LLSC as soon as it is valid to use, but for now I will disable this option in the kernel configuration.

gmazurki commented 3 months ago

The issue with the hang during booting was most likely caused by my doing an incremental build, so probably not everything was rebuilt. With a clean build we no longer see this issue.

xxkent commented 3 months ago

@gmazurki Do you mean there is no issue after a clean build even with LLSC=y? In any case, ATLD=y with LLSC unset is the right config to use for now.

gmazurki commented 3 months ago

I was referring to the ATLD configuration. Initially we experienced the boot issue with ATLD turned on, but with a clean build it works OK, so we have switched to ATLD now.

EDIT: unfortunately the issue with Linux hanging during boot shows up occasionally and is build-specific, i.e. for one buildroot build it does show up (~20% repro rate), but for another build it does not happen at all. I'll raise another issue for this specific problem.

srusecki commented 1 month ago

I see a somewhat similar issue with the rcutorture module. After loading rcutorture and waiting a bit, a task hangs and the following message is printed:

# modprobe rcutorture
rcu-torture:--- Start of test: nreaders=2 nfakewriters=4 stat_interval=60 verbose=1 test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0 test_boost_interval=7 test_boost_duration=4 shutdown_secs=0 stall_cpu=0 stall_cpu_holdoff=10 stall_cpu_irqsoff=0 stall_cpu_block=0 n_barrier_cbs=0 onoff_interval=0 onoff_holdoff=0 read_exit_delay=13 read_exit_burst=16 nocbs_nthreads=0 nocbs_toggle=1000
...
rcu-torture: rcu_torture_read_exit: Start of episode
rcu-torture: rcu_torture_read_exit: End of episode
INFO: task torture_stutter:221 blocked for more than 10 seconds.
      Tainted: G           O      5.15.127 #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:torture_stutter state:D stack:    0 pid:  221 ppid:     2 flags:0x00000000

Stack Trace:
  __switch_to+0x0/0xa8
  __schedule+0x286/0x630
  schedule+0x50/0xf0
  schedule_hrtimeout_range_clock+0xd8/0x138
  torture_stutter+0x15a/0x1c0 [torture]
  kthread+0xe8/0x120
  ret_from_fork+0x14/0x18

This problem is not solved by using ATLD. I'm not sure whether this bug should be reopened or I should raise a separate issue.